
Sorry, I didn't explain myself very well there. I have an array of proteins that are digested with one enzyme, which the user selects. The digestion gives me an array of peptides, and I want to send that array to a subroutine for printing, where the output looks like this:

>Protein 1 Peptide 1
DAAAAATTLTTTAMTTTTTTCK
>Protein 1 Peptide 2
MMFRPPPPPGGGGGGGGGGGG
>Protein 2 Peptide 1
ALTAMCMNVWEITYHK

And so on... So in order to format the printing like that, I need to track which peptide belongs to each protein. Or am I making things more complicated than necessary? Thx

Re^5: Bioinformatics: Regex loop, no output
by AnomalousMonk (Archbishop) on Nov 16, 2015 at 16:11 UTC

    Again, GrandFather's code above already seems to print the information in essentially the way you want, except the formatting is different. So change the print formatting. Is this what you need help with?

    On the other hand, you may mean that you want the peptides encapsulated into an independent data structure that you can pass around to any function at will. Here's an adaptation of GrandFather's code to produce a data structure associating proteins with their split peptides:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Data::Dump qw(dd);
     ;;
     my @proteins = qw(
       DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
       ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
       DAAAAATTLTTTAMTTTTTTCK
       XXXXXXX
       );
     ;;
     my %protein_peptides;
     ;;
     for my $protein (@proteins) {
       my @peptides = split /(?<=[KR])(?!P)/, $protein;
     ;;
       next if @peptides < 2;
     ;;
       push @{ $protein_peptides{$protein} }, \@peptides
       }
     ;;
     dd \%protein_peptides;
    "
    {
      ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD => [
        ["ALTAMCMNVWEITYHK", "GSDVNR", "R", "ASFAQPPPQPPPPLLAIKPASDASD"],
      ],
      DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG => [
        ["DAAAAATTLTTTAMTTTTTTCK", "MMFRPPPPPGGGGGGGGGGGG"]
      ],
    }

    I have reformatted the native output of Data::Dump::dd() as it appeared on my monitor to make it more readable. (Update: I like Data::Dump as my dumper, but you may prefer Data::Dumper, which is core.)
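
    For anyone who would rather stay with core modules, here is a minimal sketch of dumping the same kind of structure with Data::Dumper. The hash is hard-coded purely for illustration, and the Sortkeys and Indent settings are just one reasonable choice, not a requirement.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sort hash keys and use a lighter indent style so the dump is stable
# from run to run and easier to read.
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;

# Illustrative data only: same shape as the structure built above,
# protein => [ [ peptide, peptide, ... ] ]
my %protein_peptides = (
    DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG => [
        [ 'DAAAAATTLTTTAMTTTTTTCK', 'MMFRPPPPPGGGGGGGGGGGG' ],
    ],
);

print Dumper(\%protein_peptides);
```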

    Note that the protein DAAAAATTLTTTAMTTTTTTCK does not appear in the output data structure because, while it ends in a K that is not followed by a P and so might in some cases be considered to be followed by an empty (or null) string, split will not produce trailing null fields when called as it is in the code. (Update: Therefore, DAAAAATTLTTTAMTTTTTTCK is considered not to have been split at all, and so does not appear in the output structure.) See split for the rules about producing null trailing (and leading) fields. Note also that the protein XXXXXXX does not appear in the output structure because it contains no split point whatsoever.
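
    The trailing-field behaviour can be seen in isolation with a small sketch. It uses the same split pattern as the one-liner; the only addition is the negative LIMIT argument, which per the split documentation tells split to keep trailing empty fields.

```perl
use strict;
use warnings;

my $protein = 'DAAAAATTLTTTAMTTTTTTCK';

# Default call: the match after the final K would yield an empty trailing
# field, which split silently drops, leaving a single field.
my @default = split /(?<=[KR])(?!P)/, $protein;

# Negative LIMIT: trailing empty fields are preserved.
my @keep_trailing = split /(?<=[KR])(?!P)/, $protein, -1;

printf "default:  %d field(s)\n", scalar @default;        # 1
printf "limit -1: %d field(s)\n", scalar @keep_trailing;  # 2
```

    With the default call the protein comes back as one unsplit field, which is why the `next if @peptides < 2` test skips it.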

    See Perl Data Structures Cookbook (perldsc) for more info on generating and accessing complex Perl data structures.
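
    As a rough sketch of accessing such a structure, the following walks it and produces the ">Protein N Peptide M" layout asked for in the question. The helper name format_peptides is hypothetical, and the numbering scheme (proteins numbered by their position in the original list, including ones that were skipped) is an assumption; adjust it if you want only digested proteins counted.

```perl
use strict;
use warnings;

my @proteins = qw(
    DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
    ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
);

# Build the structure the same way as above: protein => [ [ peptides ] ]
my %protein_peptides;
for my $protein (@proteins) {
    my @peptides = split /(?<=[KR])(?!P)/, $protein;
    next if @peptides < 2;
    push @{ $protein_peptides{$protein} }, \@peptides;
}

# Hypothetical helper: takes the ordered protein list (hashes are unordered)
# and the hash of digests, and returns header/sequence lines.
sub format_peptides {
    my ($order, $by_protein) = @_;
    my @lines;
    my $protein_n = 0;
    for my $protein (@$order) {
        ++$protein_n;                       # assumption: count every protein
        my $digests = $by_protein->{$protein} or next;
        for my $peptides (@$digests) {      # extra array level from the push
            my $peptide_n = 0;
            for my $peptide (@$peptides) {
                ++$peptide_n;
                push @lines, ">Protein $protein_n Peptide $peptide_n", $peptide;
            }
        }
    }
    return @lines;
}

print "$_\n" for format_peptides(\@proteins, \%protein_peptides);
```

    Passing the original array alongside the hash keeps the protein numbers tied to input order, which the hash alone cannot guarantee.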


    Give a man a fish:  <%-{-{-{-<

      Many thanks for the help; it seems to be working now. I definitely need to learn more about CPAN.