in reply to Bioinformatics: Regex loop, no output

The following uses a lookbehind, (?<=[KR]), and a negative lookahead, (?!P), as the split pattern. The split happens after every K or R that is not immediately followed by P, which slices the protein into peptides:

use strict;
use warnings;

my @proteins = qw(
    DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
    ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
    DAAAAATTLTTTAMTTTTTTCK
);

for my $protein (@proteins) {
    # Zero-width split: cut after K or R unless the next residue is P
    my @peptides = split /(?<=[KR])(?!P)/, $protein;

    # Skip proteins that were not cut at all
    next if @peptides < 2;

    print "Protein: $protein\n";
    print "Peptides:\n";
    print " $_\n" for @peptides;
}

Prints (the third protein yields a single peptide, so it is skipped):

Protein: DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
Peptides:
 DAAAAATTLTTTAMTTTTTTCK
 MMFRPPPPPGGGGGGGGGGGG
Protein: ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
Peptides:
 ALTAMCMNVWEITYHK
 GSDVNR
 R
 ASFAQPPPQPPPPLLAIKPASDASD
Premature optimization is the root of all job security

Re^2: Bioinformatics: Regex loop, no output
by TamaDP (Initiate) on Nov 16, 2015 at 13:15 UTC
    Thanks all! It's working now, and I also added more enzymes, selectable via getopts. The peptides print all right. However, how could I track which peptide comes from which protein, in order to make the printing a little more organised? I.e.:

    Protein 1 Peptide 1
    Protein 1 Peptide 2
    ...
    Protein n Peptide x
      In what way does GrandFather's solution (which you replied to) not do what you want?

        Sorry, I didn't explain myself very well there. I have an array of proteins that are digested with one enzyme, which the user selects. I get an array of peptides, and I want to send that array to a subroutine for printing, where the output looks like this:

        >Protein 1 Peptide 1
        DAAAAATTLTTTAMTTTTTTCK
        >Protein 1 Peptide 2
        MMFRPPPPPGGGGGGGGGGGG
        >Protein 2 Peptide 1
        ALTAMCMNVWEITYHK

        And so on... So in order to format the printing like that, I need to track which peptide belongs to each protein. Or am I making things more complicated than necessary? Thx
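
One possible way to keep that link, offered as a rough sketch rather than the thread's agreed answer: instead of flattening everything into one array of peptides, keep one array reference of peptides per protein, in the same order as the proteins. The position in the outer array then gives the protein number and the position in the inner array gives the peptide number. The names digest and format_records are made up for this illustration, and the split pattern is the one from the post above; a different enzyme would need its own pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): cut after K or R unless
# the next residue is P, using the split pattern from the post above.
sub digest {
    my ($protein) = @_;
    return split /(?<=[KR])(?!P)/, $protein;
}

# Hypothetical helper (name is an assumption): takes a reference to an
# array of array refs, one inner array of peptides per protein, and
# returns header/sequence lines numbered from 1.
sub format_records {
    my ($digests) = @_;
    my @lines;
    for my $p (0 .. $#{$digests}) {
        my @peptides = @{ $digests->[$p] };
        for my $q (0 .. $#peptides) {
            push @lines, sprintf(">Protein %d Peptide %d", $p + 1, $q + 1);
            push @lines, $peptides[$q];
        }
    }
    return @lines;
}

my @proteins = qw(
    DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
    ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
);

# One array ref of peptides per protein, in protein order
my @digests = map { [ digest($_) ] } @proteins;

print "$_\n" for format_records(\@digests);
```

Because the nesting carries the protein/peptide relationship, the printing subroutine needs no extra bookkeeping. If a flat list is required elsewhere, it can still be produced from the nested one with map { @$_ } @digests.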