Cantello has asked for the wisdom of the Perl Monks concerning the following question:

Hey all, I am trying to split a string into different parts, based on content (a protein sequence digested by enzymes cutting at different amino acids). I have tried:
my $a = "AAAAXRXAAAAAXKXAAAAAXRPXAAA" # sample sequence
@digested0 = split(/[KR](?!P)/, $orf); # cut at every K or R, except if followed by a P
print join("\n",@digested0);
The output is:
AAAAX
XAAAAAX
XAAAAAXRPXAAA
It's basically correct but it should be
AAAAXR
XAAAAAXK
XAAAAAXRPXAAA
i.e. keeping the separator "K" or "R". How could I achieve this? There probably is a regexp for that as well, I'm just too blind to find it... Thanks! :-)

Re: Splitting string (regex) while keeping the separator?
by moritz (Cardinal) on May 03, 2010 at 14:08 UTC
    If you want to have the delimiter as a separate list item, then use capturing parentheses around it.

    If you want to have it attached to the previous item, you need to make it a zero-width look-behind:

    split(/(?<=[KR])(?!P)/, $a);
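The two options above can be compared side by side. This is a minimal sketch that assumes the sample sequence from the question; the variable names `$orf`, `@with_sep` and `@attached` are chosen here for illustration only.

```perl
use strict;
use warnings;

my $orf = 'AAAAXRXAAAAAXKXAAAAAXRPXAAA';

# Capturing parentheses: each K or R (not followed by P) comes back
# as its own list element between the fragments.
my @with_sep = split /([KR])(?!P)/, $orf;

# Zero-width look-behind: the split point sits just after the K or R,
# so the residue stays attached to the preceding fragment.
my @attached = split /(?<=[KR])(?!P)/, $orf;

print join(',', @with_sep), "\n";   # AAAAX,R,XAAAAAX,K,XAAAAAXRPXAAA
print join(',', @attached), "\n";   # AAAAXR,XAAAAAXK,XAAAAAXRPXAAA
```

The look-behind form consumes no characters, which is why nothing is lost from the fragments.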

    Also you have a few errors in your script that prevent the snippet you posted from producing the output you show. (Mis-named variables, missing semicolon).

Re: Splitting string (regex) while keeping the separator?
by AnomalousMonk (Archbishop) on May 03, 2010 at 16:04 UTC

    Note that split passes the original string through unchanged, as a single list element, if the split pattern never matches. If that is not wanted, check for it explicitly.

    >perl -wMstrict -le
    "my $orf = 'AAAAXXAAAAAXXAAAAAXPXAAA';
     my @digested0 = split /(?<=[KR](?!P))/, $orf;
     print for @digested0;
    "
    AAAAXXAAAAAXXAAAAAXPXAAA

    Here's an m// solution without this drawback:

    >perl -wMstrict -le
    "my $a = 'AAAAXRXAAAAAXKXAAAAAXRPXAAA';
     my $cut = qr{ [KR] (?!P) }xms;
     my $digest = qr{ .*? $cut | (?<= $cut) .* \z }xms;
     my @digested = $a =~ m{ $digest }xmsg;
     print qq{'$_'} for @digested;
     $a = 'AAAXXXPPPAAAXXXAAAPPPAAA';
     @digested = $a =~ m{ $digest }xmsg;
     print '@digested empty' unless @digested;
    "
    'AAAAXR'
    'XAAAAAXK'
    'XAAAAAXRPXAAA'
    @digested empty
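If the split form is preferred, the pass-through case can also be guarded explicitly by testing the pattern before splitting. The following is only a sketch: `digest` is a hypothetical helper name, not a function from any module, and it assumes that an empty list is the wanted result when the sequence has no cleavage site.

```perl
use strict;
use warnings;

# Cut after K or R unless the next residue is P (same rule as in the thread).
my $cut_after = qr/(?<=[KR])(?!P)/;

# digest() is a hypothetical helper for illustration.
# It returns the fragments, or an empty list if there is no cleavage site.
sub digest {
    my ($seq) = @_;
    return unless $seq =~ $cut_after;   # no site: do not pass the input through
    return split $cut_after, $seq;
}

for my $seq ('AAAAXRXAAAAAXKXAAAAAXRPXAAA', 'AAAAXXAAAAAXXAAAAAXPXAAA') {
    my @fragments = digest($seq);
    if (@fragments) {
        print "$_\n" for @fragments;
    }
    else {
        print "no cleavage site in $seq\n";
    }
}
```

Testing the pattern itself, rather than counting the fragments, keeps a sequence whose only site is at its very end from being reported as undigested.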
Re: Splitting string (regex) while keeping the separator?
by Anonymous Monk on May 03, 2010 at 14:50 UTC
    $ perl -le'
    my $orf = "AAAAXRXAAAAAXKXAAAAAXRPXAAA";
    my @digested0 = split /(?<=[KR](?!P))/, $orf;
    print for @digested0;
    '
    AAAAXR
    XAAAAAXK
    XAAAAAXRPXAAA
Re: Splitting string (regex) while keeping the separator?
by Sinistral (Monsignor) on May 03, 2010 at 20:17 UTC

    If you haven't already heard of the Bioperl site, you should head over there and look for already optimized and tested solutions for the type of task you are trying to accomplish. In particular (I am not a bio-person), you should look at the Bioperl scripts page.

      Thanks guys, that was really awesome help (even though I made some mistakes asking; sorry for that)! I will work through the examples you've given and also head over to the bioperl site. Thanks again... :-)
Re: Splitting string (regex) while keeping the separator?
by Anonymous Monk on May 03, 2010 at 14:09 UTC
    Read perlintro and use capturing parentheses.