PerlKc has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I need to print every sequence that starts with ATG and ends with TGA, TAA, or TAG. My code prints only the two sequences ending with TAA. I am having trouble printing the sequence ending with TGA as well, even though there is a TGA codon in the original sequence.

#!/usr/bin/perl
#FindCoding.pl
use warnings;
use strict;
use diagnostics;

my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
while ($sequence =~ /(ATG.*?(?:TAG|TAA|TGA))/g) {
    print "$1\n";
}

The output is:

ATGGTTTCTCCCATCTCTCCATCGGCATAA
ATGATCTAA

However, I am looking for the sequence ending with the TGA codon as well.

Replies are listed 'Best First'.
Re: Print A Sequence with Start codon and different Stop Codon
by choroba (Cardinal) on Oct 27, 2015 at 23:18 UTC
    It's not clear what output you expect. To search for overlapping sequences, you can change the second group from a non-capturing group to a look-behind:
    $sequence =~ /(ATG.*?(?<=TAA|TAG|TGA))/g

    to get

    ATGGTTTCTCCCATCTCTCCATCGGCATAA
    ATGA

    It still extracts the shortest possible sequence for each starting point (so we lost the second output).
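    To make the difference concrete, here is a small side-by-side sketch of the two patterns on the sequence from the question. The variable names @consuming and @look_behind are only illustrative. It relies on a /g match in list context returning the captured groups.

```perl
use strict;
use warnings;

my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

# Original pattern: the stop codon has to begin after the ATG has ended,
# so the overlapping ATGA near the end of the sequence is never seen.
my @consuming = $sequence =~ /(ATG.*?(?:TAA|TAG|TGA))/g;

# Look-behind variant: the lazy .*? stops as soon as the last three letters
# matched spell a stop codon, even when they overlap the ATG itself.
my @look_behind = $sequence =~ /(ATG.*?(?<=TAA|TAG|TGA))/g;

print "consuming:   @consuming\n";
print "look-behind: @look_behind\n";
```

    Both variants still report only one match per ATG, because .*? is lazy and each /g iteration resumes where the previous match ended.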

    Update: It's possible to get all the sequences without relying on experimental regex features or on the return value of print, like here.

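    use feature 'say';    # needed for say; $sequence is defined as in the question

    # Collect the offset of every start codon.
    my @from;
    my $pos = -1;
    push @from, $pos while -1 != ($pos = index $sequence, 'ATG', $pos + 1);

    # Collect the offset just past the end of every stop codon.
    my @to;
    for my $end (qw( TAA TAG TGA )) {
        $pos = -1;
        push @to, $pos + 3 while -1 != ($pos = index $sequence, $end, $pos + 1);
    }

    # Print every start/stop pair where the stop ends after the start.
    for my $f (@from) {
        for my $t (@to) {
            say substr $sequence, $f, $t - $f if $t > $f;
        }
    }

    __END__
    Output:
    ATGGTTTCTCCCATCTCTCCATCGGCATAA
    ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA
    ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA
    ATGATCTAA
    ATGA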
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Yes, but it was so much fun :)

      #!/usr/bin/perl -l
      # http://perlmonks.org/?node_id=1146191
      use strict;
      use warnings;

      my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
      while( $sequence =~ /ATG/g ) {
          my $rest = $';
          print 'ATG' . $` . $1 while $rest =~ /(TAG|TAA|TGA)/g;
      }

        I tried that, but my output should be a set of sequences with start codon ATG and end codon TAG, TAA, or TGA. For example:

        ATG...............TAA
        ATG...........TAG
        ATG.........................TGA
        ATG.................................TAA

        The dots represent the sequence between the start and stop codons.

      I tried that, but my output should be a set of sequences with start codon ATG and end codon TAG, TAA, or TGA. For example:

      ATG...............TAA
      ATG...........TAG
      ATG.........................TGA
      ATG.................................TAA

      The dots represent the sequence between the start and stop codons. I am looking for regex features to get the output. Thanks

Re: Print A Sequence with Start codon and different Stop Codon
by Anonymous Monk on Oct 27, 2015 at 23:21 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1146191
    use strict;
    use warnings;

    my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
    $sequence =~ /(ATG.*?(?:TAG|TAA|TGA))(??{print "$1\n"})/;

    which prints:

    ATGGTTTCTCCCATCTCTCCATCGGCATAA
    ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA
    ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA
    ATGATCTAA

      The output is exactly what I need, but I couldn't reproduce it when I ran the code you posted.

      Thanks a bunch, it works. YAYYYYYYYY! I have a quick question: what does the ?? before the print command do?

        See "(??{ code })" in perldoc perlre
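        As a rough explanation of why that one-liner prints every candidate: (??{ ... }) runs the code and then uses its return value as a further piece of the pattern. print returns 1, and the text "1" never occurs in a DNA sequence, so the match fails at that point and the engine backtracks to the next possible stop codon, printing again each time. The sketch below spells out the same backtracking trick with (?{ ... }) followed by (*FAIL), collecting the results in an array instead of printing them. The sub name all_orfs is made up for this example, and it assumes Perl 5.18 or later, where lexical variables behave predictably inside regex code blocks.

```perl
use strict;
use warnings;

# Hypothetical helper: return every substring that runs from an ATG
# to any later TAG, TAA or TGA, by forcing the regex engine to backtrack.
sub all_orfs {
    my ($sequence) = @_;
    my @found;
    $sequence =~ /
        (ATG.*?(?:TAG|TAA|TGA))   # candidate: start codon through a stop codon
        (?{ push @found, $1 })    # record the candidate
        (*FAIL)                   # reject it so the engine tries the next one
    /x;
    return @found;
}

my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
print "$_\n" for all_orfs($sequence);
```

        On the sequence from the question this yields the same four lines as the reply above. Like that reply, it does not report ATGA, because the stop codon has to begin after the ATG has ended.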

Re: Print A Sequence with Start codon and different Stop Codon
by ww (Archbishop) on Oct 28, 2015 at 18:16 UTC

    Smells a lot like homework!

    Compare: Regular expressions (another author with an almost identical problem -- or a SOPW posting under two handles?)

    OP: when you post homework, mark it as such! You won't really learn much if some of our (less-than-discreet) Monks simply hand you an answer.