clueless_perl has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am having some trouble using recursion in order to find all possible DNA sequences from a sequence (AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA). Each time I try to run my code I get stuck in deep recursion. Here is what I have...can anyone tell me where I am going wrong?
#!/usr/bin/perl use strict; use warnings; my $inputseq = "AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA"; my $regex = '\w+(ATG\w*T(AG|AA|GA))\w+'; find_coding($inputseq); sub find_coding { my ( $seq ) = @_; while ( $seq =~ /($regex)/g ) { my $match = $1; my $sequence = find_coding($match); my $allmatch = find_coding($sequence); print "Coding region: $allmatch", "\n"; } }
Thanks in advance!

Replies are listed 'Best First'.
Re: Using Recursion to Find DNA Sequences
by AnomalousMonk (Archbishop) on Oct 29, 2017 at 15:05 UTC
    my $regex = '\w+(ATG\w*T(AG|AA|GA))\w+';

    Going wrong:

    • The  $regex you have defined matches the entire string! You then take this match (the entire string) and feed it to  find_coding() again: deep recursion. (Update: See this for more detailed discussion of recursion problems.)
      c:\@Work\Perl\monks>perl -wMstrict -le "my $inputseq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'; print qq{'$inputseq'}; ;; my $regex = '\w+(ATG\w*T(AG|AA|GA))\w+'; $inputseq =~ /($regex)/g; print qq{'$1'}; " 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA' 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'
    • A small point: Please don't define regexes as strings: A number of subtle bugs can be encountered. Use  qr// (see perlop).
    • Another small point: Please try to avoid using capturing groups in  qr// objects (especially absolutely unnecessarily, as in this case). It makes counting groups at the top level a headache.
    • Please define what you mean by "... all possible DNA sequences ..."

    Update: Several small edits for clarity and de-typo-ization.


    Give a man a fish:  <%-{-{-{-<

      Thank you for pointing out my unclear question. What I mean by all DNA sequences I mean any coding region from the original DNA sequence, so anything that begins with a start codon (ATG) and ends with stop codon (TAG, TAA, or TGA). My problem is that some of these coding regions overlap in the original sequence so when I just try to match the sequence to the regex I only get one sequence, not the overlapping ones. There should be a total of 4 coding regions in the original sequence. I originally had tried changing my regex to ATG\w*T(AG|AA|GA) but I still get deep recursion. Wouldn't that regex just be matching the coding regions I want? When I got deep recursion, I tried changing my regex, but I see that I just made it match the entire string. Is the problem still with my regex or am I using recursion incorrectly?

        Try something like:

        c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my $seq = 'xABCxABCxxxWXYxWXZxxxABCxxWXYx'; ;; my $subseq = qr{ ABC \w* (?: WXY | WXZ) }xms; ;; my @all = find_all($seq, $subseq); dd \@all; ;; ;; sub find_all { my ($seq, $regex) = @_; ;; local our @hits; use re 'eval'; $seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg; ;; return @hits; } " [ ["ABCxABCxxxWXYxWXZxxxABCxxWXY", 1], ["ABCxABCxxxWXYxWXZ", 1], ["ABCxABCxxxWXY", 1], ["ABCxxxWXYxWXZxxxABCxxWXY", 5], ["ABCxxxWXYxWXZ", 5], ["ABCxxxWXY", 5], ["ABCxxWXY", 21], ]
        (I'm just using  ...ABCxxWXY... to make the permutations and overlaps clear.) (Update: The number that is the second item in each array reference returned is the base-0 offset of the start of the matching subsequence.)

        Update: Using your original sequence:

        c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'; ;; my $subseq = qr{ ATG \w* (?: TAG | TAA | TGA) }xms; ;; my @all = find_all($seq, $subseq); dd \@all; ;; ;; sub find_all { my ($seq, $regex) = @_; ;; local our @hits; use re 'eval'; $seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg; ;; return @hits; } " [ ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAA", 1], ["ATGATCTAA", 40], ]
        This works with Perl 5.8+ regexes. What version of Perl are you using — it might make a difference in future?

        Update 2: Remembering that DNA sequences may sometimes be loooong, it may be advantageous to pass the sequence by reference. Note that both the function and the function invocation must change.

        c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'; ;; my $subseq = qr{ ATG \w* (?: TAG | TAA | TGA) }xms; ;; my @all = find_all(\$seq, $subseq); dd \@all; ;; ;; sub find_all { my ($sr_seq, $regex) = @_; ;; local our @hits; use re 'eval'; $$sr_seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg; ;; return @hits; } " [ ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAA", 1], ["ATGATCTAA", 40], ]
        Still runs under Perl 5.8.


        Give a man a fish:  <%-{-{-{-<

        When I got deep recursion, I tried changing my regex, but I see that I just made it match the entire string. Is the problem still with my regex or am I using recursion incorrectly?

        I don't see any need for recursion at all. As far as I can see, everything you want to do can be done in a properly constituted regex. There may be speed considerations that make a different approach preferable, but they can only be addressed by careful specification and benchmarking.

        Update: To directly address your question about recursion:

        sub find_coding { my ( $seq ) = @_; while ( $seq =~ /($regex)/g ) { my $match = $1; my $sequence = find_coding($match); my $allmatch = find_coding($sequence); print "Coding region: $allmatch", "\n"; } }
        When you find any match of  $seq against the regex, you take the matching subsequence  $1 and feed it back to the function again:
            my $match = $1;
            my $sequence = find_coding($match);
        Of course, this subsequence matches again because it already matched, and we're off to infinity... and beyond!

        Note also that the
            my $sequence = find_coding($match);
        statement and the following
            my $allmatch = find_coding($sequence);
        statement (were it ever to be executed) are meaningless because the  find_coding() function doesn't return anything meaningful; as best I can see, it would return (if it ever did) the return value of the print statement at the end of the while-loop. (Update: Changed this paragraph to address both statements.)

        Update:

        ...the  find_coding() function ... would return ... the return value of the print statement at the end of the while-loop.
        On second thought, I take this back. Without an explicit return statement, a function returns the value of the last expression executed in the function. My idea was that the print statement at the end of the while-loop would be that last expression. In fact, if the function didn't infinitely recurse, the last expression executed in the function would be the  $seq =~ /($regex)/g regex in the conditional of the while-loop: when this expression evaluates false, the loop will exit and, as there is no code in the function after the loop, the function will exit and return this ever-false value.


        Give a man a fish:  <%-{-{-{-<

Re: Using Recursion to Find DNA Sequences
by Cristoforo (Curate) on Oct 29, 2017 at 18:56 UTC
    Here is an approach that will get the results. It uses more of an iterative method.
    #!/usr/bin/perl use strict; use warnings; my $seq = "AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA"; my @beg; push @beg, $-[0] while $seq =~ /ATG/g; my @end; push @end, $+[0] while $seq =~ /T(?:AG|AA|GA)/g; my @data; for my $i (@beg) { for my $j (@end) { next if $j-$i < 6; push @data, substr $seq, $i, $j-$i; } } print "$_\n" for @data;
    Output:
    ATGGTTTCTCCCATCTCTCCATCGGCATAA ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA ATGATCTAA
Re: Using Recursion to Find DNA Sequences
by johngg (Canon) on Oct 29, 2017 at 15:31 UTC

    Please could you confirm that these are the matches that you are hoping for.

    AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA ^^^------------------------^^^ ^^^-------------------------------------^^^ ^^^------------------------------------------^^^ ^^^---^^^

    Those are the ones I can spot using just my eyes but I might be misunderstanding your rules.

    Cheers,

    JohnGG

      Thanks everyone! The matches that I am looking for (which JohnGG correctly highlighted) are: ATGGTTTCTCCCATCTCTCCATCGGCATAA ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA ATGATCTAA

        To get the shortest-to-longest length ordering from the approach I outlined here, just use a  ? "lazy" modifier on the quantifier of the  \w* "padding" sub-pattern making it  \w*?:
            qr{ ATG \w*? (?: TAG | TAA | TGA) }xms;


        Give a man a fish:  <%-{-{-{-<