in reply to Using Recursion to Find DNA Sequences

my $regex = '\w+(ATG\w*T(AG|AA|GA))\w+';

Going wrong:

Update: Several small edits for clarity and de-typo-ization.


Give a man a fish:  <%-{-{-{-<

Replies are listed 'Best First'.
Re^2: Using Recursion to Find DNA Sequences
by clueless_perl (Initiate) on Oct 29, 2017 at 15:13 UTC
    Thank you for pointing out my unclear question. What I mean by all DNA sequences I mean any coding region from the original DNA sequence, so anything that begins with a start codon (ATG) and ends with stop codon (TAG, TAA, or TGA). My problem is that some of these coding regions overlap in the original sequence so when I just try to match the sequence to the regex I only get one sequence, not the overlapping ones. There should be a total of 4 coding regions in the original sequence. I originally had tried changing my regex to ATG\w*T(AG|AA|GA) but I still get deep recursion. Wouldn't that regex just be matching the coding regions I want? When I got deep recursion, I tried changing my regex, but I see that I just made it match the entire string. Is the problem still with my regex or am I using recursion incorrectly?

      Try something like:

      c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my $seq = 'xABCxABCxxxWXYxWXZxxxABCxxWXYx'; ;; my $subseq = qr{ ABC \w* (?: WXY | WXZ) }xms; ;; my @all = find_all($seq, $subseq); dd \@all; ;; ;; sub find_all { my ($seq, $regex) = @_; ;; local our @hits; use re 'eval'; $seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg; ;; return @hits; } " [ ["ABCxABCxxxWXYxWXZxxxABCxxWXY", 1], ["ABCxABCxxxWXYxWXZ", 1], ["ABCxABCxxxWXY", 1], ["ABCxxxWXYxWXZxxxABCxxWXY", 5], ["ABCxxxWXYxWXZ", 5], ["ABCxxxWXY", 5], ["ABCxxWXY", 21], ]
      (I'm just using  ...ABCxxWXY... to make the permutations and overlaps clear.) (Update: The number that is the second item in each array reference returned is the base-0 offset of the start of the matching subsequence.)

      Update: Using your original sequence:

      c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'; ;; my $subseq = qr{ ATG \w* (?: TAG | TAA | TGA) }xms; ;; my @all = find_all($seq, $subseq); dd \@all; ;; ;; sub find_all { my ($seq, $regex) = @_; ;; local our @hits; use re 'eval'; $seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg; ;; return @hits; } " [ ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAA", 1], ["ATGATCTAA", 40], ]
      This works with Perl 5.8+ regexes. What version of Perl are you using — it might make a difference in future?

      Update 2: Remembering that DNA sequences may sometimes be loooong, it may be advantageous to pass the sequence by reference. Note that both the function and the function invocation must change.

      c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'; ;; my $subseq = qr{ ATG \w* (?: TAG | TAA | TGA) }xms; ;; my @all = find_all(\$seq, $subseq); dd \@all; ;; ;; sub find_all { my ($sr_seq, $regex) = @_; ;; local our @hits; use re 'eval'; $$sr_seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg; ;; return @hits; } " [ ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAA", 1], ["ATGATCTAA", 40], ]
      Still runs under Perl 5.8.


      Give a man a fish:  <%-{-{-{-<

      When I got deep recursion, I tried changing my regex, but I see that I just made it match the entire string. Is the problem still with my regex or am I using recursion incorrectly?

      I don't see any need for recursion at all. As far as I can see, everything you want to do can be done in a properly constituted regex. There may be speed considerations that make a different approach preferable, but they can only be addressed by careful specification and benchmarking.

      Update: To directly address your question about recursion:

      sub find_coding { my ( $seq ) = @_; while ( $seq =~ /($regex)/g ) { my $match = $1; my $sequence = find_coding($match); my $allmatch = find_coding($sequence); print "Coding region: $allmatch", "\n"; } }
      When you find any match of  $seq against the regex, you take the matching subsequence  $1 and feed it back to the function again:
          my $match = $1;
          my $sequence = find_coding($match);
      Of course, this subsequence matches again because it already matched, and we're off to infinity... and beyond!

      Note also that the
          my $sequence = find_coding($match);
      statement and the following
          my $allmatch = find_coding($sequence);
      statement (were it ever to be executed) are meaningless because the  find_coding() function doesn't return anything meaningful; as best I can see, it would return (if it ever did) the return value of the print statement at the end of the while-loop. (Update: Changed this paragraph to address both statements.)

      Update:

      ...the  find_coding() function ... would return ... the return value of the print statement at the end of the while-loop.
      On second thought, I take this back. Without an explicit return statement, a function returns the value of the last expression executed in the function. My idea was that the print statement at the end of the while-loop would be that last expression. In fact, if the function didn't infinitely recurse, the last expression executed in the function would be the  $seq =~ /($regex)/g regex in the conditional of the while-loop: when this expression evaluates false, the loop will exit and, as there is no code in the function after the loop, the function will exit and return this ever-false value.


      Give a man a fish:  <%-{-{-{-<