Re: Using Recursion to Find DNA Sequences

my $regex = '\w+(ATG\w*T(AG|AA|GA))\w+';

Going wrong:

The $regex you have defined matches the entire string! You then take this match (the entire string) and feed it to find_coding() again: deep recursion. (Update: See this for more detailed discussion of recursion problems.)

c:\@Work\Perl\monks>perl -wMstrict -le
"my $inputseq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA';
 print qq{'$inputseq'};
 ;;
 my $regex = '\w+(ATG\w*T(AG|AA|GA))\w+';
 $inputseq =~ /($regex)/g;
 print qq{'$1'};
"
'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'
'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'

A small point: Please don't define regexes as strings: A number of subtle bugs can be encountered. Use qr// (see perlop).
Another small point: Please try to avoid using capturing groups in qr// objects (especially absolutely unnecessarily, as in this case). It makes counting groups at the top level a headache.
Please define what you mean by "... all possible DNA sequences ..."
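
To illustrate the two small points above, here is a minimal sketch. The names `$rx_coding` and `$first` are hypothetical and do not come from the original post. The pattern is compiled with qr// and uses only non-capturing (?: ) groups, so the single capture group is the one added where the pattern is actually used.

```perl
use strict;
use warnings;

my $inputseq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA';

# qr// compiles the pattern once; /x allows the whitespace used for layout.
# (?: ... ) groups the stop codons without creating a capture group.
# $rx_coding is a hypothetical name chosen for this sketch.
my $rx_coding = qr{ ATG \w* (?: TAG | TAA | TGA ) }xms;

# The only capturing group is added here, at the point of use, so it is $1.
my ($first) = $inputseq =~ m{ ($rx_coding) }xms;

print "first match: $first\n";
```

Because `\w*` is greedy, this prints the longest coding region that starts at the first ATG, which here runs from offset 1 to the end of the string.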

Update: Several small edits for clarity and de-typo-ization.

Give a man a fish: <%-{-{-{-<

Re^2: Using Recursion to Find DNA Sequences by clueless_perl (Initiate) on Oct 29, 2017 at 15:13 UTC
Thank you for pointing out my unclear question. By "all possible DNA sequences" I mean any coding region from the original DNA sequence, that is, anything that begins with a start codon (ATG) and ends with a stop codon (TAG, TAA, or TGA). My problem is that some of these coding regions overlap in the original sequence, so when I just try to match the sequence against the regex I only get one sequence, not the overlapping ones. There should be a total of 4 coding regions in the original sequence. I originally had tried changing my regex to ATG\w*T(AG|AA|GA) but I still got deep recursion. Wouldn't that regex just be matching the coding regions I want? When I got deep recursion, I tried changing my regex, but I see that I just made it match the entire string. Is the problem still with my regex or am I using recursion incorrectly?
Re^3: Using Recursion to Find DNA Sequences by AnomalousMonk (Archbishop) on Oct 29, 2017 at 15:39 UTC
Try something like:

c:\@Work\Perl\monks>perl -wMstrict -le
"use Data::Dump qw(dd);
 ;;
 my $seq = 'xABCxABCxxxWXYxWXZxxxABCxxWXYx';
 ;;
 my $subseq = qr{ ABC \w* (?: WXY | WXZ) }xms;
 ;;
 my @all = find_all($seq, $subseq);
 dd \@all;
 ;;
 ;;
 sub find_all {
   my ($seq, $regex) = @_;
 ;;
   local our @hits;
   use re 'eval';
   $seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg;
 ;;
   return @hits;
   }
"
[
  ["ABCxABCxxxWXYxWXZxxxABCxxWXY", 1],
  ["ABCxABCxxxWXYxWXZ", 1],
  ["ABCxABCxxxWXY", 1],
  ["ABCxxxWXYxWXZxxxABCxxWXY", 5],
  ["ABCxxxWXYxWXZ", 5],
  ["ABCxxxWXY", 5],
  ["ABCxxWXY", 21],
]

(I'm just using `...ABCxxWXY...` to make the permutations and overlaps clear.) (Update: The number that is the second item in each array reference returned is the base-0 offset of the start of the matching subsequence.)

Update: Using your original sequence:

c:\@Work\Perl\monks>perl -wMstrict -le
"use Data::Dump qw(dd);
 ;;
 my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA';
 ;;
 my $subseq = qr{ ATG \w* (?: TAG | TAA | TGA) }xms;
 ;;
 my @all = find_all($seq, $subseq);
 dd \@all;
 ;;
 ;;
 sub find_all {
   my ($seq, $regex) = @_;
 ;;
   local our @hits;
   use re 'eval';
   $seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg;
 ;;
   return @hits;
   }
"
[
  ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA", 1],
  ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA", 1],
  ["ATGGTTTCTCCCATCTCTCCATCGGCATAA", 1],
  ["ATGATCTAA", 40],
]

This works with Perl 5.8+ regexes. What version of Perl are you using? It might make a difference in future.

Update 2: Remembering that DNA sequences may sometimes be loooong, it may be advantageous to pass the sequence by reference. Note that both the function and the function invocation must change.

c:\@Work\Perl\monks>perl -wMstrict -le
"use Data::Dump qw(dd);
 ;;
 my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA';
 ;;
 my $subseq = qr{ ATG \w* (?: TAG | TAA | TGA) }xms;
 ;;
 my @all = find_all(\$seq, $subseq);
 dd \@all;
 ;;
 ;;
 sub find_all {
   my ($sr_seq, $regex) = @_;
 ;;
   local our @hits;
   use re 'eval';
   $$sr_seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg;
 ;;
   return @hits;
   }
"
[
  ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA", 1],
  ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA", 1],
  ["ATGGTTTCTCCCATCTCTCCATCGGCATAA", 1],
  ["ATGATCTAA", 40],
]

Still runs under Perl 5.8.
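
The `(?{ ... }) (?!)` combination in `find_all()` may be easier to follow on a tiny input. The following is a minimal sketch of the same idiom on the string 'aaa'; the input and the file-scoped `@hits` are chosen only for illustration. The code block records each candidate match, and `(?!)`, which can never succeed, then forces the regex engine to backtrack and try the next candidate.

```perl
use strict;
use warnings;

our @hits;    # package variable, mirroring the "local our @hits" in find_all()

# (a+) first matches as much as it can. The code block records the capture
# ($^N) and its start offset ($-[1]). (?!) then always fails, so the engine
# backtracks to the next shorter match and, once those are exhausted, moves
# on to the next starting position. The overall match therefore fails, but
# every candidate has been recorded along the way.
'aaa' =~ m{ (a+) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xms;

print "$_->[0] at offset $_->[1]\n" for @hits;
```

The recorded substrings come out longest first for each starting position: aaa, aa, a, then aa, a, then a. This is the same ordering seen in the `find_all()` output above.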
Re^3: Using Recursion to Find DNA Sequences by AnomalousMonk (Archbishop) on Oct 29, 2017 at 16:19 UTC
"When I got deep recursion, I tried changing my regex, but I see that I just made it match the entire string. Is the problem still with my regex or am I using recursion incorrectly?"

I don't see any need for recursion at all. As far as I can see, everything you want to do can be done in a properly constituted regex. There may be speed considerations that make a different approach preferable, but they can only be addressed by careful specification and benchmarking.

Update: To directly address your question about recursion:

    sub find_coding {
        my ( $seq ) = @_;
        while ( $seq =~ /($regex)/g ) {
            my $match = $1;
            my $sequence = find_coding($match);
            my $allmatch = find_coding($sequence);
            print "Coding region: $allmatch", "\n";
        }
    }

When you find any match of `$seq` against the regex, you take the matching subsequence `$1` and feed it back to the function again:

    my $match = $1;
    my $sequence = find_coding($match);

Of course, this subsequence matches again because it already matched, and we're off to infinity... and beyond!

Note also that the `my $sequence = find_coding($match);` statement and the following `my $allmatch = find_coding($sequence);` statement (were it ever to be executed) are meaningless because the `find_coding()` function doesn't return anything meaningful; as best I can see, it would return (if it ever did) the return value of the print statement at the end of the `while`-loop. (Update: Changed this paragraph to address both statements.)

Update: "...the `find_coding()` function ... would return ... the return value of the `print` statement at the end of the `while`-loop." On second thought, I take this back. Without an explicit return statement, a function returns the value of the last expression executed in the function. My idea was that the print statement at the end of the `while`-loop would be that last expression. In fact, if the function didn't infinitely recurse, the last expression executed in the function would be the `$seq =~ /($regex)/g` regex in the conditional of the `while`-loop: when this expression evaluates false, the loop will exit and, as there is no code in the function after the loop, the function will exit and return this ever-false value.
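
Since perlsub describes the return value of a sub whose last statement is a loop as unspecified, an explicit `return` is the safer choice. Below is a minimal non-recursive sketch that avoids both problems. The name `find_coding_iter` and its `[ region, offset ]` return shape are hypothetical, and it swaps in plain nested loops over start and stop codons, not the single-regex approach shown earlier in the thread. It also assumes the sequence contains only base letters, so it makes no `\w` check.

```perl
use strict;
use warnings;

# Hypothetical non-recursive replacement for find_coding().
# Assumption: $seq holds only base letters, so no \w check is made.
sub find_coding_iter {
    my ($seq) = @_;
    my @regions;

    # The zero-width lookahead visits every ATG, including ones that lie
    # inside an earlier coding region.
    while ($seq =~ m{ (?= ATG ) }xmsg) {
        my $start = pos($seq);             # base-0 offset of this start codon
        my $tail  = substr($seq, $start);  # a copy, so pos($seq) is untouched

        # Visit every stop codon in the tail, nearest first.
        while ($tail =~ m{ (?= TAG | TAA | TGA ) }xmsg) {
            my $stop = pos($tail);
            next if $stop < 3;             # stop codon must not overlap the ATG
            push @regions, [ substr($tail, 0, $stop + 3), $start ];
        }
    }

    return @regions;    # explicit return: a list of [ region, offset ] pairs
}

my @regions = find_coding_iter('AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA');
printf "%2d  %s\n", $_->[1], $_->[0] for @regions;
```

For the original sequence this reports four regions: three starting at offset 1 (shortest first) and ATGATCTAA at offset 40. Copying the tail for every start codon is wasteful on long sequences, so this is meant to show the control flow, not to be fast.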