Using Recursion to Find DNA Sequences

clueless_perl has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Using Recursion to Find DNA Sequences by AnomalousMonk (Archbishop) on Oct 29, 2017 at 15:05 UTC
`my $regex = '\w+(ATG\wT(AG\|AA\|GA))\w+';`* Going wrong: The `$regex` you have defined matches the entire string! You then take this match (the entire string) and feed it to `find_coding()` again: deep recursion. (Update: See this for more detailed discussion of recursion problems.) `c:\@Work\Perl\monks>perl -wMstrict -le "my $inputseq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'; print qq{'$inputseq'}; ;; my $regex = '\w+(ATG\wT(AG\|AA\|GA))\w+'; $inputseq =~ /($regex)/g; print qq{'$1'}; " 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA' 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'` [download] A small point: Please don't define regexes as strings: A number of subtle bugs can be encountered. Use `qr//` (see perlop). Another small point: Please try to avoid using capturing groups in `qr//` objects (especially absolutely unnecessarily, as in this case). It makes counting groups at the top level a headache. Please define what you mean by "... all possible DNA sequences ..."* Update: Several small edits for clarity and de-typo-ization. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^2: Using Recursion to Find DNA Sequences by clueless_perl (Initiate) on Oct 29, 2017 at 15:13 UTC
Thank you for pointing out my unclear question. What I mean by all DNA sequences I mean any coding region from the original DNA sequence, so anything that begins with a start codon (ATG) and ends with stop codon (TAG, TAA, or TGA). My problem is that some of these coding regions overlap in the original sequence so when I just try to match the sequence to the regex I only get one sequence, not the overlapping ones. There should be a total of 4 coding regions in the original sequence. I originally had tried changing my regex to ATG\w*T(AG\|AA\|GA) but I still get deep recursion. Wouldn't that regex just be matching the coding regions I want? When I got deep recursion, I tried changing my regex, but I see that I just made it match the entire string. Is the problem still with my regex or am I using recursion incorrectly?	[reply]
Re^3: Using Recursion to Find DNA Sequences by AnomalousMonk (Archbishop) on Oct 29, 2017 at 15:39 UTC
Try something like: c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my $seq = 'xABCxABCxxxWXYxWXZxxxABCxxWXYx'; ;; my $subseq = qr{ ABC \w* (?: WXY \| WXZ) }xms; ;; my @all = find_all($seq, $subseq); dd \@all; ;; ;; sub find_all { my ($seq, $regex) = @_; ;; local our @hits; use re 'eval'; $seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg; ;; return @hits; } " [ ["ABCxABCxxxWXYxWXZxxxABCxxWXY", 1], ["ABCxABCxxxWXYxWXZ", 1], ["ABCxABCxxxWXY", 1], ["ABCxxxWXYxWXZxxxABCxxWXY", 5], ["ABCxxxWXYxWXZ", 5], ["ABCxxxWXY", 5], ["ABCxxWXY", 21], ] [download] (I'm just using `...ABCxxWXY...` to make the permutations and overlaps clear.) (Update: The number that is the second item in each array reference returned is the base-0 offset of the start of the matching subsequence.) Update: Using your original sequence: c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'; ;; my $subseq = qr{ ATG \w* (?: TAG \| TAA \| TGA) }xms; ;; my @all = find_all($seq, $subseq); dd \@all; ;; ;; sub find_all { my ($seq, $regex) = @_; ;; local our @hits; use re 'eval'; $seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg; ;; return @hits; } " [ ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAA", 1], ["ATGATCTAA", 40], ] [download] This works with Perl 5.8+ regexes. What version of Perl are you using — it might make a difference in future? Update 2: Remembering that DNA sequences may sometimes be loooong, it may be advantageous to pass the sequence by reference. Note that both the function and the function invocation must change. c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump qw(dd); ;; my $seq = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA'; ;; my $subseq = qr{ ATG \w* (?: TAG \| TAA \| TGA) }xms; ;; my @all = find_all(\$seq, $subseq); dd \@all; ;; ;; sub find_all { my ($sr_seq, $regex) = @_; ;; local our @hits; use re 'eval'; $$sr_seq =~ m{ ($regex) (?{ push @hits, [ $^N, $-[1] ] }) (?!) }xmsg; ;; return @hits; } " [ ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA", 1], ["ATGGTTTCTCCCATCTCTCCATCGGCATAA", 1], ["ATGATCTAA", 40], ] [download] Still runs under Perl 5.8. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^3: Using Recursion to Find DNA Sequences by AnomalousMonk (Archbishop) on Oct 29, 2017 at 16:19 UTC
When I got deep recursion, I tried changing my regex, but I see that I just made it match the entire string. Is the problem still with my regex or am I using recursion incorrectly? I don't see any need for recursion at all. As far as I can see, everything you want to do can be done in a properly constituted regex. There may be speed considerations that make a different approach preferable, but they can only be addressed by careful specification and benchmarking. Update: To directly address your question about recursion: `sub find_coding { my ( $seq ) = @_; while ( $seq =~ /($regex)/g ) { my $match = $1; my $sequence = find_coding($match); my $allmatch = find_coding($sequence); print "Coding region: $allmatch", "\n"; } }` [download] When you find any match of `$seq` against the regex, you take the matching subsequence `$1` and feed it back to the function again: `my $match = $1;` `my $sequence = find_coding($match);` Of course, this subsequence matches again because it already matched, and we're off to infinity... and beyond! Note also that the `my $sequence = find_coding($match);` statement and the following `my $allmatch = find_coding($sequence);` statement (were it ever to be executed) are meaningless because the `find_coding()` function doesn't return anything meaningful; as best I can see, it would return (if it ever did) the return value of the print statement at the end of the `while`-loop. (Update: Changed this paragraph to address both statements.) Update: ...the `find_coding()` function ... would return ... the return value of the `print` statement at the end of the `while`-loop. On second thought, I take this back. Without an explicit return statement, a function returns the value of the last expression executed in the function. My idea was that the print statement at the end of the `while`-loop would be that last expression. In fact, if the function didn't infinitely recurse, the last expression executed in the function would be the `$seq =~ /($regex)/g` regex in the conditional of the `while`-loop: when this expression evaluates false, the loop will exit and, as there is no code in the function after the loop, the function will exit and return this ever-false value. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re: Using Recursion to Find DNA Sequences by Cristoforo (Curate) on Oct 29, 2017 at 18:56 UTC
Here is an approach that will get the results. It uses more of an iterative method. `#!/usr/bin/perl use strict; use warnings; my $seq = "AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA"; my @beg; push @beg, $-[0] while $seq =~ /ATG/g; my @end; push @end, $+[0] while $seq =~ /T(?:AG\|AA\|GA)/g; my @data; for my $i (@beg) { for my $j (@end) { next if $j-$i < 6; push @data, substr $seq, $i, $j-$i; } } print "$_\n" for @data;` [download] Output: `ATGGTTTCTCCCATCTCTCCATCGGCATAA ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA ATGATCTAA` [download]	[reply] [d/l] [select]
Re: Using Recursion to Find DNA Sequences by johngg (Canon) on Oct 29, 2017 at 15:31 UTC
Please could you confirm that these are the matches that you are hoping for. `AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA ^^^------------------------^^^ ^^^-------------------------------------^^^ ^^^------------------------------------------^^^ ^^^---^^^` [download] Those are the ones I can spot using just my eyes but I might be misunderstanding your rules. Cheers, JohnGG	[reply] [d/l]
Re^2: Using Recursion to Find DNA Sequences by clueless_perl (Initiate) on Oct 29, 2017 at 18:43 UTC
Thanks everyone! The matches that I am looking for (which JohnGG correctly highlighted) are: ATGGTTTCTCCCATCTCTCCATCGGCATAA ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA ATGATCTAA	[reply]
Re^3: Using Recursion to Find DNA Sequences by AnomalousMonk (Archbishop) on Oct 29, 2017 at 19:10 UTC
To get the shortest-to-longest length ordering from the approach I outlined here, just use a `?` "lazy" modifier on the quantifier of the `\w` "padding" sub-pattern making it `\w?`: `qr{ ATG \w*? (?: TAG \| TAA \| TGA) }xms;` Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]