Here is what I came up with yesterday in the chatterbox while discussing theorbtwo's solution.
dr_jgbn confirmed that the shorter string is composed only of valid codons (sequences of three characters each representing a nucleatide base) and that introns will not be inserted into the middle of a codon (despite the example not meeting these criteria).
theorbtwo's solution would put the introns as soon as possible (which would not make the introns as long as possible) and this would lead to lots of backtracking before the solution was found. Consider 123456 vs. 1235432456 which theorbtwo's solution splits like 12(354)3(2)456 while changing (.*) to (.*?) would split like 123(5)4(324)56. But the longest possible intron is found like this 123(5432)456.
So I went with non-greedy (.*?) as it doesn't have to backtrack (except trivially) unless the strings don't actually match (I got the impression that the strings were already known to match, but I'll cover that later). And by keeping codons together, the splitting is sensible (it should match how a cell would interpret the DNA/RNA, I'd think).
which produces:#!/usr/bin/perl use strict; while( 1 ) { my $valid= <DATA> or exit( 0 ); chomp( $valid ); my $dirty= <DATA> or die "Missing second sequence"; chomp( $dirty ); $valid =~ s#(...)#(.*?)\Q$1\E#g; my @intron= $dirty =~ /^$valid(.*?)/; for my $intron ( @intron ) { next if ! length($intron); print "$intron\n"; } } __END__ ATGGAGTCGACGAATTTGAAGAAT GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAAT
GCACCG TAGG ATCTC TCG
If we aren't sure whether the strings actually do match, then you could use an experimental regex feature to disable backtracking at certain points (that we know is doomed to fail). But I'll just append the shorter string onto the end of the longer string. This ensures a match and it is easy to tell if the original string would not have matched:
#!/usr/bin/perl use strict; while( 1 ) { my $valid= <DATA> or exit( 0 ); chomp( $valid ); my $dirty= <DATA> or die "Missing second sequence"; chomp( $dirty ); print "Comparing $valid\nto $dirty\n"; my $dlen= length( $dirty ); $dirty .= $valid; $valid =~ s#(...)#(.*?)\Q$1\E#g; my @intron= $dirty =~ /^$valid(.*?)/ or die "Ouch! We backtracked"; if( $dlen < $+[0] ) { print "No match.\n"; } else { for my $intron ( @intron ) { next if ! length($intron); print "$intron\n"; } } } __END__ ATGGAGTCGACGAATTTGAAGAAT GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAAT ATGGAGTCGACGAATTTGAAGAAT GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGACT
Comparing ATGGAGTCGACGAATTTGAAGAAT to GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAAT GCACCG TAGG ATCTC TCG Comparing ATGGAGTCGACGAATTTGAAGAAT to GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGACT No match.
Now I assume that these DNA/RNA strings can get very, very long. This solution should be pretty efficient but it would compile a very, very large regular expression. This might require too much memory. Luckily it is easy (and still efficient) to do the problem in chunks to reduce memory requirements.
which produces#!/usr/bin/perl use strict; my $maxCodons= 2; while( 1 ) { my $valid= <DATA> or exit( 0 ); chomp( $valid ); my $dirty= <DATA> or die "Missing second sequence"; chomp( $dirty ); print "Comparing $valid\nto $dirty\n"; my $dlen= length( $dirty ); $dirty .= $valid; my $pos= 0; while( "" ne $valid ) { my $part= substr( $valid, 0, 3*$maxCodons, "" ); $part =~ s#(...)#(.*?)\Q$1\E#g; pos($dirty)= $pos; my @intron= $dirty =~ /\G$part/ or die "Ouch! We backtracked"; $pos= $+[0]; if( $dlen < $pos ) { print "Match was not complete.\n"; $valid= ""; } else { for my $intron ( @intron ) { next if ! length($intron); print "$intron\n"; } } if( "" eq $valid && $pos < $dlen ) { print substr( $dirty, $pos, $dlen-$pos ), $/; } } } __END__ ATGGAGTCGACGAATTTGAAGAAT GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAATTGG ATGGAGTCGACGAATTTGAAGAAT GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGACTTGG
Comparing ATGGAGTCGACGAATTTGAAGAAT to GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAATTGG GCACCG TAGG ATCTC TCG TGG Comparing ATGGAGTCGACGAATTTGAAGAAT to GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGACTTGG GCACCG TAGG ATCTC Match was not complete.
If you set $maxCodons to 1 then you end up doing about the same as this index version:
which produces the same results.#!/usr/bin/perl use strict; while( 1 ) { my $valid= <DATA> or exit( 0 ); chomp( $valid ); my $dirty= <DATA> or die "Missing second sequence"; chomp( $dirty ); print "Comparing $valid\nto $dirty\n"; my $pos= 0; while( "" ne $valid ) { my $codon= substr( $valid, 0, 3, "" ); my $next= index( $dirty, $codon, $pos ); if( $next < 0 ) { print "Match was not complete.\n"; last; } if( $pos < $next ) { print substr( $dirty, $pos, $next-$pos ), $/; } $pos= $next + 3; if( "" eq $valid && $pos < length($dirty) ) { print substr( $dirty, $pos ), $/; } } } __END__ ATGGAGTCGACGAATTTGAAGAAT GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAATTGG ATGGAGTCGACGAATTTGAAGAAT GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGACTTGG
I think I may have one or more off-by-one errors left in the above code despite the quick tests seeming to indicate otherwise. (:
- tyeIn reply to Re: Complicated pattern match (from CB)
by tye
in thread Complicated pattern match
by dr_jgbn
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |