comment on

Here is what I came up with yesterday in the chatterbox while discussing theorbtwo's solution.

dr_jgbn confirmed that the shorter string is composed only of valid codons (sequences of three characters each representing a nucleatide base) and that introns will not be inserted into the middle of a codon (despite the example not meeting these criteria).

theorbtwo's solution would put the introns as soon as possible (which would not make the introns as long as possible) and this would lead to lots of backtracking before the solution was found. Consider 123456 vs. 1235432456 which theorbtwo's solution splits like 12(354)3(2)456 while changing (.*) to (.*?) would split like 123(5)4(324)56. But the longest possible intron is found like this 123(5432)456.

So I went with non-greedy (.*?) as it doesn't have to backtrack (except trivially) unless the strings don't actually match (I got the impression that the strings were already known to match, but I'll cover that later). And by keeping codons together, the splitting is sensible (it should match how a cell would interpret the DNA/RNA, I'd think).

#!/usr/bin/perl
use strict;

while(  1  ) {
    my $valid= <DATA>  or  exit( 0 );
    chomp( $valid );
    my $dirty= <DATA>  or  die "Missing second sequence";
    chomp( $dirty );
    $valid =~ s#(...)#(.*?)\Q$1\E#g;
    my @intron=  $dirty =~ /^$valid(.*?)/;
    for my $intron (  @intron  ) {
        next   if  ! length($intron);
        print "$intron\n";
    }
}
__END__
ATGGAGTCGACGAATTTGAAGAAT
GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAAT
[download]

which produces:

GCACCG
TAGG
ATCTC
TCG

If we aren't sure whether the strings actually do match, then you could use an experimental regex feature to disable backtracking at certain points (that we know is doomed to fail). But I'll just append the shorter string onto the end of the longer string. This ensures a match and it is easy to tell if the original string would not have matched:

#!/usr/bin/perl
use strict;

while(  1  ) {
    my $valid= <DATA>  or  exit( 0 );
    chomp( $valid );
    my $dirty= <DATA>  or  die "Missing second sequence";
    chomp( $dirty );
    print "Comparing $valid\nto $dirty\n";
    my $dlen= length( $dirty );
    $dirty .= $valid;
    $valid =~ s#(...)#(.*?)\Q$1\E#g;
    my @intron=  $dirty =~ /^$valid(.*?)/
        or  die "Ouch!  We backtracked";
    if(  $dlen < $+[0]  ) {
        print "No match.\n";
    } else {
        for my $intron (  @intron  ) {
            next   if  ! length($intron);
            print "$intron\n";
        }
    }
}
__END__
ATGGAGTCGACGAATTTGAAGAAT
GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAAT
ATGGAGTCGACGAATTTGAAGAAT
GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGACT
[download]

Comparing ATGGAGTCGACGAATTTGAAGAAT
to GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAAT
GCACCG
TAGG
ATCTC
TCG
Comparing ATGGAGTCGACGAATTTGAAGAAT
to GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGACT
No match.

Now I assume that these DNA/RNA strings can get very, very long. This solution should be pretty efficient but it would compile a very, very large regular expression. This might require too much memory. Luckily it is easy (and still efficient) to do the problem in chunks to reduce memory requirements.

#!/usr/bin/perl
use strict;

my $maxCodons= 2;

while(  1  ) {
    my $valid= <DATA>  or  exit( 0 );
    chomp( $valid );
    my $dirty= <DATA>  or  die "Missing second sequence";
    chomp( $dirty );
    print "Comparing $valid\nto $dirty\n";
    my $dlen= length( $dirty );
    $dirty .= $valid;
    my $pos= 0;
    while(  "" ne $valid  ) {
        my $part= substr( $valid, 0, 3*$maxCodons, "" );
        $part =~ s#(...)#(.*?)\Q$1\E#g;
        pos($dirty)= $pos;
        my @intron=  $dirty =~ /\G$part/
            or  die "Ouch!  We backtracked";
        $pos= $+[0];
        if(  $dlen < $pos  ) {
            print "Match was not complete.\n";
            $valid= "";
        } else {
            for my $intron (  @intron  ) {
                next   if  ! length($intron);
                print "$intron\n";
            }
        }
        if(  "" eq $valid  &&  $pos < $dlen  ) {
            print substr( $dirty, $pos, $dlen-$pos ), $/;
        }
    }
}
__END__
ATGGAGTCGACGAATTTGAAGAAT
GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAATTGG
ATGGAGTCGACGAATTTGAAGAAT
GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGACTTGG
[download]

which produces

Comparing ATGGAGTCGACGAATTTGAAGAAT
to GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAATTGG
GCACCG
TAGG
ATCTC
TCG
TGG
Comparing ATGGAGTCGACGAATTTGAAGAAT
to GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGACTTGG
GCACCG
TAGG
ATCTC
Match was not complete.

If you set $maxCodons to 1 then you end up doing about the same as this index version:

#!/usr/bin/perl
use strict;

while(  1  ) {
    my $valid= <DATA>  or  exit( 0 );
    chomp( $valid );
    my $dirty= <DATA>  or  die "Missing second sequence";
    chomp( $dirty );
    print "Comparing $valid\nto $dirty\n";
    my $pos= 0;
    while(  "" ne $valid  ) {
        my $codon= substr( $valid, 0, 3, "" );
        my $next= index( $dirty, $codon, $pos );
        if(  $next < 0  ) {
            print "Match was not complete.\n";
            last;
        }
        if(  $pos < $next  ) {
            print substr( $dirty, $pos, $next-$pos ), $/;
        }
        $pos= $next + 3;
        if(  "" eq $valid  &&  $pos < length($dirty)  ) {
            print substr( $dirty, $pos ), $/;
        }
    }
}
__END__
ATGGAGTCGACGAATTTGAAGAAT
GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAATTGG
ATGGAGTCGACGAATTTGAAGAAT
GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGACTTGG
[download]

which produces the same results.

I think I may have one or more off-by-one errors left in the above code despite the quick tests seeming to indicate otherwise. (:

- tye

In reply to Re: Complicated pattern match (from CB) by tye
in thread Complicated pattern match by dr_jgbn

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.