Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Fast common substring matching

by GrandFather (Saint)
on Aug 21, 2005 at 02:12 UTC ( [id://485464]=CUFP: print w/replies, xml ) Need Help??

This code was written as a solution to the problem posed in Search for identical substrings. As best I can tell it runs about 3 million times faster than the original code.

The code reads a series of strings and searches them for the longest substring between any pair of strings. In the original problem there were 300 strings about 3K long each. A test set comprising 6 strings was used to test the code with the result given below.

Someone with Perl module creation and publication experience could wrap this up and publish it if they wish.

use strict; use warnings; use Time::HiRes; use List::Util qw(min max); my $allLCS = 1; my $subStrSize = 8; # Determines minimum match length. Should be a pow +er of 2 # and less than half the minimum interesting match length. The larger +this value # the faster the search runs. if (@ARGV != 1) { print "Finds longest matching substring between any pair of test s +trings\n"; print "the given file. Pairs of lines are expected with the first +of a\n"; print "pair being the string name and the second the test string." +; exit (1); } # Read in the strings my @strings; while (<>) { chomp; my $strName = $_; $_ = <>; chomp; push @strings, [$strName, $_]; } my $lastStr = @strings - 1; my @bestMatches = [(0, 0, 0, 0, 0)]; # Best match details my $longest = 0; # Best match length so far (unexpanded) my $startTime = [Time::HiRes::gettimeofday ()]; # Do the search for (0..$lastStr) { my $curStr = $_; my @subStrs; my $source = $strings[$curStr][1]; my $sourceName = $strings[$curStr][0]; for (my $i = 0; $i < length $source; $i += $subStrSize) { push @subStrs, substr $source, $i, $subStrSize; } my $lastSub = @subStrs-1; for (($curStr+1)..$lastStr) { my $targetStr = $_; my $target = $strings[$_][1]; my $targetLen = length $target; my $targetName = $strings[$_][0]; my $localLongest = 0; my @localBests = [(0, 0, 0, 0, 0)]; for my $i (0..$lastSub) { my $offset = 0; while ($offset < $targetLen) { $offset = index $target, $subStrs[$i], $offset; last if $offset < 0; my $matchStr1 = substr $source, $i * $subStrSize; my $matchStr2 = substr $target, $offset; ($matchStr1 ^ $matchStr2) =~ /^\0*/; my $matchLen = $+[0]; next if $matchLen < $localLongest - $subStrSize + 1; $localLongest = $matchLen; my @test = ($curStr, $targetStr, $i * $subStrSize, $offset, $m +atchLen); @test = expandMatch (@test); my $dm = $test[4] - $localBests[-1][4]; @localBests = () if $dm > 0; push @localBests, [@test] if $dm >= 0; $offset = $test[3] + $test[4]; next if $test[4] < $longest; $longest = $test[4]; $dm = $longest - $bestMatches[-1][4]; next if $dm < 0; @bestMatches = () if $dm > 0; push @bestMatches, [@test]; } continue {++$offset;} } next if ! $allLCS; if (! @localBests) { print "Didn't find LCS for $sourceName and $targetName\n"; next; } for (@localBests) { my @curr = @$_; printf "%03d:%03d L[%4d] (%4d %4d)\n", $curr[0], $curr[1], $curr[4], $curr[2], $curr[3]; } } } print "Completed in " . Time::HiRes::tv_interval ($startTime) . "\n"; for (@bestMatches) { my @curr = @$_; printf "Best match: %s - %s. %d characters starting at %d and %d.\n" +, $strings[$curr[0]][0], $strings[$curr[1]][0], $curr[4], $curr[2], +$curr[3]; } sub expandMatch { my ($index1, $index2, $str1Start, $str2Start, $matchLen) = @_; my $maxMatch = max (0, min ($str1Start, $subStrSize + 10, $str2Start)) +; my $matchStr1 = substr ($strings[$index1][1], $str1Start - $maxMatch, +$maxMatch); my $matchStr2 = substr ($strings[$index2][1], $str2Start - $maxMatch, +$maxMatch); ($matchStr1 ^ $matchStr2) =~ /\0*$/; my $adj = $+[0] - $-[0]; $matchLen += $adj; $str1Start -= $adj; $str2Start -= $adj; return ($index1, $index2, $str1Start, $str2Start, $matchLen); }

Output using bioMan's six string sample:

Completed in 0.010486 Best match: >string 1 - >string 3 . 1271 characters starting at 82 an +d 82.
Updates: fixed a few bugs. Added print all LCS's option.
Fixed readmore tags. Fixed all remaining known bugs. Added reporting for LCS's between each pair and for all LCS's where there is more than one LCS of the maximum length. Code now 20% smaller and maybe 10% slower.

Perl is Huffman encoded by design.

Replies are listed 'Best First'.
Re: Fast common substring matching
by BrowserUk (Patriarch) on Aug 24, 2005 at 13:39 UTC

    You're still not locating all equal-length LCSs.

    P:\test>type duptest.dat >string1 AAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTTTTTTTTT >string2 TTTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAA >string3 TTTTTTTTTTTTTTTTTxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAAAAAAAAAAA P:\test>gf duptest.gf 000:001 L[ 17] ( 0 47) 000:002 L[ 17] ( 0 47) 001:002 L[ 17] ( 0 0) Completed in 0.001205 Best match: >string1 - >string2. 17 characters starting at 0 and 47. Best match: >string1 - >string3. 17 characters starting at 0 and 47. Best match: >string2 - >string3. 17 characters starting at 0 and 0.

    Each pairing contains two equal matches:

    P:\test>484593-5 duptest.dat 000:001 L[017] (0000,0047)'TTTTTTTTTTTTTTTTT' (0047,0000)'AAAAAAAAAAAAAAAAA' 000:002 L[017] (0000,0000)'TTTTTTTTTTTTTTTTT' (0047,0047)'AAAAAAAAAAAAAAAAA' 001:002 L[017] (0000,0047)'AAAAAAAAAAAAAAAAA' (0047,0000)'TTTTTTTTTTTTTTTTT' 3 trials of duptest.dat ( 323us total), 107us/trial

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

      It's a special case and I don't think it is a problem in practice. It only happens when there is a block longer than $subStrSize (the minimum match quanta) with a repeated pattern. Test strings and results are shown below:

      >string1 01010101010ddddddddddd01234566789a12345yy >string2 0123456789b12345eeeeeeeeeeeex01010101010x >string3 0123456789c12345ffffffffffff01010101010zz 000:001 L[ 11] ( 0 29) 000:002 L[ 11] ( 0 28) 001:002 L[ 10] ( 0 0) Completed in 0.002126 Best match: >string1 - >string2. 11 characters starting at 0 and 29. Best match: >string1 - >string3. 11 characters starting at 0 and 28.

      Perl is Huffman encoded by design.

        Okay, but I'm not sure whether you can safetly discount the possibility of long sequences of repeated characters, even in biodata. The following shows a scan of the complete drosophila (fruit fly) genome looking for sequences of repeated DNA characters:

        >perl -nlwe" print $1 while m[((.)\2+)]g" na_clones.dros.RELEASE2.5 | +( More? perl -nle"$b{ length($_) }{ chop $_}++ } { printf qq[length %d :[ A:%d C:%d G:%d T:%d ]\n], $_, @{$b{$_}}{'A','C','G','T'} for sort{$b<=>$a} keys %b" ) length 50 :[ A:1 C:0 G:0 T:0 ] length 47 :[ A:1 C:0 G:0 T:0 ] length 45 :[ A:1 C:0 G:0 T:0 ] length 44 :[ A:0 C:0 G:0 T:0 ] length 43 :[ A:0 C:0 G:0 T:1 ] length 42 :[ A:1 C:0 G:0 T:0 ] length 41 :[ A:1 C:0 G:0 T:1 ] length 40 :[ A:1 C:0 G:0 T:0 ] length 39 :[ A:0 C:0 G:0 T:0 ] length 38 :[ A:1 C:0 G:0 T:0 ] length 37 :[ A:2 C:0 G:0 T:0 ] length 36 :[ A:5 C:0 G:0 T:4 ] length 35 :[ A:5 C:0 G:0 T:3 ] length 34 :[ A:2 C:0 G:0 T:3 ] length 33 :[ A:2 C:0 G:0 T:3 ] length 32 :[ A:1 C:0 G:0 T:1 ] length 31 :[ A:5 C:0 G:0 T:5 ] length 30 :[ A:4 C:0 G:0 T:5 ] length 29 :[ A:10 C:0 G:0 T:6 ] length 28 :[ A:12 C:0 G:0 T:13 ] length 27 :[ A:19 C:0 G:0 T:12 ] length 26 :[ A:26 C:0 G:0 T:28 ] length 25 :[ A:35 C:1 G:1 T:34 ] length 24 :[ A:48 C:0 G:0 T:47 ] length 23 :[ A:63 C:0 G:1 T:59 ] length 22 :[ A:78 C:1 G:0 T:84 ] length 21 :[ A:109 C:5 G:3 T:91 ] length 20 :[ A:126 C:5 G:6 T:144 ] length 19 :[ A:188 C:17 G:15 T:148 ] length 18 :[ A:240 C:25 G:26 T:314 ] length 17 :[ A:411 C:45 G:55 T:389 ] length 16 :[ A:647 C:73 G:78 T:656 ] length 15 :[ A:905 C:122 G:133 T:886 ] length 14 :[ A:1267 C:191 G:216 T:1208 ] length 13 :[ A:1813 C:255 G:260 T:1805 ] length 12 :[ A:2702 C:353 G:353 T:2800 ] length 11 :[ A:4209 C:380 G:404 T:4184 ] length 10 :[ A:7181 C:446 G:428 T:7343 ] length 9 :[ A:9964 C:386 G:437 T:10011 ] length 8 :[ A:14581 C:699 G:601 T:14514 ] length 7 :[ A:28336 C:2656 G:2644 T:28782 ] length 6 :[ A:90274 C:11644 G:11543 T:89663 ] length 5 :[ A:308601 C:52038 G:52017 T:309011 ] length 4 :[ A:889188 C:199827 G:200919 T:885567 ] length 3 :[ A:2424590 C:938297 G:935629 T:2422964 ] length 2 :[ A:6232754 C:4721693 G:4718156 T:6219367 ]

        As you can see, at lengths over 16, they are still a fairly frequent occurence. And even at lengths > 32, there are still enough to be worrying.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Fast common substring matching
by BrowserUk (Patriarch) on Aug 27, 2005 at 06:20 UTC

    The latest revisions to your code have fixed most of the problems I identified earlier, but running it over my test suite did turn up this anomoly:

    P:\test>GF n2l3172.dat 000:001 L[ 12] ( 364 23) 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] (1102 1008) 000:001 L[ 12] (1124 2290) 000:001 L[ 12] (1506 2529) 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1859 224) 000:001 L[ 12] (2046 2360) 000:001 L[ 12] (2399 1226) 000:001 L[ 12] (2480 401) 000:001 L[ 12] (2494 540) 000:001 L[ 12] (3023 1102) Completed in 5.292793 Best match: >1 - >2. 12 characters starting at 364 and 23. Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 1102 and 1008. Best match: >1 - >2. 12 characters starting at 1124 and 2290. Best match: >1 - >2. 12 characters starting at 1506 and 2529. Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1859 and 224. Best match: >1 - >2. 12 characters starting at 2046 and 2360. Best match: >1 - >2. 12 characters starting at 2399 and 1226. Best match: >1 - >2. 12 characters starting at 2480 and 401. Best match: >1 - >2. 12 characters starting at 2494 and 540. Best match: >1 - >2. 12 characters starting at 3023 and 1102.

    As you can see, the matches I've marked with * are duplicates. Here's the output from my program:

    P:\test>484593-5 n2l3172.dat 000:001 L[012] ( 364, 23)'CAGGAGCGGGCG' (1102,1008)'ACAGCCACGAAA' (1124,2290)'AGCAGGAAGGAG' (1506,2529)'GCGCGAAGCAAA' (1859, 224)'AACAGGAAGGGA' (2046,2360)'CAGAGCCCACAG' (2399,1226)'GGCGCAACCCGG' (2480, 401)'GGGGCCGGCAAC' (2494, 540)'CGCAGCAGAAAC' (3023,1102)'GGCAGCAAGGGC' 1 trial of n2l3172.dat (132.223ms total), 132.223ms/trial

    As the dataset lines are 3k in length, I've posted it on my pad rather than here.

    Sorry for the length of the lines, but they are randomly generated and I haven't yet worked out what it is about these two lines that induces the anomolous behaviour?

    Update: Okay. I managed to find a shorter testcase that demonstrates the duplicate matches problem. Again, I've marked the duplicates with *:

    P:\test\LCS>type temp >string 1 AACCAGGGTCATAGGAGCCGGCACTATGATCTCTAGTGTTTTAGGAGTAGTGAACCACCTTCCAGACACG +GCAGCCAACATTCCGTCTATCGCCTACTCG >string 2 TAGCACGGTGATAAGCTACAATTAACGCCATCTCACGCGCGCGAGTGTGTTCTACTACCCATTATATAAG +ATGAAAATGAACACTAAGGGCTGGGGCCCT P:\test\LCS>gf temp 000:001 L[ 5] ( 21 81) 000:001 L[ 5] ( 21 81)* 000:001 L[ 5] ( 28 29) 000:001 L[ 5] ( 34 43) 000:001 L[ 5] ( 35 46) 000:001 L[ 5] ( 50 77) 000:001 L[ 5] ( 66 3) 000:001 L[ 5] ( 66 3)* 000:001 L[ 5] ( 93 51) 000:001 L[ 5] ( 93 51)* Completed in 0.009414 Best match: >string 1 - >string 2. 5 characters starting at 21 and 81. Best match: >string 1 - >string 2. 5 characters starting at 21 and 81. Best match: >string 1 - >string 2. 5 characters starting at 28 and 29. Best match: >string 1 - >string 2. 5 characters starting at 34 and 43. Best match: >string 1 - >string 2. 5 characters starting at 35 and 46. Best match: >string 1 - >string 2. 5 characters starting at 50 and 77. Best match: >string 1 - >string 2. 5 characters starting at 66 and 3. Best match: >string 1 - >string 2. 5 characters starting at 66 and 3. Best match: >string 1 - >string 2. 5 characters starting at 93 and 51. Best match: >string 1 - >string 2. 5 characters starting at 93 and 51. P:\test\LCS>484593-5 temp 000:001 L[005] ( 21, 81)'CACTA' ( 28, 29)'ATCTC' ( 34, 43)'AGTGT' ( 35, 46)'GTGTT' ( 50, 77)'TGAAC' ( 66, 3)'CACGG' ( 93, 51)'CTACT' 1 trial of temp ( 147us total), 147us/trial

    Update2: This is the shortest test that I've found that reproduces the problem. It may help you to isolate the cause.

    P:\test\LCS>type temp >string 1 AAATATCTCGATAATTAGTA >string 2 TGACTAGATCATATAACCAC P:\test\LCS>gf temp 000:001 L[ 4] ( 2 10) 000:001 L[ 4] ( 2 10) 000:001 L[ 4] ( 10 12) Completed in 0.00106 Best match: >string 1 - >string 2. 4 characters starting at 2 and 10. Best match: >string 1 - >string 2. 4 characters starting at 2 and 10. Best match: >string 1 - >string 2. 4 characters starting at 10 and 12. P:\test\LCS>484593-5 temp 000:001 L[004] ( 2, 10)'ATAT' ( 10, 12)'ATAA' 1 trial of temp ( 36us total), 36us/trial

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

      Another update fixing the last round of issues.


      Perl is Huffman encoded by design.
        I came up with an algorithm inspired by bzip's algorithm of generating all substrings and then sorting them. I tried yours on a list of 20 strings of 1000 chars, and it ran in 153 seconds. Mine ran in 0.67 seconds, yielding the same results. 30 strings of 3000 chars runs in 20.3 seconds on mine; scaling up from there starts to get painful, but I would guess the OP's requirement of 300 strings of 3000 chars would run in under an hour, if it had plenty of memory (there will be 900,000 strings averaging 1500 chars in length).

        Give it a whirl.


        Caution: Contents may have been coded under pressure.
Re: Fast common substring matching
by GrandFather (Saint) on Aug 26, 2005 at 23:11 UTC

    The following is a copy of Re^9: Fast common substring matching posted by bioMan, but the nesting was so deep I though I'd "bring it to the top".

    I have done a quick test. Quick because of your blazing fast algorithm. My runtimes for 3 to 200 strings is given below. The times come from the Time::HiRes::gettimeofday function in your program. I set $subStrSize = 256.

    3,0.001355 4,0.002366 5,0.003744 6,0.005187 7,0.006808 8,0.008807 9,0.011459 10,0.014172 11,0.017328 12,0.020473 13,0.023491 14,0.027905 15,0.031813 16,0.036426 17,0.043177 18,0.067228 19,0.052098 20,0.059449 30,0.132611 40,0.235241 50,0.367828 60,0.510851 70,0.695534 80,0.913879 90,1.178992 100,1.427899 125,2.254988 150,3.269179 200,5.831225

    These results are from a second copy of your program that I created. It appears my first copy still has a bug.

    Yes, there must be a bug in my original copy. Running my full data set with my backup copy of your program gave a runtime of 14.759568. I'd say that's a little better than three years! I'll check the results to see how they compare with some of my old results from limited data sets. That check could take a little time :-)

    The problem could be with the way PerlIDE runs scripts. All the times above are from the command line (Win2000).

    I ran the full data set with the second copy of the program from PerlIDE and got a runtime of 15.147881. So the problem is with my first copy of your program.

    I really like your algorithm. It's speed will allow me to modify my original data and test a couple of theories. You've got a real winner.


    Perl is Huffman encoded by design.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://485464]
Approved by Roger
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (7)
As of 2024-03-19 02:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found