http://qs1969.pair.com?node_id=487102


in reply to Fast common substring matching

The latest revisions to your code have fixed most of the problems I identified earlier, but running it over my test suite did turn up this anomoly:

P:\test>GF n2l3172.dat 000:001 L[ 12] ( 364 23) 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] ( 364 23)* 000:001 L[ 12] (1102 1008) 000:001 L[ 12] (1124 2290) 000:001 L[ 12] (1506 2529) 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1506 2529)* 000:001 L[ 12] (1859 224) 000:001 L[ 12] (2046 2360) 000:001 L[ 12] (2399 1226) 000:001 L[ 12] (2480 401) 000:001 L[ 12] (2494 540) 000:001 L[ 12] (3023 1102) Completed in 5.292793 Best match: >1 - >2. 12 characters starting at 364 and 23. Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 364 and 23.* Best match: >1 - >2. 12 characters starting at 1102 and 1008. Best match: >1 - >2. 12 characters starting at 1124 and 2290. Best match: >1 - >2. 12 characters starting at 1506 and 2529. Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1506 and 2529.* Best match: >1 - >2. 12 characters starting at 1859 and 224. Best match: >1 - >2. 12 characters starting at 2046 and 2360. Best match: >1 - >2. 12 characters starting at 2399 and 1226. Best match: >1 - >2. 12 characters starting at 2480 and 401. Best match: >1 - >2. 12 characters starting at 2494 and 540. Best match: >1 - >2. 12 characters starting at 3023 and 1102.

As you can see, the matches I've marked with * are duplicates. Here's the output from my program:

P:\test>484593-5 n2l3172.dat 000:001 L[012] ( 364, 23)'CAGGAGCGGGCG' (1102,1008)'ACAGCCACGAAA' (1124,2290)'AGCAGGAAGGAG' (1506,2529)'GCGCGAAGCAAA' (1859, 224)'AACAGGAAGGGA' (2046,2360)'CAGAGCCCACAG' (2399,1226)'GGCGCAACCCGG' (2480, 401)'GGGGCCGGCAAC' (2494, 540)'CGCAGCAGAAAC' (3023,1102)'GGCAGCAAGGGC' 1 trial of n2l3172.dat (132.223ms total), 132.223ms/trial

As the dataset lines are 3k in length, I've posted it on my pad rather than here.

Sorry for the length of the lines, but they are randomly generated and I haven't yet worked out what it is about these two lines that induces the anomolous behaviour?

Update: Okay. I managed to find a shorter testcase that demonstrates the duplicate matches problem. Again, I've marked the duplicates with *:

P:\test\LCS>type temp >string 1 AACCAGGGTCATAGGAGCCGGCACTATGATCTCTAGTGTTTTAGGAGTAGTGAACCACCTTCCAGACACG +GCAGCCAACATTCCGTCTATCGCCTACTCG >string 2 TAGCACGGTGATAAGCTACAATTAACGCCATCTCACGCGCGCGAGTGTGTTCTACTACCCATTATATAAG +ATGAAAATGAACACTAAGGGCTGGGGCCCT P:\test\LCS>gf temp 000:001 L[ 5] ( 21 81) 000:001 L[ 5] ( 21 81)* 000:001 L[ 5] ( 28 29) 000:001 L[ 5] ( 34 43) 000:001 L[ 5] ( 35 46) 000:001 L[ 5] ( 50 77) 000:001 L[ 5] ( 66 3) 000:001 L[ 5] ( 66 3)* 000:001 L[ 5] ( 93 51) 000:001 L[ 5] ( 93 51)* Completed in 0.009414 Best match: >string 1 - >string 2. 5 characters starting at 21 and 81. Best match: >string 1 - >string 2. 5 characters starting at 21 and 81. Best match: >string 1 - >string 2. 5 characters starting at 28 and 29. Best match: >string 1 - >string 2. 5 characters starting at 34 and 43. Best match: >string 1 - >string 2. 5 characters starting at 35 and 46. Best match: >string 1 - >string 2. 5 characters starting at 50 and 77. Best match: >string 1 - >string 2. 5 characters starting at 66 and 3. Best match: >string 1 - >string 2. 5 characters starting at 66 and 3. Best match: >string 1 - >string 2. 5 characters starting at 93 and 51. Best match: >string 1 - >string 2. 5 characters starting at 93 and 51. P:\test\LCS>484593-5 temp 000:001 L[005] ( 21, 81)'CACTA' ( 28, 29)'ATCTC' ( 34, 43)'AGTGT' ( 35, 46)'GTGTT' ( 50, 77)'TGAAC' ( 66, 3)'CACGG' ( 93, 51)'CTACT' 1 trial of temp ( 147us total), 147us/trial

Update2: This is the shortest test that I've found that reproduces the problem. It may help you to isolate the cause.

P:\test\LCS>type temp >string 1 AAATATCTCGATAATTAGTA >string 2 TGACTAGATCATATAACCAC P:\test\LCS>gf temp 000:001 L[ 4] ( 2 10) 000:001 L[ 4] ( 2 10) 000:001 L[ 4] ( 10 12) Completed in 0.00106 Best match: >string 1 - >string 2. 4 characters starting at 2 and 10. Best match: >string 1 - >string 2. 4 characters starting at 2 and 10. Best match: >string 1 - >string 2. 4 characters starting at 10 and 12. P:\test\LCS>484593-5 temp 000:001 L[004] ( 2, 10)'ATAT' ( 10, 12)'ATAA' 1 trial of temp ( 36us total), 36us/trial

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

Replies are listed 'Best First'.
Re^2: Fast common substring matching
by GrandFather (Saint) on Sep 18, 2005 at 09:53 UTC

    Another update fixing the last round of issues.


    Perl is Huffman encoded by design.
      I came up with an algorithm inspired by bzip's algorithm of generating all substrings and then sorting them. I tried yours on a list of 20 strings of 1000 chars, and it ran in 153 seconds. Mine ran in 0.67 seconds, yielding the same results. 30 strings of 3000 chars runs in 20.3 seconds on mine; scaling up from there starts to get painful, but I would guess the OP's requirement of 300 strings of 3000 chars would run in under an hour, if it had plenty of memory (there will be 900,000 strings averaging 1500 chars in length).

      Give it a whirl.


      Caution: Contents may have been coded under pressure.

        Roy

        There is one difference between your algorithm and Grandfather's. His code returns the longest substring for each pair of input strings.

        With my original data set your code returns one substring. Grandfather's code returned over three thousand (where $minmatch = 256). On the other hand your code finds multiple occurrences of the longest common substrings, if they all have the same length, which I like.

        Mike

        Greetings Roy Johnson,

        I learn more Perl from reading your code. It took me a while, but somehow overlooked the line sorting $strings before walking through the list.

        Today, came to realization that Perl can do bitwise operations on strings.

        my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^ +(\0*)/;

        Thank you for this.