Re: Fast common substring matching

in reply to Fast common substring matching

The latest revisions to your code have fixed most of the problems I identified earlier, but running it over my test suite did turn up this anomoly:

P:\test>GF n2l3172.dat
000:001 L[  12] ( 364   23)
000:001 L[  12] ( 364   23)*
000:001 L[  12] ( 364   23)*
000:001 L[  12] ( 364   23)*
000:001 L[  12] (1102 1008)
000:001 L[  12] (1124 2290)
000:001 L[  12] (1506 2529)
000:001 L[  12] (1506 2529)*
000:001 L[  12] (1506 2529)*
000:001 L[  12] (1506 2529)*
000:001 L[  12] (1859  224)
000:001 L[  12] (2046 2360)
000:001 L[  12] (2399 1226)
000:001 L[  12] (2480  401)
000:001 L[  12] (2494  540)
000:001 L[  12] (3023 1102)
Completed in 5.292793
Best match: >1 - >2. 12 characters starting at 364 and 23.
Best match: >1 - >2. 12 characters starting at 364 and 23.*
Best match: >1 - >2. 12 characters starting at 364 and 23.*
Best match: >1 - >2. 12 characters starting at 364 and 23.*
Best match: >1 - >2. 12 characters starting at 1102 and 1008.
Best match: >1 - >2. 12 characters starting at 1124 and 2290.
Best match: >1 - >2. 12 characters starting at 1506 and 2529.
Best match: >1 - >2. 12 characters starting at 1506 and 2529.*
Best match: >1 - >2. 12 characters starting at 1506 and 2529.*
Best match: >1 - >2. 12 characters starting at 1506 and 2529.*
Best match: >1 - >2. 12 characters starting at 1859 and 224.
Best match: >1 - >2. 12 characters starting at 2046 and 2360.
Best match: >1 - >2. 12 characters starting at 2399 and 1226.
Best match: >1 - >2. 12 characters starting at 2480 and 401.
Best match: >1 - >2. 12 characters starting at 2494 and 540.
Best match: >1 - >2. 12 characters starting at 3023 and 1102.
[download]

As you can see, the matches I've marked with * are duplicates. Here's the output from my program:

P:\test>484593-5 n2l3172.dat
000:001 L[012]   ( 364,  23)'CAGGAGCGGGCG'
                 (1102,1008)'ACAGCCACGAAA'
                 (1124,2290)'AGCAGGAAGGAG'
                 (1506,2529)'GCGCGAAGCAAA'
                 (1859, 224)'AACAGGAAGGGA'
                 (2046,2360)'CAGAGCCCACAG'
                 (2399,1226)'GGCGCAACCCGG'
                 (2480, 401)'GGGGCCGGCAAC'
                 (2494, 540)'CGCAGCAGAAAC'
                 (3023,1102)'GGCAGCAAGGGC'
  1 trial  of n2l3172.dat (132.223ms total), 132.223ms/trial
[download]

As the dataset lines are 3k in length, I've posted it on my pad rather than here.

Sorry for the length of the lines, but they are randomly generated and I haven't yet worked out what it is about these two lines that induces the anomolous behaviour?

Update: Okay. I managed to find a shorter testcase that demonstrates the duplicate matches problem. Again, I've marked the duplicates with *:

P:\test\LCS>type temp
>string 1
AACCAGGGTCATAGGAGCCGGCACTATGATCTCTAGTGTTTTAGGAGTAGTGAACCACCTTCCAGACACG
+GCAGCCAACATTCCGTCTATCGCCTACTCG
>string 2
TAGCACGGTGATAAGCTACAATTAACGCCATCTCACGCGCGCGAGTGTGTTCTACTACCCATTATATAAG
+ATGAAAATGAACACTAAGGGCTGGGGCCCT

P:\test\LCS>gf temp
000:001 L[   5] (  21   81)
000:001 L[   5] (  21   81)*
000:001 L[   5] (  28   29)
000:001 L[   5] (  34   43)
000:001 L[   5] (  35   46)
000:001 L[   5] (  50   77)
000:001 L[   5] (  66    3)
000:001 L[   5] (  66    3)*
000:001 L[   5] (  93   51)
000:001 L[   5] (  93   51)*
Completed in 0.009414
Best match: >string 1 - >string 2. 5 characters starting at 21 and 81.
Best match: >string 1 - >string 2. 5 characters starting at 21 and 81.
Best match: >string 1 - >string 2. 5 characters starting at 28 and 29.
Best match: >string 1 - >string 2. 5 characters starting at 34 and 43.
Best match: >string 1 - >string 2. 5 characters starting at 35 and 46.
Best match: >string 1 - >string 2. 5 characters starting at 50 and 77.
Best match: >string 1 - >string 2. 5 characters starting at 66 and 3.
Best match: >string 1 - >string 2. 5 characters starting at 66 and 3.
Best match: >string 1 - >string 2. 5 characters starting at 93 and 51.
Best match: >string 1 - >string 2. 5 characters starting at 93 and 51.

P:\test\LCS>484593-5 temp
000:001 L[005]   (  21,  81)'CACTA'
                 (  28,  29)'ATCTC'
                 (  34,  43)'AGTGT'
                 (  35,  46)'GTGTT'
                 (  50,  77)'TGAAC'
                 (  66,   3)'CACGG'
                 (  93,  51)'CTACT'
  1 trial  of temp (    147us total), 147us/trial
[download]

Update2: This is the shortest test that I've found that reproduces the problem. It may help you to isolate the cause.

P:\test\LCS>type temp
>string 1
AAATATCTCGATAATTAGTA
>string 2
TGACTAGATCATATAACCAC

P:\test\LCS>gf temp
000:001 L[   4] (   2   10)
000:001 L[   4] (   2   10)
000:001 L[   4] (  10   12)
Completed in 0.00106
Best match: >string 1 - >string 2. 4 characters starting at 2 and 10.
Best match: >string 1 - >string 2. 4 characters starting at 2 and 10.
Best match: >string 1 - >string 2. 4 characters starting at 10 and 12.

P:\test\LCS>484593-5 temp
000:001 L[004]   (   2,  10)'ATAT'
                 (  10,  12)'ATAA'
  1 trial  of temp (     36us total), 36us/trial
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

Comment on Re: Fast common substring matching Select or Download Code

Replies are listed 'Best First'.
Re^2: Fast common substring matching by GrandFather (Saint) on Sep 18, 2005 at 09:53 UTC
Another update fixing the last round of issues. Read more... the code (8 kB) Perl is Huffman encoded by design.	[reply] [d/l]
Re^3: Fast common substring matching by Roy Johnson (Monsignor) on Nov 14, 2005 at 21:42 UTC
I came up with an algorithm inspired by bzip's algorithm of generating all substrings and then sorting them. I tried yours on a list of 20 strings of 1000 chars, and it ran in 153 seconds. Mine ran in 0.67 seconds, yielding the same results. 30 strings of 3000 chars runs in 20.3 seconds on mine; scaling up from there starts to get painful, but I would guess the OP's requirement of 300 strings of 3000 chars would run in under an hour, if it had plenty of memory (there will be 900,000 strings averaging 1500 chars in length). Give it a whirl. Read more... (3 kB) Caution: Contents may have been coded under pressure.	[reply] [d/l] [select]
Re^4: Fast common substring matching by bioMan (Beadle) on Nov 29, 2005 at 16:53 UTC
Roy There is one difference between your algorithm and Grandfather's. His code returns the longest substring for each pair of input strings. With my original data set your code returns one substring. Grandfather's code returned over three thousand (where $minmatch = 256). On the other hand your code finds multiple occurrences of the longest common substrings, if they all have the same length, which I like. Mike	[reply]
Re^5: Fast common substring matching by Roy Johnson (Monsignor) on Nov 29, 2005 at 17:08 UTC
Re^6: Fast common substring matching by bioMan (Beadle) on Nov 29, 2005 at 22:37 UTC
Some notes below your chosen depth have not been shown here
Re^6: Fast common substring matching by marioroy (Prior) on Feb 18, 2016 at 00:01 UTC
Re^4: Fast common substring matching by marioroy (Prior) on Feb 17, 2016 at 06:39 UTC
Greetings Roy Johnson, I learn more Perl from reading your code. It took me a while, but somehow overlooked the line sorting $strings before walking through the list. Today, came to realization that Perl can do bitwise operations on strings. `my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^ +(\0*)/;` [download] Thank you for this.	[reply] [d/l]
Re^5: Fast common substring matching by marioroy (Prior) on Feb 17, 2016 at 07:36 UTC
Re^5: Fast common substring matching by marioroy (Prior) on Feb 17, 2016 at 08:34 UTC
Re^4: Fast common substring matching by marioroy (Prior) on Feb 17, 2016 at 20:31 UTC
Greetings Roy Johnson, I was drawn to your example and wanted to try enabling parallelism. Well, I'm happy to report that it works quite well. MCE::Hobo demonstration MCE::Loop demonstration Parallelism is likely beneficial for larger sequences. Regards, Mario	[reply]

In Section Cool Uses for Perl