I've posted code here that seems to be about 7000 times faster than BrowserUk's solution
[ 8:05:03.68] P:\test>484593-3 -MIN=128 bioman.dat Best match len:1271 between string 0:82 & string 2:82 15 trials of bioman.dat ( 5.443s total), 362.894ms/trial
Which means your still 500+ X faster, but it allows me to save face (a little :).
Whilst your comments say that this can be adjusted to a lower power of 2, when I tried this, I got strangely 'almost correct' values for both length and offset. I tracked this down to ...
To demonstrate, I trimmed 1 character from the front of each line in biomain's data and ran your code with $subStrSize = 128 and got:
P:\test>gf bioman2.dat Completed in 0.009606 Best match: >string 1 - >string 3 . 1273 characters starting at 80 an +d 80.
Note that the length has increased by 2 when it should be the same, and both offsets have reduced by 2 when they should have reduced by 1?
I also tried setting $subStrSize = 1, as thought that might correct the problem, but apart from running much more slowly, it still produced the same, slightly wrong output:
[ 7:49:55.56] P:\test>gf bioman2.dat Completed in 76.466764 Best match: >string 1 - >string 3 . 1273 characters starting at 80 an +d 80.
All that said, your algorithm is blazingly fast and the anomolies can probably be corrected.
If so, or if bioman only needs the single LCS, and can live with specifying a minimum match and arranging for his input data to always be a multiple of that minimum size, it blow's what I have into dust.
At least, until I can adapt my code to use elements of your search technique :). Nice++
In reply to Re^2: Search for identical substrings
by BrowserUk
in thread Search for identical substrings
by bioMan
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |