Re^9: Fast common substring matching

I have done a quick test. Quick because of your blazing fast algorithm. My runtimes for 3 to 200 strings is given below. The times come from the Time::HiRes::gettimeofday function in your program. I set $subStrSize = 256.

3,0.001355
4,0.002366
5,0.003744
6,0.005187
7,0.006808
8,0.008807
9,0.011459
10,0.014172
11,0.017328
12,0.020473
13,0.023491
14,0.027905
15,0.031813
16,0.036426
17,0.043177
18,0.067228
19,0.052098
20,0.059449
30,0.132611
40,0.235241
50,0.367828
60,0.510851
70,0.695534
80,0.913879
90,1.178992
100,1.427899
125,2.254988
150,3.269179
200,5.831225
[download]

These results are from a second copy of your program that I created. It appears my first copy still has a bug.

Yes, there must be a bug in my original copy. Running my full data set with my backup copy of your program gave a runtime of 14.759568. I'd say that's a little better than three years! I'll check the results to see how they compare with some of my old results from limited data sets. That check could take a little time :-)

The problem could be with the way PerlIDE runs scripts. All the times above are from the command line (Win2000).

I ran the full data set with the second copy of the program from PerlIDE and got a runtime of 15.147881. So the problem is with my first copy of your program.

I really like your algorithm. It's speed will allow me to modify my original data and test a couple of theories. You've got a real winner.

update

I checked the results from the program and the algorithm isn't finding all the matches. I made a modification which seems to do better.

I added a new variable called $filterSize. This is used to exclude matches below a minimum length for output. It replaces $subStrSize. $subStrSize is now used to set the minimum string search size. The statement used to output the results is also modified.

my $subStrSize = 128;
my $filterSize = 256;

.
.
.

    $localBest[4] > $filterSize
      ? printf OUT "%10s::%-10s  Length[%4d]  Starts(%4d %4d)\n",
          $strings[$localBest[0]][0], $strings[$localBest[1]][0], $loc
+alBest[4], $localBest[2], $localBest[3]
      : print OUT "  Didn't find LCS for $sourceName and $targetName\n
+";
[download]

The last statement replaces:

    @localBest
      ? printf OUT "%10s::%-10s  Length[%4d]  Starts(%4d %4d)\n",
          $strings[$localBest[0]][0], $strings[$localBest[1]][0], $loc
+alBest[4], $localBest[2], $localBest[3]
      : print OUT "  Didn't find LCS for $sourceName and $targetName\n
+";
[download]

I hoped this would give me better results. I used $filterSize = 256. I set $subStrSize to 16, 32, 64, or 128. The runtimes for $subStrSize were 250.505445, 124.928021, 57.274645, and 32.787612, respectively. I got the best results with $subStrSize = 128, $filterSize = 256. I don't know why $subStrSize = 128 gave better results than shorter lengths. Also $subStrSize = 256 gave significantly poorer results than any of the shorter lengths.

Comment on Re^9: Fast common substring matching Select or Download Code


good chemistry is complicated, and a little bit messy -LW
	PerlMonks