http://qs1969.pair.com?node_id=486414


in reply to Re^5: Fast common substring matching
in thread Fast common substring matching

I doubt that even a pure C solution using this technique would be much faster. I haven't profiled it, but I'd guess most of the time is spent in index, and that is likely pretty efficient anyway.

I think some fussy code could handle the special case without impacting performance too much. The key would be detecting that a search sub-pattern was a repeating pattern and then "drifting" the pattern left by the repeat length to see if there is an earlier match against the target string than was found by index. Maybe I need to write some code so you see what I mean? :)
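
Something like this, perhaps: a rough, untested sketch of one way to read that "drifting" (repeat_unit and drifted_match are names invented here just for illustration, not anything from the posted code):

use strict;
use warnings;

# Length of the shortest repeating unit of $probe (e.g. 2 for 'ABABAB'),
# or 0 if the probe is not built from a shorter repeated unit.
sub repeat_unit {
    my ($probe) = @_;
    my $n = length $probe;
    for my $u ( 1 .. int( $n / 2 ) ) {
        next if $n % $u;
        return $u if $probe eq substr( $probe, 0, $u ) x ( $n / $u );
    }
    return 0;
}

# Find $probe in $target starting at $from, just as the main loop would
# with index(); if the probe is a repeat, drift the found position left
# one unit at a time for as long as an identical copy of the probe still
# starts there. That can expose an earlier match than the one index()
# reported when it began its search at $from.
sub drifted_match {
    my ( $target, $probe, $from ) = @_;
    my $pos = index $target, $probe, $from;
    return $pos if $pos < 0;

    if ( my $u = repeat_unit($probe) ) {
        while ( $pos >= $u
            and substr( $target, $pos - $u, length $probe ) eq $probe )
        {
            $pos -= $u;
        }
    }
    return $pos;
}

# 'ABAB' searched from offset 5 is first seen at 6; drifting left by the
# 2-character repeat unit walks that back to 2.
print drifted_match( 'xxABABABABxx', 'ABAB', 5 ), "\n";    # prints 2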


Perl is Huffman encoded by design.

Re^7: Fast common substring matching
by bioMan (Beadle) on Aug 25, 2005 at 16:42 UTC

    I am presently running my complete dataset with your program. The program has been merrily churning away for about 48 hours. When it completes this task I'll let you know how things turned out.

    update

    Oops, I had to restart the run. When I set up the program, I hardwired the name of my data file into it, and in doing so I created a bug that caused the program to idle: I had not removed the $_ = <>; line, so the program was waiting for me to enter data from the keyboard.

    I commented out the if (@ARGV != 1){...} check and added the following:

    my $file = "mydata.txt";
    open FILE, $file or die "Can't open $file: $!\n";

    my $out = "outdata.txt";
    open OUT, '>', $out or die "Can't open $out: $!\n";
    # all print and printf statements now print to
    # this file handle

    # Read in the strings
    chomp( my @file = <FILE> );

    # declare variables
    my @strings = ();
    my $place   = 1;
    my $strName = '';    # necessary for resolution of
                         # an undeclared global variable
                         # warning

    for (@file){
        if ($place){
            $strName = $_;                     # seq ID
            $place = 0;
        }else{
            push @strings, [$strName, $_];     # push seq ID, seq
            $place = 1;
        }
    }

      Extrapolating from BrowserUk's estimate in Re^5: Search for identical substrings (58 hours for the 300/3k data set) and my estimate that this code is about 7000 times faster than that, the total run time should be of the order of 30 seconds. If it is more than an hour, something is very wrong. Even if it is more than a few minutes, either our understanding is flawed or there is a bug that wasn't shown by the six-string data set.
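
      (Back of the envelope: 58 hours is roughly 58 * 3600 = 208,800 seconds, and 208,800 / 7000 is about 30 seconds.)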


      Perl is Huffman encoded by design.

        I have done a quick test, quick only because of your blazingly fast algorithm. My runtimes, in seconds, for 3 to 200 strings are given below. The times come from the Time::HiRes::gettimeofday function in your program. I set $subStrSize = 256.

        3,0.001355
        4,0.002366
        5,0.003744
        6,0.005187
        7,0.006808
        8,0.008807
        9,0.011459
        10,0.014172
        11,0.017328
        12,0.020473
        13,0.023491
        14,0.027905
        15,0.031813
        16,0.036426
        17,0.043177
        18,0.067228
        19,0.052098
        20,0.059449
        30,0.132611
        40,0.235241
        50,0.367828
        60,0.510851
        70,0.695534
        80,0.913879
        90,1.178992
        100,1.427899
        125,2.254988
        150,3.269179
        200,5.831225
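
        For anyone curious how those times are taken, the gettimeofday timing boils down to something like this (a minimal sketch of the usual pattern, not the actual code in the program; $nStrings is just a stand-in name):

        use strict;
        use warnings;
        use Time::HiRes qw( gettimeofday tv_interval );

        my $nStrings = 200;              # however many strings are being compared
        my $start    = [ gettimeofday ];

        # ... run the matching over the selected strings here ...

        # elapsed wall-clock time, printed as "count,seconds"
        printf "%d,%f\n", $nStrings, tv_interval($start);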

        These results are from a second copy of your program that I created. It appears my first copy still has a bug.

        Yes, there must be a bug in my original copy. Running my full data set with my backup copy of your program gave a runtime of 14.759568 seconds. I'd say that's a little better than three years! I'll check the results to see how they compare with some of my old results from limited data sets. That check could take a little time :-)

        The problem could be with the way PerlIDE runs scripts. All the times above are from the command line (Win2000).

        I ran the full data set with the second copy of the program from PerlIDE and got a runtime of 15.147881. So the problem is with my first copy of your program.

        I really like your algorithm. Its speed will allow me to modify my original data and test a couple of theories. You've got a real winner.

        update

        I checked the results from the program and the algorithm isn't finding all the matches. I made a modification which seems to do better.

        I added a new variable, $filterSize, which is used to exclude matches below a minimum length from the output. It takes over that role from $subStrSize; $subStrSize is now used only to set the minimum substring search size. The statement used to output the results is also modified.

        my $subStrSize = 128;
        my $filterSize = 256;
        ...
        $localBest[4] > $filterSize
            ? printf OUT "%10s::%-10s Length[%4d] Starts(%4d %4d)\n",
                $strings[$localBest[0]][0], $strings[$localBest[1]][0],
                $localBest[4], $localBest[2], $localBest[3]
            : print OUT " Didn't find LCS for $sourceName and $targetName\n";

        The last statement replaces:

        @localBest
            ? printf OUT "%10s::%-10s Length[%4d] Starts(%4d %4d)\n",
                $strings[$localBest[0]][0], $strings[$localBest[1]][0],
                $localBest[4], $localBest[2], $localBest[3]
            : print OUT " Didn't find LCS for $sourceName and $targetName\n";

        I hoped this would give me better results. With $filterSize = 256, I set $subStrSize to 16, 32, 64, or 128; the runtimes were 250.505445, 124.928021, 57.274645, and 32.787612 seconds, respectively. I got the best results with $subStrSize = 128, $filterSize = 256. I don't know why $subStrSize = 128 gave better results than shorter lengths. Also, $subStrSize = 256 gave significantly poorer results than any of the shorter lengths.