in reply to Re^4: Fast common substring matching
in thread Fast common substring matching

... for bioMan's problem a minimum match quanta of 128 is probably optimum and I'd guess that that is long enough to be unlikely to be a problem.

Seems to be. Scanning for repeating sequences of 2, 3 & 4 characters, none was longer than 50 chars, so a minimum quanta of 64 would probably also be possible.
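
For anyone wanting to repeat that kind of check, here is a minimal sketch of one way such a scan might be done with a backreference. The sub name longest_short_repeat and the sample string are made up for illustration; this is not the code that produced the figure above.

```perl
use strict;
use warnings;

# Return the length of the longest run made of a repeated 2-4 character
# unit (e.g. "ACACAC" or "ATTATTATT") found in $seq.
# Hypothetical helper, named for illustration only.
sub longest_short_repeat {
    my ($seq) = @_;
    my $longest = 0;
    for my $unit (2 .. 4) {
        # (.{N}) captures one unit; \2+ demands at least one more copy of it.
        while ($seq =~ /((.{$unit})\2+)/g) {
            my $len = length $1;
            $longest = $len if $len > $longest;
        }
    }
    return $longest;
}

# Made-up sample: a ten character "AC" run and a nine character "ATT" run.
my $seq = 'TTGACACACACACGGATTATTATTC';
print longest_short_repeat($seq), "\n";
```

If the longest run reported over the whole data set stays below the minimum match length, a repeat on its own cannot fill a complete match window.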

inclined to ignore it unless someone can convince me that this is really useful

I understand that totally. I ended up resorting to Inline C to get speed because every attempt to improve the performance of my Perl versions ended up missing things.

Shame though. Your technique is so very fast for a pure Perl solution that it would be a real coup if it could be generalised.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

Re^6: Fast common substring matching
by GrandFather (Saint) on Aug 25, 2005 at 02:41 UTC

    I doubt that even a pure C solution using this technique would be much faster. I haven't profiled it, but I'd guess most of the time is spent in the index calls, and those are likely pretty efficient anyway.

    I think some fussy code could handle the special case without impacting performance too much. The key would be detecting that a search sub-pattern was a repeating pattern and then "drifting" the pattern left by the repeat length to see if there is an earlier match against the target string than was found by index. Maybe I need to write some code so you see what I mean? :)
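
    A rough sketch of one reading of that idea follows. It assumes the hit came from index called with a start offset, so an earlier aligned position may exist. The names repeat_length and drift_left are invented for this illustration and are not taken from the posted program.

    ```perl
    use strict;
    use warnings;

    # Smallest p such that $pat consists of whole copies of its first p
    # characters; returns 0 when $pat is not a repeating pattern.
    sub repeat_length {
        my ($pat) = @_;
        my $len = length $pat;
        for my $p (1 .. int($len / 2)) {
            next if $len % $p;    # the unit must divide the pattern evenly
            return $p if substr($pat, 0, $p) x ($len / $p) eq $pat;
        }
        return 0;
    }

    # Starting from a hit at $pos, step left one repeat length at a time
    # while the preceding characters of $target are another copy of the unit.
    sub drift_left {
        my ($target, $pat, $pos) = @_;
        my $p = repeat_length($pat) or return $pos;    # not repeating: keep hit
        my $unit = substr($pat, 0, $p);
        $pos -= $p while $pos >= $p && substr($target, $pos - $p, $p) eq $unit;
        return $pos;
    }

    # Made-up data: index is told to start looking at offset 6.
    my $target = 'GGACACACACACTT';
    my $hit    = index($target, 'ACAC', 6);
    print "index found $hit, drifted to ", drift_left($target, 'ACAC', $hit), "\n";
    ```

    The extra work only happens for sub-patterns that really are repeats, so ordinary patterns pay just the cost of the repeat_length check.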

    Perl is Huffman encoded by design.

      I am presently running my complete dataset with your program. The program has been merrily churning away for about 48 hours. When it completes this task I'll let you know how things turned out.


      Oops, I had to restart the run. When I set up the program I added specific code to hardwire the name of my data file into the program. When I did this I created a bug, which caused the program to idle. I had not removed the $_ = <>; line, so the program was waiting for me to enter data from the keyboard.

      I commented out the if (@ARGV != 1){...} block and added the following:

      my $file = "mydata.txt";
      open FILE, $file or die "Can't open $file: $!\n";
      my $out = "outdata.txt";
      open OUT, '>', $out or die "Can't open $out: $!\n";
      # all print and printf statements now print to
      # this file handle

      # Read in the strings
      chomp(my @file = <FILE>);

      # declare variables
      my @strings = ();
      my $place = 1;
      my $strName = '';    # necessary for resolution of
                           # an undeclared global variable
                           # warning

      for (@file){
          if ($place){
              $strName = $_;    # seq ID
              $place = 0;
          }else{
              push @strings, [$strName, $_];    # push seq ID, seq
              $place = 1;
          }
      }

        Extrapolating from BrowserUk's estimate in Re^5: Search for identical substrings (58 hours for the 300/3k data set) and my estimate that this code is about 7000 times faster than that, the total run time should be of the order of 30 seconds. If it is more than an hour something is very wrong. Even if it is more than a few minutes our understanding is flawed or there is a bug that wasn't shown by the six string data set.
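
        The arithmetic behind that figure, as a quick check. Both inputs are the rough estimates quoted above, not measurements, and the variable names are just for illustration.

        ```perl
        use strict;
        use warnings;

        my $baseline_hours = 58;      # BrowserUk's estimate for the 300/3k data set
        my $speedup        = 7000;    # estimated speed ratio, an assumption

        # Convert hours to seconds, then scale down by the speed-up.
        my $seconds = $baseline_hours * 3600 / $speedup;
        printf "Expected run time: about %.0f seconds\n", $seconds;
        ```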

        Perl is Huffman encoded by design.