Re^3: Fast common substring matching

Re^4: Fast common substring matching by bioMan (Beadle) on Nov 29, 2005 at 16:53 UTC
Roy,

There is one difference between your algorithm and GrandFather's. His code returns the longest substring for each pair of input strings. With my original data set your code returns one substring. GrandFather's code returned over three thousand (where $minmatch = 256). On the other hand, your code finds multiple occurrences of the longest common substrings if they all have the same length, which I like.

Mike
Re^5: Fast common substring matching by Roy Johnson (Monsignor) on Nov 29, 2005 at 17:08 UTC
Yes, after I came up with my algorithm, I realized what all the output from GrandFather's code meant. I had thought it was just some sort of cryptic progress meter. `:-)`

The (reasonably) obvious way to get the longest substring for each pair of input strings would be to run my algorithm using each pair of strings as input rather than the whole list of strings. That's probably more work than GF's method, though. I thought about trying it, but something shiny caught my attention...

Update: but now I've done it. It runs on 20 strings of 1000 characters in something under 10 seconds for me. 100 strings of 1000 characters takes about 4 minutes.

Caution: Contents may have been coded under pressure.
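A minimal sketch of the per-pair idea described above. This is not the code from the original post: the helper name `longest_common_pair` and the sample strings are made up for illustration, and single-byte (ASCII) input is assumed because the string XOR does not work on wide characters. It sorts the suffixes of both strings together and compares each suffix with the next one that comes from the other string.

```perl
use strict;
use warnings;

# Hypothetical helper: longest common substring of two strings, found by
# sorting all suffixes of both strings together and comparing each suffix
# with the next one in sorted order that comes from the other string.
sub longest_common_pair {
    my ($s1, $s2) = @_;

    # Each entry is [suffix, source-id].
    my @suffixes;
    push @suffixes, map { [ substr($s1, $_), 1 ] } 0 .. length($s1) - 1;
    push @suffixes, map { [ substr($s2, $_), 2 ] } 0 .. length($s2) - 1;
    @suffixes = sort { $a->[0] cmp $b->[0] } @suffixes;

    my $best = '';
    for my $i (0 .. $#suffixes - 1) {
        # Skip forward to the next suffix from the other string.
        my $j = $i + 1;
        ++$j while $j <= $#suffixes and $suffixes[$j][1] == $suffixes[$i][1];
        next if $j > $#suffixes;

        # Leading NULs in the XOR mark the shared prefix.
        my ($nuls) = ($suffixes[$i][0] ^ $suffixes[$j][0]) =~ /^(\0*)/;
        $best = substr($suffixes[$i][0], 0, length $nuls)
            if length($nuls) > length($best);
    }
    return $best;
}

# Run it once per pair of labelled strings (illustrative data).
my %seqs   = (one => "xabcdy", two => "zzabcdq", three => "qqbcdyy");
my @labels = sort keys %seqs;
for my $i (0 .. $#labels - 1) {
    for my $j ($i + 1 .. $#labels) {
        my $lcs = longest_common_pair(@seqs{ @labels[$i, $j] });
        printf "%s/%s: %s\n", $labels[$i], $labels[$j], $lcs;
    }
}
```

The outer double loop is where the cost comes from: the number of pairs grows quadratically with the number of input strings, which fits the jump in run time reported above between 20 and 100 strings.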
Re^6: Fast common substring matching by bioMan (Beadle) on Nov 29, 2005 at 22:37 UTC
"I had thought it was just some sort of cryptic progress meter. :-)"

LOL - I know what you mean.

I'm still going over your original code to see how you did what you did -- trying to learn some Perl :-) I'll give the new code a try.

I also see that the minimum length in your code doesn't have to be a power of 2. This should allow me to analyze a limit boundary that appears to be present in my data. GrandFather's code allowed me to come up with what I feel is a pretty good estimate for the value of the limit, but this should allow a closer examination of the limit.

Mike
Re^7: Fast common substring matching by GrandFather (Saint) on Nov 29, 2005 at 22:56 UTC
Re^7: Fast common substring matching by bioMan (Beadle) on Nov 30, 2005 at 19:33 UTC
Re^6: Fast common substring matching by marioroy (Prior) on Feb 18, 2016 at 00:01 UTC
Update: On Windows it is important to start the shared-manager process immediately if the shared variable is constructed after loading data. Unix platforms benefit from the Copy-on-Write feature, which is great.

    ...
    use MCE::Hobo;
    use MCE::Shared;

    # For minimum memory consumption, start the shared-manager process before
    # loading data locally.

    MCE::Shared->start();  # <-- important on Windows

    my $minmatch = 4;
    my $startTime = [Time::HiRes::gettimeofday ()];
    my %strings;

    while (<>) {
        chomp(my $label = $_);
        chomp(my $string = <>);
        # Compute all substrings
        @{$strings{$label}} = map [substr($string, $_), $label, $_],
            0..(length($string) - $minmatch);
    }

    print "Loaded. Generating combos...\n";
    my @keys = sort keys %strings;

    my $sequence = MCE::Shared->sequence(
        { chunk_size => 1, bounds_only => 1 },
        0, $#keys - 1
    );
    ...

Hello Roy Johnson,

I am fascinated by the various examples posted here, here, and also the Inline C demonstration. Your 2nd demonstration scales wonderfully on multiple cores after loading the strings hash.

For testing, I made a file containing 48 sequences. The serial and parallel code complete in 22.6 seconds and 6.1 seconds respectively. My laptop has 4 real cores plus 4 hyper-threads.

First, the construction for MCE::Hobo. This requires the 1.699_011 dev release or later, or the final MCE 1.7 release that will follow soon after.

    ...
    print "Loaded. Generating combos...\n";
    my @keys = sort keys %strings;

    # Now walk through the list. The best match for each string will be the
    # previous or next element in the list that is not from the original substring,
    # so for each entry, just look for the next one. See how many initial letters
    # match and track the best matches

    use MCE::Hobo;
    use MCE::Shared;

    my $sequence = MCE::Shared->sequence(
        { chunk_size => 1, bounds_only => 1 },
        0, $#keys - 1
    );

    sub walk_list {
        my @best_overall_match = (0);

        # $beg and $end have the same values when chunk_size => 1
        while ( my ( $beg, $end ) = $sequence->next ) {
            for my $ki1 ( $beg .. $end ) {
                for my $ki2 (($ki1 + 1)..$#keys) {
                    my @strings = sort {$a->[0] cmp $b->[0]}
                        @{$strings{$keys[$ki1]}}, @{$strings{$keys[$ki2]}};
                    my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)
                    for my $i1 (0..($#strings - 1)) {
                        my $i2 = $i1 + 1;
                        ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
                        next if $i2 > $#strings;
                        my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
                        next if $common < $minmatch;
                        if ($common > $matchdata[0]) {
                            @matchdata = ($common, [$i1, $i2]);
                        }
                        elsif ($common == $matchdata[0]) {
                            push @matchdata, [$i1, $i2];
                        }
                    }
                    next if $matchdata[0] < $minmatch;
                    if ($matchdata[0] > $best_overall_match[0]) {
                        @best_overall_match = ($matchdata[0]);
                    }
                    if ($matchdata[0] >= $best_overall_match[0]) {
                        push @best_overall_match, map {
                            ["$strings[$_->[0]][1]:$strings[$_->[0]][2]",
                             "$strings[$_->[1]][1]:$strings[$_->[1]][2]"]
                        } @matchdata[1..$#matchdata];
                    }
                } # $ki2
            } # $ki1
        }

        return @best_overall_match;
    }

    MCE::Hobo->create( \&walk_list ) for 1 .. 8;

    my @best_overall_match = (0);

    for my $hobo ( MCE::Hobo->list ) {
        my @ret = $hobo->join;
        if ( $ret[0] > $best_overall_match[0] ) {
            @best_overall_match = @ret;
        }
        elsif ( $ret[0] == $best_overall_match[0] ) {
            shift @ret;
            push @best_overall_match, @ret;
        }
    }

    print "Best overall match: $best_overall_match[0] chars\n";
    ...

MCE::Loop is next and does the same thing.

    ...
    print "Loaded. Generating combos...\n";
    my @keys = sort keys %strings;

    # Now walk through the list. The best match for each string will be the
    # previous or next element in the list that is not from the original substring,
    # so for each entry, just look for the next one. See how many initial letters
    # match and track the best matches

    use MCE::Loop;

    MCE::Loop::init(
        max_workers => 8,
        chunk_size  => 1,
        bounds_only => 1,
    );

    my @ret = mce_loop_s {
        my ( $mce, $seq, $chunk_id ) = @_;
        my @best_overall_match = (0);

        # $seq->[0] and $seq->[1] have the same values when chunk_size => 1
        for my $ki1 ( $seq->[0] .. $seq->[1] ) {
            for my $ki2 (($ki1 + 1)..$#keys) {
                my @strings = sort {$a->[0] cmp $b->[0]}
                    @{$strings{$keys[$ki1]}}, @{$strings{$keys[$ki2]}};
                my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)
                for my $i1 (0..($#strings - 1)) {
                    my $i2 = $i1 + 1;
                    ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
                    next if $i2 > $#strings;
                    my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
                    next if $common < $minmatch;
                    if ($common > $matchdata[0]) {
                        @matchdata = ($common, [$i1, $i2]);
                    }
                    elsif ($common == $matchdata[0]) {
                        push @matchdata, [$i1, $i2];
                    }
                }
                next if $matchdata[0] < $minmatch;
                if ($matchdata[0] > $best_overall_match[0]) {
                    @best_overall_match = ($matchdata[0]);
                }
                if ($matchdata[0] >= $best_overall_match[0]) {
                    push @best_overall_match, map {
                        ["$strings[$_->[0]][1]:$strings[$_->[0]][2]",
                         "$strings[$_->[1]][1]:$strings[$_->[1]][2]"]
                    } @matchdata[1..$#matchdata];
                }
            } # $ki2
        } # $ki1

        MCE->gather(\@best_overall_match);

    } 0, $#keys - 1;

    MCE::Loop::finish;

    my @best_overall_match = (0);

    for my $i ( 0 .. $#ret ) {
        if ($ret[$i]->[0] > $best_overall_match[0]) {
            @best_overall_match = @{ $ret[$i] };
        }
        elsif ( $ret[$i]->[0] == $best_overall_match[0] ) {
            shift @{ $ret[$i] };
            push @best_overall_match, @{ $ret[$i] };
        }
    }

    print "Best overall match: $best_overall_match[0] chars\n";
    ...

This has been a lot of fun. I learned some more Perl from it all.

Regards, Mario
Re^4: Fast common substring matching by marioroy (Prior) on Feb 17, 2016 at 06:39 UTC
Greetings Roy Johnson,

I learn more Perl from reading your code. It took me a while because I had somehow overlooked the line sorting @strings before walking through the list. Today I came to the realization that Perl can do bitwise operations on strings.

`my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;`

Thank you for this.
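To make the trick in that line concrete, here is a small standalone sketch. The sub name `common_prefix_length` and the sample strings are invented for illustration. XORing two strings byte by byte yields a NUL wherever the bytes agree, so the run of NULs at the start of the result is exactly as long as the shared prefix. It assumes both operands are plain byte strings, and that the `bitwise` feature is not in effect (under that feature the string form of the operator is spelled `^.`).

```perl
use strict;
use warnings;

# Hypothetical helper: length of the common prefix of two strings, taken
# from the run of NUL bytes at the start of their bytewise XOR.
sub common_prefix_length {
    my ($s1, $s2) = @_;
    my ($nuls) = ($s1 ^ $s2) =~ /^(\0*)/;   # \0* also matches an empty run
    return length $nuls;
}

print common_prefix_length("GATTACA", "GATTTGA"), "\n";   # prints 4
print common_prefix_length("ACGT",    "TGCA"),    "\n";   # prints 0
```

The `*` in the pattern matters. With a bare `/^(\0)/` the capture can never be longer than one byte, and it fails outright when the first bytes differ, leaving `$common` undefined.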
Re^5: Fast common substring matching by marioroy (Prior) on Feb 17, 2016 at 07:36 UTC
Update: Added options chunk_size and bounds_only to MCE::Shared::Sequence in trunk, similar to MCE options. This allows MCE::Hobo workers to run as fast as MCE workers. Also, corrected the demonstration. Seeing this run faster than serial code made my day.

I had to try something with the upcoming MCE 1.7 release. Parallelism may be beneficial for big sequences. MCE 1.7 will ship with MCE::Hobo, a threads-like module for processes, so it benefits from the Copy-on-Write feature of modern operating systems. In essence, the @strings array is not copied for each worker unless the worker writes to it.

Using Roy Johnson's demonstration, I made the following changes to enable parallelism via MCE::Hobo workers. This requires MCE in trunk or dev release 1.699_011 or later.

    ...
    print "Sorted. Finding matches...\n";

    # Now walk through the list. The best match for each string will be the
    # previous or next element in the list that is not from the original substring,
    # so for each entry, just look for the next one. See how many initial letters
    # match and track the best matches
    #
    # my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)
    # for my $i1 (0..($#strings - 1)) {
    #     my $i2 = $i1 + 1;
    #     ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
    #     next if $i2 > $#strings;
    #     my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
    #     if ($common > $matchdata[0]) {
    #         @matchdata = ($common, [$i1, $i2]);
    #     }
    #     elsif ($common == $matchdata[0]) {
    #         push @matchdata, [$i1, $i2];
    #     }
    # }

    use MCE::Hobo;
    use MCE::Shared;

    my $sequence = MCE::Shared->sequence(
        { chunk_size => 500, bounds_only => 1 },
        0, $#strings - 1
    );

    sub walk_list {
        my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)

        while ( my ( $beg, $end ) = $sequence->next ) {
            for my $i1 ( $beg .. $end ) {
                my $i2 = $i1 + 1;
                ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
                next if $i2 > $#strings;
                my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
                if ($common > $matchdata[0]) {
                    @matchdata = ($common, [$i1, $i2]);
                }
                elsif ($common == $matchdata[0]) {
                    push @matchdata, [$i1, $i2];
                }
            }
        }

        return @matchdata;
    }

    MCE::Hobo->create( \&walk_list ) for 1 .. 8;

    my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)

    for my $hobo ( MCE::Hobo->list ) {
        my @ret = $hobo->join;
        if ( $ret[0] > $matchdata[0] ) {
            @matchdata = @ret;
        }
        elsif ( $ret[0] == $matchdata[0] ) {
            shift @ret;
            push @matchdata, @ret;
        }
    }

    print "Best match: $matchdata[0] chars\n";
    ...

MCE 1.7 is nearly completed in trunk. The MCE::Shared::Sequence module is helpful. I will try to finish MCE 1.7 by the end of the month.

Regards, Mario
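The join loop at the end of that demonstration is the only part that has to reconcile results from several workers, so here it is in isolation as a sketch. The sub name `merge_best` and the sample data are made up for illustration, and no MCE module is involved. Each worker result is a list whose first element is the best length that worker saw, followed by the index pairs that reach it. The merge keeps the largest length and concatenates the pairs of every worker that ties it.

```perl
use strict;
use warnings;

# Hypothetical helper: merge per-worker results of the form
# [length, pair, pair, ...] into one list of the same shape.
sub merge_best {
    my @best = (0);
    for my $result (@_) {
        my ($len, @pairs) = @$result;
        if ($len > $best[0]) {
            @best = ($len, @pairs);   # strictly better: replace everything
        }
        elsif ($len == $best[0]) {
            push @best, @pairs;       # tie: keep the pairs from both workers
        }
    }
    return @best;
}

# Three pretend workers; two of them tie at length 12.
my @merged = merge_best(
    [ 7,  [3, 4] ],
    [ 12, [10, 11] ],
    [ 12, [40, 42], [57, 58] ],
);
print "Best match: $merged[0] chars, ", scalar(@merged) - 1, " pair(s)\n";
```

Because each worker only reports the best of its own chunks, the merge does a constant amount of work per worker and is negligible next to the scan itself.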
Re^5: Fast common substring matching by marioroy (Prior) on Feb 17, 2016 at 08:34 UTC
Update: The update to MCE::Shared::Sequence in trunk allows MCE::Hobo workers to run as fast as MCE workers. Thank you for this. The MCE::Hobo demonstration made me realize the need to beef up MCE::Shared::Sequence with chunk_size and bounds_only options similar to MCE options.

Using Roy Johnson's demonstration, I made the following changes to enable parallelism via MCE::Loop.

    ...
    print "Sorted. Finding matches...\n";

    use MCE::Loop;

    MCE::Loop::init(
        max_workers => 8,
        chunk_size  => 500,
        bounds_only => 1,
    );

    my @ret = mce_loop_s {
        my ( $mce, $seq, $chunk_id ) = @_;
        my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)

        for my $i1 ( $seq->[0] .. $seq->[1] ) {
            my $i2 = $i1 + 1;
            ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
            next if $i2 > $#strings;
            my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
            if ($common > $matchdata[0]) {
                @matchdata = ($common, [$i1, $i2]);
            }
            elsif ($common == $matchdata[0]) {
                push @matchdata, [$i1, $i2];
            }
        }

        MCE->gather( \@matchdata );

    } 0, $#strings - 1;

    my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)

    for my $i ( 0 .. $#ret ) {
        if ( $ret[$i]->[0] > $matchdata[0] ) {
            @matchdata = @{ $ret[$i] };
        }
        elsif ( $ret[$i]->[0] == $matchdata[0] ) {
            shift @{ $ret[$i] };
            push @matchdata, @{ $ret[$i] };
        }
    }

    print "Best match: $matchdata[0] chars\n";
    ...
Re^4: Fast common substring matching by marioroy (Prior) on Feb 17, 2016 at 20:31 UTC
Greetings Roy Johnson,

I was drawn to your example and wanted to try enabling parallelism. Well, I'm happy to report that it works quite well.

MCE::Hobo demonstration
MCE::Loop demonstration

Parallelism is likely beneficial for larger sequences.

Regards, Mario

