PerlMonks  

Re^2: Fast common substring matching

by GrandFather (Saint)
on Sep 18, 2005 at 09:53 UTC


in reply to Re: Fast common substring matching
in thread Fast common substring matching

Another update fixing the last round of issues.

use strict;
use warnings;
use Time::HiRes;
use List::Util qw(min max);
use Math::Pari qw(divisors);

=head Written

=cut

my $allLCS = 1;
my $subStrSize = 2; # Determines minimum match length. Should be less than half
# the minimum interesting match length. The larger this value is the faster the
# search runs.

if (@ARGV == 0) {
    print "Finds longest matching substring between any pair of test strings\n";
    print "in the given file. Pairs of lines are expected with the first of a\n";
    print "pair being the string name and the second the test string.";
    exit (1);
}

print "Minimum match length is $subStrSize\n";

my @strings; # Outside the loop so subs see it

while (@ARGV) {# process each file
    # Read in the strings
    my $filename = shift;

    print "\nProcessing: $filename\n";
    @strings = ();
    open inFile, "< $filename";
    while (<inFile>) {
        chomp;
        my $strName = $_;
        $_ = <inFile>;
        chomp;
        push @strings, [$strName, $_];
    }
    close inFile;

    my $lastStr = @strings - 1;
    my %bestMatches = ('len' => 0); # Best match details
    my $longest = 0; # Best match length so far (unexpanded)
    my $startTime = [Time::HiRes::gettimeofday ()];

    # Do the search
    for my $curStr (0..$lastStr) {# each string
        my ($sourceName, $source) = @{$strings[$curStr]};
        my @subStrs = generatePatterns ($source);
        my $lastSub = @subStrs-1;

        for my $targetStr (($curStr+1)..$lastStr) {# each remaining string
            my ($targetName, $target) = @{$strings[$targetStr]};
            my $targetLen = length $target;
            my $localLongest = 0;
            my @localBests = [(0, 0, 0, 0, 0)];

            for my $i (0..$lastSub) {
                my $offset = 0;
                while ($offset < $targetLen) {
                    $offset = index $target, $subStrs[$i][0], $offset;
                    last if $offset < 0;

                    my $matchStr2 = substr $target, $offset;
                    my $slipage = 0;
                    my $bestSlip = 0;
                    my $matchLen = 0;
                    my $first = 1;

                    while ($first || $slipage < $subStrSize && $subStrs[$i][1] < $subStrSize) {
                        my $matchStr1 = substr $source, $i * $subStrSize - $slipage;
                        ($matchStr1 ^ $matchStr2) =~ /^\0*/;
                        if ($matchLen < $+[0]) {
                            $bestSlip = $slipage;
                            $matchLen = $+[0];
                        }
                        $slipage += $subStrs[$i][1];
                        $first = 0;
                    }

                    next if $matchLen < $localLongest - $subStrSize + 1;
                    $localLongest = $matchLen;

                    my @test = ($curStr, $targetStr, $i * $subStrSize - $bestSlip, $offset, $matchLen);
                    @test = expandMatch (@test);
                    my $dm = $test[4] - $localBests[-1][4];
                    @localBests = () if $dm > 0;
                    push @localBests, [@test] if $dm >= 0;
                    $offset = $test[3] + $test[4];

                    next if $test[4] < $longest;
                    $longest = $test[4];
                    $dm = $longest - $bestMatches{'len'};
                    next if $dm < 0;

                    %bestMatches = ('len' => $test[4]) if $dm > 0;
                    $bestMatches{"$test[0],$test[1],$test[2],$test[3]"} = $test[4];
                }
                continue {++$offset;}
            }

            next if ! $allLCS;

            if (! @localBests) {
                print "Didn't find LCS for $sourceName and $targetName\n";
                next;
            }

            for (@localBests) {
                my @curr = @$_;
                printf "%03d:%03d L[%4d] (%4d %4d)\n",
                    $curr[0], $curr[1], $curr[4], $curr[2], $curr[3];
            }
        }
    }

    print "Completed in " . Time::HiRes::tv_interval ($startTime) . "\n";
    my $len = $bestMatches{'len'};
    for (keys %bestMatches) {
        next if $_ eq 'len';
        my @curr = split ',', $_;
        printf "Best match: %s - %s. %d characters starting at %d and %d.\n",
            $strings[$curr[0]][0], $strings[$curr[1]][0], $len, $curr[2], $curr[3];
    }
}

sub expandMatch {
    my ($index1, $index2, $str1Start, $str2Start, $matchLen) = @_;
    my $maxMatch = max (0, min ($str1Start, $subStrSize + 10, $str2Start));
    my $matchStr1 = substr ($strings[$index1][1], $str1Start - $maxMatch, $maxMatch);
    my $matchStr2 = substr ($strings[$index2][1], $str2Start - $maxMatch, $maxMatch);

    ($matchStr1 ^ $matchStr2) =~ /\0*$/;
    my $adj = $+[0] - $-[0];
    $matchLen += $adj;
    $str1Start -= $adj;
    $str2Start -= $adj;
    return ($index1, $index2, $str1Start, $str2Start, $matchLen);
}

sub generatePatterns {
    my @subStrs;
    my $source = shift;
    my %strs;

    for (my $i = 0; $i < (length $source) - $subStrSize + 1; $i += $subStrSize) {
        my $substr = substr $source, $i, $subStrSize;
        my ($cycleLen, $str) = findCycle ($substr);
        push @subStrs, [$substr, $cycleLen];
    }

    #push @subStrs, [$_, $strs{$_}] for keys %strs;
    return @subStrs;
}

sub findCycle {
    my $str = shift;
    my $copy = $str;
    my $cycleLen = 0;
    my $strLen = length ($copy);

    for (0..($strLen - 1)) {
        $copy .= substr $copy, 0, 1, '';
        $cycleLen = $_ + 1;
        ($str ^ $copy) =~ /^\0*/;
        return wantarray ? ($cycleLen, substr $str, 0, $cycleLen) : $cycleLen
            if $+[0] == $strLen;
    }
    return wantarray ? ($strLen, $str) : $strLen;
}

sub findCycle_1 {
    my $str = shift;
    my $strLen = length $str;

    for ( @{ divisors( $strLen ) } ) {
        my $copy = $str;
        $copy .= substr( $copy, 0, $_, '' );
        return wantarray ? ($_, substr $str, 0, $_) : $_ if $str eq $copy;
    }
}

Perl is Huffman encoded by design.

Replies are listed 'Best First'.
Re^3: Fast common substring matching
by Roy Johnson (Monsignor) on Nov 14, 2005 at 21:42 UTC
    I came up with an algorithm inspired by bzip's algorithm of generating all substrings and then sorting them. I tried yours on a list of 20 strings of 1000 chars, and it ran in 153 seconds. Mine ran in 0.67 seconds, yielding the same results. 30 strings of 3000 chars runs in 20.3 seconds on mine; scaling up from there starts to get painful, but I would guess the OP's requirement of 300 strings of 3000 chars would run in under an hour, if it had plenty of memory (there will be 900,000 strings averaging 1500 chars in length).
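A rough sketch of the sort-based idea described above, for readers who want to see its shape. This is not Roy Johnson's actual code: the sub name `longest_common_substrings` and the sample strings are made up for illustration, and it assumes the inputs contain no NUL bytes. It builds every suffix of every string, sorts them, and compares each suffix with the next one in sorted order that came from a different string.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): longest substrings shared by at
# least two different input strings, found by sorting all suffixes.
sub longest_common_substrings {
    my @inputs = @_;

    # One [suffix, source index] entry per suffix of every input string.
    my @suffixes;
    for my $src (0 .. $#inputs) {
        push @suffixes, [substr($inputs[$src], $_), $src]
            for 0 .. length($inputs[$src]) - 1;
    }
    @suffixes = sort { $a->[0] cmp $b->[0] } @suffixes;

    my $best = 0;
    my %found;
    for my $i (0 .. $#suffixes - 1) {
        # Skip ahead to the next suffix that came from a different string.
        my $j = $i + 1;
        ++$j while $j <= $#suffixes && $suffixes[$j][1] == $suffixes[$i][1];
        next if $j > $#suffixes;

        # String XOR turns matching leading characters into NUL bytes.
        my ($common) = map length,
            ($suffixes[$i][0] ^ $suffixes[$j][0]) =~ /^(\0*)/;
        next if $common == 0 || $common < $best;

        %found = () if $common > $best;   # a longer match replaces older ones
        $best = $common;
        $found{substr $suffixes[$i][0], 0, $common} = 1;
    }
    return ($best, sort keys %found);
}

my ($len, @subs) = longest_common_substrings(qw(xabcdy zzabcdq bcdef));
print "$len: @subs\n";   # prints "4: abcd"
```

Memory is the obvious cost: every suffix is stored as its own string, which is where the "900,000 strings averaging 1500 chars" estimate comes from.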

    Give it a whirl.


    Caution: Contents may have been coded under pressure.

      Roy

      There is one difference between your algorithm and GrandFather's. His code returns the longest substring for each pair of input strings.

      With my original data set your code returns one substring. GrandFather's code returned over three thousand (where $minmatch = 256). On the other hand, your code finds multiple occurrences of the longest common substrings if they all have the same length, which I like.

      Mike

        Yes, after I came up with my algorithm, I realized what all the output from GrandFather's code meant. I had thought it was just some sort of cryptic progress meter. :-)

        The (reasonably) obvious way to get the longest substring for each pair of input strings would be to run my algorithm using each pair of strings as input rather than the whole list of strings. That's probably more work than GF's method, though. I thought about trying it, but something shiny caught my attention...

        Update: but now I've done it. It runs on 20 strings of 1000 characters in something under 10 seconds for me. 100 strings of 1000 characters takes about 4 minutes.


        Caution: Contents may have been coded under pressure.

      Greetings Roy Johnson,

      I learn more Perl from reading your code. It took me a while to follow because I had somehow overlooked the line sorting @strings before walking through the list.

      Today I came to the realization that Perl can do bitwise operations on strings.

      my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
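To make the trick concrete, here is a small standalone sketch. The sub name `common_prefix_length` is made up for illustration, and it assumes neither string contains a NUL byte. XORing two strings yields "\0" wherever the characters agree, so the length of the leading run of NULs is the length of the common prefix.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): length of the common prefix of
# two strings. Assumes neither string contains a NUL byte.
sub common_prefix_length {
    my ($left, $right) = @_;
    # Equal characters XOR to "\0"; capture the leading run of NULs.
    my ($nuls) = ($left ^ $right) =~ /^(\0*)/;
    return length $nuls;
}

# "substring" and "subscript" share the prefix "subs".
print common_prefix_length('substring', 'subscript'), "\n";   # prints 4
```

One caveat: with the `bitwise` feature enabled (it is part of the feature bundle loaded by `use v5.28` and later), `^` always treats its operands as numbers, and the string form is spelled `^.` instead.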

      Thank you for this.

        Update: Added options chunk_size and bounds_only to MCE::Shared::Sequence in trunk, similar to MCE options. This allows MCE::Hobo workers to run as fast as MCE workers. Also, corrected the demonstration. Seeing this run faster than serial code made my day.

        I had to try something with the upcoming MCE 1.7 release. Parallelism may be beneficial for big sequences. MCE 1.7 will ship with MCE::Hobo, a threads-like module for processes. Workers therefore benefit from the copy-on-write feature of modern operating systems: in essence, the @strings array is not copied for each worker unless the worker writes to it.
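The same copy-on-write effect can be seen with nothing but core `fork`, independent of MCE. The sketch below is only an illustration of the idea, not how MCE::Hobo is implemented: the sub name `parallel_total_length`, the sample data, and the pipe-based result passing are all assumptions made for the example. Each child reads its slice of the parent's `@strings` and sends back a single number.

```perl
use strict;
use warnings;

my @strings = map { "string number $_" } 1 .. 1000;   # built once, in the parent

# Hypothetical helper (illustration only): sum the string lengths, with each
# forked child handling one contiguous range of indices.
sub parallel_total_length {
    my ($workers) = @_;
    my $chunk = int(@strings / $workers) + 1;
    my @children;

    for my $w (0 .. $workers - 1) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {            # child: only reads @strings
            close $reader;
            my $beg = $w * $chunk;
            my $end = $beg + $chunk - 1;
            $end = $#strings if $end > $#strings;
            my $sum = 0;
            $sum += length $strings[$_] for $beg .. $end;
            print {$writer} "$sum\n";
            close $writer;
            exit 0;
        }

        close $writer;              # parent keeps only the read end
        push @children, [$pid, $reader];
    }

    my $total = 0;
    for my $child (@children) {
        my ($pid, $reader) = @$child;
        chomp(my $line = <$reader>);
        $total += $line;
        close $reader;
        waitpid $pid, 0;
    }
    return $total;
}

print "Total length: ", parallel_total_length(4), "\n";
```

How much memory is actually shared depends on the operating system and on what the workers do; Perl can touch a scalar's internal bookkeeping even when the code only reads it, so some pages may still end up copied.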

        Starting from Roy Johnson's demonstration, I made the following changes to enable parallelism via MCE::Hobo workers. This requires MCE from trunk or the dev release 1.699_011 or later.

        ...
        print "Sorted. Finding matches...\n";

        # Now walk through the list. The best match for each string will be the
        # previous or next element in the list that is not from the original substring,
        # so for each entry, just look for the next one. See how many initial letters
        # match and track the best matches
        #
        # my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)
        # for my $i1 (0..($#strings - 1)) {
        #   my $i2 = $i1 + 1;
        #   ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
        #   next if $i2 > $#strings;
        #   my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
        #   if ($common > $matchdata[0]) {
        #     @matchdata = ($common, [$i1, $i2]);
        #   }
        #   elsif ($common == $matchdata[0]) {
        #     push @matchdata, [$i1, $i2];
        #   }
        # }

        use MCE::Hobo;
        use MCE::Shared;

        my $sequence = MCE::Shared->sequence(
           { chunk_size => 500, bounds_only => 1 },
           0, $#strings - 1
        );

        sub walk_list {
           my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)
           while ( my ( $beg, $end ) = $sequence->next ) {
              for my $i1 ( $beg .. $end ) {
                 my $i2 = $i1 + 1;
                 ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
                 next if $i2 > $#strings;
                 my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
                 if ($common > $matchdata[0]) {
                    @matchdata = ($common, [$i1, $i2]);
                 }
                 elsif ($common == $matchdata[0]) {
                    push @matchdata, [$i1, $i2];
                 }
              }
           }
           return @matchdata;
        };

        MCE::Hobo->create( \&walk_list ) for 1 .. 8;

        my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)

        for my $hobo ( MCE::Hobo->list ) {
           my @ret = $hobo->join;
           if ( $ret[0] > $matchdata[0] ) {
              @matchdata = @ret;
           }
           elsif ( $ret[0] == $matchdata[0] ) {
              shift @ret;
              push @matchdata, @ret;
           }
        }

        print "Best match: $matchdata[0] chars\n";
        ...

        MCE 1.7 is nearly completed in trunk. The MCE::Shared::Sequence module is helpful. I will try to finish MCE 1.7 by the end of the month.

        Regards, Mario

        Update: The update to MCE::Shared::Sequence in trunk allows MCE::Hobo workers to run as fast as MCE workers. Thank you for this. The MCE::Hobo demonstration made me realize the need to beef up MCE::Shared::Sequence with chunk_size and bounds_only options similar to the MCE options.

        Starting from Roy Johnson's demonstration, I made the following changes to enable parallelism via MCE::Loop.

        ...
        print "Sorted. Finding matches...\n";

        use MCE::Loop;

        MCE::Loop::init(
           max_workers => 8,
           chunk_size  => 500,
           bounds_only => 1,
        );

        my @ret = mce_loop_s {
           my ( $mce, $seq, $chunk_id ) = @_;
           my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)
           for my $i1 ( $seq->[0] .. $seq->[1] ) {
              my $i2 = $i1 + 1;
              ++$i2 while $i2 <= $#strings and $strings[$i2][1] eq $strings[$i1][1];
              next if $i2 > $#strings;
              my ($common) = map length, ($strings[$i1][0] ^ $strings[$i2][0]) =~ /^(\0*)/;
              if ($common > $matchdata[0]) {
                 @matchdata = ($common, [$i1, $i2]);
              }
              elsif ($common == $matchdata[0]) {
                 push @matchdata, [$i1, $i2];
              }
           }
           MCE->gather( \@matchdata );
        } 0, $#strings - 1;

        my @matchdata = (0); # (length, index1-into-strings, index2-into-strings)

        for my $i ( 0 .. $#ret ) {
           if ( $ret[$i]->[0] > $matchdata[0] ) {
              @matchdata = @{ $ret[$i] };
           }
           elsif ( $ret[$i]->[0] == $matchdata[0] ) {
              shift @{ $ret[$i] };
              push @matchdata, @{ $ret[$i] };
           }
        }

        print "Best match: $matchdata[0] chars\n";
        ...
