Well BrowserUk, I'm afraid the numbers just don't agree with you at all.
I put together a hybrid Trie/DFA solution (the code is in the readmore at the bottom of this node). It shows quite clearly that the Trie/DFA will win in many cases against your XOR approach. Even though my trie implementation is pure Perl (with an extremely inefficient DFA construction routine), it outperformed your algorithm in every circumstance I tested _except_ the 25/1000/1000 tests that you were using, and even then, when I make the string being searched just 10 times larger, XOR loses outright. Also, any benefit that might accrue from OS-level file caching goes to the XOR algorithm: because the Trie algorithm most often finished first, I put it first in the test code. I didn't have sufficient tuits to redo the implementation in hybrid Perl/Inline::C, which would only have increased the difference between the routines.
Basically, the performance of the Trie/DFA is nonlinear for construction and linear for the actual scan. This means that once the automaton is constructed, the search time for any given piece of data will theoretically be some constant times the length of the string, and in practice that is pretty much what we see, with some modest variation probably due to memory-management costs. Alternate, more compact representations could be chosen to make this more efficient. Since the constructed trie can be serialized for reuse, the construction cost can in many situations be essentially ignored; for this reason I separated the time spent actually running the state machine from the time spent constructing it.
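To illustrate why the scan stays linear once the structure is built, here is a minimal, hypothetical Python sketch of the general idea (not the post's Perl implementation): a plain dictionary trie, plus a scan whose cost per offset is bounded by the longest word, so total time is O(text length) for a fixed word set.

```python
# Hypothetical sketch: dictionary-of-dictionaries trie. The '$' key is an
# illustrative accepting marker carrying the matched word.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w  # presence of '$' marks an accepting node
    return root

def scan(root, text):
    """Walk the trie from every offset; each walk is bounded by the longest
    word, so the whole scan is O(len(text) * max_word_len)."""
    hits = []
    for i in range(len(text)):
        node, j = root, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if '$' in node:
                hits.append((i, node['$']))
    return hits

print(scan(build_trie(["ACGT", "GGA"]), "TTACGTGGAC"))
# → [(2, 'ACGT'), (6, 'GGA')]
```

A real DFA avoids even the restart-per-offset by precomputing failure transitions, which is what the `traverse` routine below approximates.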
I will admit that there are two subtle differences between the results of the two approaches. Yours will show multiple hits for a single offset if they exist, whereas mine will show only one. I don't consider this to make your solution superior in any way, as it is a fairly trivial issue to resolve: one could simply maintain a hash of all the keys in the trie along with any other keys that are anagrams of each key. Similarly, this particular implementation doesn't necessarily handle strings of different sizes as one might expect. Since this is meant to be a proof of concept, and the implementation you posted is hardcoded to handle 25-char words and so suffers the same problem, I didn't see that as a show-stopper.
Lastly, I should say that calling this a DFA may not be entirely correct; a better term would perhaps be FSA (finite state automaton). The reason is that normally neither NFA nor DFA regex engines will match overlapping words. I.e., if the string is "ABCABD" and we have 'BC' and 'CA' as accepting strings, then a regex engine would match ONLY "BC". My implementation doesn't care about that and would match both. Regardless, there are any number of ways to build a state machine for the purpose at hand, and mine is probably one of the least efficient.
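That distinction is easy to demonstrate. A small Python sketch (illustrative, not the post's code): a normal regex scan consumes what it matches, so the overlapping hit is lost, while reporting every accepting state passed through keeps both.

```python
import re

text, words = "ABCABD", ["BC", "CA"]

# A standard regex alternation consumes matched text, so the overlapping
# 'CA' starting inside the 'BC' match is never reported:
print(re.findall("BC|CA", text))  # → ['BC']

# An automaton that reports every accepting state it passes through keeps
# both. Minimal stand-in: test every offset (what the post's FSA does in
# a single pass over the text).
hits = [(i, w) for i in range(len(text)) for w in words
        if text.startswith(w, i)]
print(hits)  # → [(1, 'BC'), (2, 'CA')]
```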
The following table shows the various test scenarios that I used (all times are in seconds). The last column shows how much longer the XOR solution took than the Trie solution; where the number is negative, the XOR-based solution beats the Trie-based solution. You'll notice that only one scenario falls into this category: when the strings are 1000 chars and the words are 25 chars, a ratio of 1:40. Once we reach a ratio of about 1:100, however, this reverses itself quite dramatically.
The bottom line here is that if you have the resources, you are going to be better off long-term with a technique like this than with a technique like yours. For something like searching vast quantities of DNA strings, I imagine a technique like this would do quite well. OTOH, this is a trade-off; I can imagine scenarios where your approach would be better. But for hardcore pattern matching on modern machines with lots of RAM, I'd be looking at a DFA. Notice that by the time we are searching 100k strings for 10 words, the trie finishes 26 minutes sooner. And this is pure Perl, and in the opinion of the author, a pig. :-)
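For reference, the XOR technique being benchmarked boils down to XORing the word against each window of the sequence and counting the bytes that come out zero (i.e. that agreed); the Perl version does this with `tr[\0][\0]`. A hedged Python sketch of the same trick (function names are mine, purely illustrative):

```python
# Sketch of the XOR fuzzy-match trick: XOR two equal-length byte strings;
# positions that agree become NUL bytes, so the mismatch count is
# length - (number of zero bytes).

def xor_mismatches(word: bytes, window: bytes) -> int:
    x = bytes(a ^ b for a, b in zip(word, window))
    return len(word) - x.count(0)

def xor_scan(word: bytes, seq: bytes, fuzz: int = 2):
    """Slide the word over the sequence; report offsets within `fuzz`
    mismatches. Cost is O(len(seq) * len(word)) per word, which is why
    it loses to the automaton as the inputs grow."""
    return [off for off in range(len(seq) - len(word) + 1)
            if xor_mismatches(word, seq[off:off + len(word)]) <= fuzz]

print(xor_scan(b"ACGT", b"TTACTTGGAC", fuzz=1))  # → [2]
```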
| Word Size | String Size | Word Count | Xor | Trie | Scan | Construct | Xor − Trie |
|---|---|---|---|---|---|---|---|
| 10 | 1000 | 1 | 3.8472 | 2.1892 | 2.0378 | 0.1514 | 1.6579 |
| 10 | 1000 | 2 | 5.4613 | 2.4221 | 2.0971 | 0.3250 | 3.0392 |
| 10 | 1000 | 5 | 10.4136 | 3.1230 | 2.4561 | 0.6669 | 7.2906 |
| 10 | 1000 | 10 | 18.8692 | 3.8036 | 2.5979 | 1.2058 | 15.0656 |
| 10 | 10000 | 1 | 36.7216 | 19.7340 | 19.5802 | 0.1538 | 16.9876 |
| 10 | 10000 | 2 | 53.5071 | 20.2742 | 19.9636 | 0.3106 | 33.2329 |
| 10 | 10000 | 5 | 103.6078 | 23.0304 | 22.3537 | 0.6766 | 80.5774 |
| 10 | 10000 | 10 | 182.2624 | 25.6677 | 24.4078 | 1.2599 | 156.5947 |
| 10 | 100000 | 1 | 375.8999 | 198.2450 | 198.0712 | 0.1737 | 177.6549 |
| 10 | 100000 | 2 | 544.4644 | 205.7155 | 205.3858 | 0.3297 | 338.7488 |
| 10 | 100000 | 5 | 1027.3750 | 219.6740 | 218.9498 | 0.7242 | 807.7010 |
| 10 | 100000 | 10 | 1829.9289 | 246.2426 | 244.8240 | 1.4185 | 1583.6864 |
| 25 | 1000 | 1 | 3.8653 | 8.0640 | 2.0574 | 6.0066 | -4.1987 |
| 25 | 1000 | 2 | 5.6597 | 13.6144 | 2.1015 | 11.5129 | -7.9548 |
| 25 | 1000 | 5 | 11.8086 | 32.7390 | 2.3764 | 30.3625 | -20.9304 |
| 25 | 1000 | 10 | 19.6336 | 63.8928 | 2.6036 | 61.2892 | -44.2592 |
| 25 | 10000 | 1 | 38.3304 | 26.9714 | 19.8883 | 7.0832 | 11.3590 |
| 25 | 10000 | 2 | 56.5971 | 32.9522 | 21.3281 | 11.6241 | 23.6449 |
| 25 | 10000 | 5 | 108.0958 | 52.9627 | 23.1748 | 29.7879 | 55.1330 |
| 25 | 10000 | 10 | 196.6709 | 80.1240 | 24.9569 | 55.1671 | 116.5469 |
| 25 | 100000 | 1 | 393.9894 | 202.1637 | 196.5024 | 5.6614 | 191.8256 |
| 25 | 100000 | 2 | 576.6479 | 218.8988 | 205.9747 | 12.9241 | 357.7492 |
| 25 | 100000 | 5 | 1067.7821 | 254.2639 | 223.5472 | 30.7167 | 813.5182 |
| 25 | 100000 | 10 | 1959.1590 | 315.1286 | 249.9723 | 65.1563 | 1644.0305 |
```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use POSIX qw(strftime);
use Getopt::Long;
use vars qw/ $VERBOSE /;
$VERBOSE = 0;
exit( main(@ARGV) );

BEGIN {
    # all of this gunk is for timing and logging stuff
    my @time    = (time);
    my @comment = ('Start');

    sub mark_time {
        push @time, time;
        push @comment, join "", @_;
        print @_, sprintf " <%9.4f secs>\n", $time[-1] - $time[-2]
            if $VERBOSE;
        return $time[-1];
    }

    sub clear_time {
        push @time, time;
        push @comment, 'Finish';
        @comment = map { s/[\t\r\n]/ /g; $_ } @comment;
        printf STDERR "%%#%% %s %s %s\n",
            strftime( "%H:%M:%S", localtime( $time[$_] ) ),
            $_ ? sprintf " (%9.4f : %9.4f)",
                     $time[$_] - $time[ $_ - 1 ], $time[$_] - $time[0]
               : '###',
            ( $comment[$_] || '' )
            for 0 .. $#time;
        my $total = $time[-1] - $time[0];
        print "Total Elapsed Time: " . $total . " secs\n";
        @time    = (time);
        @comment = ( shift || 'Start' );
        return $total;
    }

    sub elapsed {
        my $time = shift;
        $time ||= $time[-1];
        print @_, sprintf " [%9.4f secs]\n", time - $time;
        return time;
    }

    END { clear_time(); }
}

sub search {
    my ( $node, $string, $idx, $maximum ) = @_;
    $idx = 0 unless defined $idx;
    my @match;
    my $matches = 0;
    $maximum ||= 50;
    #print "\nsearch($node,$string,$idx)\n" if $VERBOSE<0;
    for ( ; $idx < length($string) ; $idx++ ) {
        my $char = substr( $string, $idx, 1 );
        #print "$idx : $char : $node->{id} : ",
        #    ( $node->{value} ? $node->{value} : "''" ), "\n" if $VERBOSE<0;
        if ( $node->{value} ) {
            push @match,
                [ $idx - length( $node->{string} ), ${ $node->{value} },
                  $node->{fuzz}, $node ]
                if $matches++ < $maximum;
        }
        $node = $node->{$char} || $node->{ -$char };
        die $char, $idx unless $node;
    }
    if ( $node->{value} ) {
        push @match,
            [ $idx - length( $node->{string} ), ${ $node->{value} },
              $node->{fuzz}, $node ]
            if $matches++ < $maximum;
    }
    return $matches, @match;
}

BEGIN {
    my $IDX = 1;
    sub get_idx() {"$IDX"}

    # this builds the trie
    sub hash_trie_store {
        my ( $node, $string, $value, $fuzz ) = @_;
        $node->{id} ||= $IDX++;
        foreach my $char ( split //, $string ) {
            unless ( $node->{$char} ) {
                $node->{$char} = { id => $IDX++ };
            }
            $node = $node->{$char};
        }
        $node->{value}  = \$_[2];
        $node->{fuzz}   = $fuzz if defined $fuzz;
        $node->{string} = $string;
        #print "Store: $string:$value:$fuzz\n";
        $node->{id}
    }
}

# this makes fuzzy words from a string and then inserts them into the trie.
sub make_variants {
    my ( $string, $depth, $chars, $hash, $trie, $rdepth, $rstring ) = @_;
    my $ltrs = ref $chars ? $chars : [ split //, $chars ];
    my @char = split //, $string;
    $trie   ||= {};
    $hash   ||= {};
    $rdepth ||= 0;
    my $count = 0;
    my @follow;
    $rstring ||= $string;
    if ( !$rdepth ) {
        $hash->{$string} = hash_trie_store( $trie, $string, $rstring, $rdepth );
        $count++;
        #print "* $string, $rdepth\n";
    }
    foreach my $idx ( 0 .. $#char ) {
        foreach my $alt (@$ltrs) {
            next if $alt eq $char[$idx];
            local $char[$idx] = $alt;
            my $str = join "", @char;
            if ( !$hash->{$str} ) {
                $hash->{$str} =
                    hash_trie_store( $trie, $str, $rstring, $rdepth + 1 );
                $count++;
                push @follow, $str if $depth > 1;
                #print "$str, ", $rdepth + 1, "\n";
            }
        }
    }
    foreach my $str (@follow) {
        $count += make_variants( $str, $depth - 1, $ltrs, $hash, $trie,
                                 $rdepth + 1, $rstring );
    }
    return $count;
}

# this converts a trie into a "dfa" capable of doing the overlapping matching.
# note it doesn't handle strings of differing size correctly, so don't use it
# for that.
sub traverse {
    my ( $root, $chars, $str, $hash ) = @_;
    $hash ||= $root;
    $str = "" unless defined $str;
    my $ltrs = ref $chars ? $chars : [ split //, $chars ];
    foreach my $k (@$ltrs) {
        if ( $hash->{$k} ) {
            traverse( $root, $chars, $str . $k, $hash->{$k} );
        }
        else {
            if ( length($str) > 1 ) {
                my @chars = ( split( //, $str ), $k );
                while ( defined shift @chars ) {
                    my $csr = $root;
                    for my $c (@chars) {
                        last unless $csr = $csr->{$c} || $csr->{ -$c };
                    }
                    if ($csr) { $hash->{ -$k } = $csr; last }
                }
            }
            $hash->{ -$k } ||= $root;
        }
    }
    return;
}

# slightly modified version of BrowserUk's code; some changes were necessary
# because we allow for differing length words and also because Perl 5.6.2
# complained.
sub xor_fuzz {
    my ( $length, $FUZZY, $fuzzfile, $stringfile, $maximum ) = @_;
    use bytes;
    $FUZZY   ||= 2;
    $maximum ||= 50;
    mark_time "xor_fuzz loading '$fuzzfile'\n";
    open my $FUZ, '<', $fuzzfile or die "$fuzzfile : $!";
    my %fuz;
    while (<$FUZ>) { chomp; $fuz{$_} = ''; }
    close $FUZ;
    mark_time "Loaded ${ \scalar keys %fuz } $length-ers\n";
    open my $SEQ, '<', $stringfile or die "$stringfile : $!";
    my $totalLen   = 0;
    my $fuzzyComps = 0;
    while ( my $seq = <$SEQ> ) {
        my @matches;
        my $matches = 0;
        chomp $seq;
        $totalLen += length $seq;
        for my $offset ( 0 .. length($seq) - $length ) {
            my $ssref = \substr( $seq, $offset, $length );
            #printf STDERR "\rProcessing sequence %5d offset %05d", $., $offset;
            for my $fuz ( keys %fuz ) {
                $fuzzyComps++;
                my $m = $length - ( my $x = ( $fuz ^ $$ssref ) ) =~ tr[\0][\0];
                if ( $m <= $FUZZY ) {
                    push @matches, [ $offset, $fuz, $m ]
                        if $matches++ < $maximum;
                }
            }
        }
        mark_time sprintf "XOR #%04d: %04d > %s", $., $matches,
            @matches
            ? ": " . join( ", ",
                  map { sprintf "%06d:%s:%d", @{$_}[ 0 .. 2 ] } @matches )
            : "None.";
    }
    mark_time "\n\nProcessed $. sequences\nAverage length: ",
        $totalLen / $., "\n" . "Total fuzzy comparisons: ", $fuzzyComps, "\n";
    close $SEQ;
}

# utility function
sub write_file(&@) {
    my ( $sub, $file, $count, $override ) = @_;
    return if !$override and -e $file;
    open my $fh, ">", "$file.tmp" or die "Error writing to '$file.tmp':$!";
    print "Writing '$file'\n" if $VERBOSE;
    for ( 1 .. $count ) {
        print $fh $sub->(), "\n" or die "Failed to print to '$file'\n";
    }
    close $fh or die "Failed to close '$file'\n";
    rename "$file", "$file.bak" or die "Backup Rename failed:$!"
        if -e $file;
    rename "$file.tmp", "$file" or die "Tmp Rename failed:$!";
    mark_time "File Complete: '$file'\n";
}

sub main {
    $|++;
    my $Fuzz        = 2;
    my $Words       = 1;
    my $Word_Size   = 25;
    my $Strings     = 1000;
    my $String_Size = 1000;
    my $OverWrite   = 0;
    GetOptions(
        'fuzz=i'        => \$Fuzz,
        'words=i'       => \$Words,
        'strings=i'     => \$Strings,
        'word_size=i'   => \$Word_Size,
        'string_size=i' => \$String_Size,
        'over_write!'   => \$OverWrite,
        'verbose!'      => \$VERBOSE
    ) or die "Bad Options\n";
    my @Chars = qw(A C T G);
    my $Word_File = sprintf "Fuzz-Words-W%04d-S%04d-WC%04d-SC%04d.fuzz",
        $Word_Size, $String_Size, $Words, $Strings;
    my $String_File = sprintf "Fuzz-Strings-W%04d-S%04d-WC%04d-SC%04d.fuzz",
        $Word_Size, $String_Size, $Words, $Strings;
    rename $String_File, "$String_File.bak"
        if -e $String_File and !-e $Word_File;
    my @words;
    write_file {
        my $str = join "", map { $Chars[ rand @Chars ] } 1 .. $Word_Size;
        push @words, $str;
        $str;
    } $Word_File, $Words, $OverWrite;
    print "Getting strings\n" if $VERBOSE;
    write_file {
        my $str = $Chars[ rand @Chars ] x $String_Size;
        substr( $str, $_ - 1, 1 ) = $Chars[ rand @Chars ]
            for 1 .. $String_Size;
        for ( 1 .. $Word_Size ) {
            my $p = int rand( $String_Size - $Word_Size );
            my $w = $words[ rand @words ];
            substr( $str, $p, $Word_Size, $w );
        }
        $str;
    } $String_File, $Strings, $OverWrite;
    mark_time "Finished building strings\nWords: $Word_File\nStrings: $String_File\n";
    clear_time("Starting TRIE\n");
    my $construct_time;
    {    # Trie search
        @words = do {
            open my $ifh, "<", $Word_File
                or die "Can't read Word File '$Word_File':$!";
            map { chomp; $_ } <$ifh>;
        };
        my $time = mark_time "Got " . scalar(@words) . " to fuzzyify\n";
        print "Making fuzzy strings\n" if $VERBOSE;
        my ( %trie, %hash );
        foreach my $word (@words) {
            print length($word) . ":$word:" if $VERBOSE;
            my $count = make_variants( $word, $Fuzz, \@Chars, \%hash, \%trie );
            $time = elapsed( $time, $count ) if $VERBOSE;
        }
        @words = sort keys %hash;
        mark_time "Fuzzy Words:" . scalar(@words)
            . " [" . get_idx() . " nodes in tree]\n";
        #exit(0);
        print "Doing traverse...\n" if $VERBOSE;
        traverse( \%trie, \@Chars );
        mark_time "Finished Traverse";
        $construct_time = clear_time("Starting Trie Scan\n");
        open my $fh, "<", $String_File
            or die "Error reading stringfile '$String_File'\n";
        while (<$fh>) {
            chomp( my $string = $_ );
            my ( $count, @matches ) = search( \%trie, $string );
            mark_time sprintf "TRIE #%04d: %04d > %s", $., $count,
                @matches
                ? ": " . join( ", ",
                      map { sprintf "%06d:%s:%d", @{$_}[ 0 .. 2 ] } @matches )
                : "None.";
        }
        mark_time "Trie Done\n";
    }
    my $scan_time = clear_time("Starting xor_fuzz search\n");
    {    # Xor Search
        xor_fuzz( $Word_Size, $Fuzz, $Word_File, $String_File );
    }
    my $xor_total = clear_time("xor_fuzz search done.\n");
    print STDERR "**** WordSize: $Word_Size StringSize: $String_Size"
        . " WordCount: $Words StringCount: $Strings Xor: $xor_total Trie: "
        . ( $scan_time + $construct_time )
        . " ($scan_time + $construct_time)\n";
    print "**** WordSize: $Word_Size StringSize: $String_Size"
        . " WordCount: $Words StringCount: $Strings Xor: $xor_total Trie: "
        . ( $scan_time + $construct_time )
        . " ($scan_time + $construct_time)\n";
    0;
}
```
And here is how it is meant to be used. (Note that it writes a special log to STDERR, and that it prints only the first 50 matches for either solution, so as not to overflow memory with such redundant data.)
```
del *.fuzz
trie_xor.pl --word_size=10 --string_size=1000   --words=1  --over_write 2>fuzzlog-10-1000-01.fuzzlog
trie_xor.pl --word_size=10 --string_size=1000   --words=2  --over_write 2>fuzzlog-10-1000-02.fuzzlog
trie_xor.pl --word_size=10 --string_size=1000   --words=5  --over_write 2>fuzzlog-10-1000-05.fuzzlog
trie_xor.pl --word_size=10 --string_size=1000   --words=10 --over_write 2>fuzzlog-10-1000-10.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=1  --over_write 2>fuzzlog-10-10000-01.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=2  --over_write 2>fuzzlog-10-10000-02.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=5  --over_write 2>fuzzlog-10-10000-05.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=10 --over_write 2>fuzzlog-10-10000-10.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=1  --over_write 2>fuzzlog-10-100000-01.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=2  --over_write 2>fuzzlog-10-100000-02.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=5  --over_write 2>fuzzlog-10-100000-05.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=10 --over_write 2>fuzzlog-10-100000-10.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=1  --over_write 2>fuzzlog-25-1000-01.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=2  --over_write 2>fuzzlog-25-1000-02.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=5  --over_write 2>fuzzlog-25-1000-05.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=10 --over_write 2>fuzzlog-25-1000-10.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=1  --over_write 2>fuzzlog-25-10000-01.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=2  --over_write 2>fuzzlog-25-10000-02.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=5  --over_write 2>fuzzlog-25-10000-05.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=10 --over_write 2>fuzzlog-25-10000-10.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=1  --over_write 2>fuzzlog-25-100000-01.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=2  --over_write 2>fuzzlog-25-100000-02.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=5  --over_write 2>fuzzlog-25-100000-05.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=10 --over_write 2>fuzzlog-25-100000-10.fuzzlog
```
A footnote: you mentioned a number of things that a DFA/Trie supposedly cannot do. I suggest that you were mostly wrong. Adding metadata to states in a DFA is relatively easy; for instance, accepting states can have associated data (in fact, in the implementation above it is the very presence of that data that marks a state as accepting).
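A minimal Python sketch of that point (the field names mirror the Perl nodes' `value` and `fuzz` slots, but the code is illustrative): the accepting state is simply a node carrying a payload, and the payload can hold arbitrary metadata, such as the original word and the edit distance of the stored variant.

```python
# Illustrative sketch: trie nodes are dicts; a node is accepting iff it
# carries a 'value' payload, and that payload can hold arbitrary metadata.

def insert(root, variant, original, fuzz):
    node = root
    for ch in variant:
        node = node.setdefault(ch, {})
    node['value'] = original  # presence of 'value' marks the state accepting
    node['fuzz'] = fuzz       # extra metadata: edit distance of this variant
    return root

root = {}
insert(root, "ACGT", "ACGT", 0)
insert(root, "ACCT", "ACGT", 1)  # one-substitution variant of the same word

# Walking the variant recovers both the original word and its distance:
node = root
for ch in "ACCT":
    node = node[ch]
print(node['value'], node['fuzz'])  # → ACGT 1
```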
In reply to Re^4: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Oh really? Code and Timings)
by demerphq
in thread Fuzzy Searching: Optimizing Algorithm Selection
by Itatsumaki