Well BrowserUk, I'm afraid the numbers just don't agree with you at all.
I put together a hybrid Trie/DFA solution (the code is in the readmore at the bottom of this node). It shows quite clearly that the Trie/DFA will win in many cases against your XOR approach. Even though my trie implementation is pure Perl (with an extremely inefficient DFA construction routine), it outperformed your algorithm in every circumstance I tested _except_ the 25/1000/1000 tests that you were using, and even then, when I make the string being searched just 10 times larger, XOR loses outright. Also, any benefit that might accrue from OS-level file caching goes to the XOR algorithm: because the Trie algorithm most often finished first, I put it first in the test code. I didn't have sufficient tuits to redo the implementation in hybrid Perl/Inline::C, which would only have increased the difference between the routines.
Basically, the performance of the Trie/DFA is nonlinear for construction and linear for the actual scan. This means that once the automaton is constructed, the search time for any given piece of data will theoretically be some constant times the length of the string, and in practice that is pretty much what we see, with some modest variation probably due to memory-management costs. Alternate, more compact representations could be chosen to make this more efficient. Since the constructed trie can be serialized for reuse, the construction cost can in many situations be essentially ignored; for this reason I separated the time spent actually running the state machine from the time spent constructing it.
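To illustrate why the scan stays linear once the structure is built, here is a minimal, hypothetical Python sketch of the general idea (not the post's Perl implementation): a plain dictionary trie, plus a scan whose cost per offset is bounded by the longest word, so total time is O(text length) for a fixed word set.

```python
# Hypothetical sketch: dictionary-of-dictionaries trie. The '$' key is an
# illustrative accepting marker carrying the matched word.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w  # presence of '$' marks an accepting node
    return root

def scan(root, text):
    """Walk the trie from every offset; each walk is bounded by the longest
    word, so the whole scan is O(len(text) * max_word_len)."""
    hits = []
    for i in range(len(text)):
        node, j = root, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if '$' in node:
                hits.append((i, node['$']))
    return hits

print(scan(build_trie(["ACGT", "GGA"]), "TTACGTGGAC"))
# → [(2, 'ACGT'), (6, 'GGA')]
```

A real DFA avoids even the restart-per-offset by precomputing failure transitions, which is what the `traverse` routine below approximates.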
I will admit that there are two subtle differences between the results of the two approaches. Yours will show multiple hits for a single offset if they exist, whereas mine will show only one. I don't consider this to make your solution superior in any way, as it is a fairly trivial issue to resolve: one could simply maintain a hash of all the keys in the trie along with any other keys that are anagrams of each key. Similarly, this particular implementation doesn't necessarily handle strings of different sizes as one might expect. Since this is meant to be a proof of concept, and the implementation you posted is hardcoded to handle 25-char words and so suffers the same problem, I didn't see that as a show-stopper.
Lastly, I should say that calling this a DFA may not be entirely correct; a better term would perhaps be FSA (finite state automaton). The reason is that normally neither NFA nor DFA regex engines will match overlapping words. I.e., if the string is "ABCABD" and we have 'BC' and 'CA' as accepting strings, then a regex engine would match ONLY "BC". My implementation doesn't care about that and would match both. Regardless, there are any number of ways to build a state machine for the purpose at hand, and mine is probably one of the least efficient.
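That distinction is easy to demonstrate. A small Python sketch (illustrative, not the post's code): a normal regex scan consumes what it matches, so the overlapping hit is lost, while reporting every accepting state passed through keeps both.

```python
import re

text, words = "ABCABD", ["BC", "CA"]

# A standard regex alternation consumes matched text, so the overlapping
# 'CA' starting inside the 'BC' match is never reported:
print(re.findall("BC|CA", text))  # → ['BC']

# An automaton that reports every accepting state it passes through keeps
# both. Minimal stand-in: test every offset (what the post's FSA does in
# a single pass over the text).
hits = [(i, w) for i in range(len(text)) for w in words
        if text.startswith(w, i)]
print(hits)  # → [(1, 'BC'), (2, 'CA')]
```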
The following table shows the various test scenarios that I used (all times are in seconds). The last column shows how much longer the XOR solution took than the Trie solution; where the number is negative, the XOR-based solution beats the Trie-based solution. You'll notice that only one scenario falls into this category: when the strings are 1000 chars and the words are 25 chars, a ratio of 1:40. Once we reach a ratio of about 1:100, however, this reverses itself quite dramatically.
The bottom line here is that if you have the resources, you are going to be better off long-term with a technique like this than with a technique like yours. For something like searching vast quantities of DNA strings, I imagine a technique like this would do quite well. OTOH, this is a trade-off; I can imagine scenarios where your approach would be better. But for hardcore pattern matching on modern machines with lots of RAM, I'd be looking at a DFA. Notice that by the time we are searching 100k strings for 10 words, the trie finishes 26 minutes sooner. And this is pure Perl, and in the opinion of the author, a pig. :-)
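For reference, the XOR technique being benchmarked boils down to XORing the word against each window of the sequence and counting the bytes that come out zero (i.e. that agreed); the Perl version does this with `tr[\0][\0]`. A hedged Python sketch of the same trick (function names are mine, purely illustrative):

```python
# Sketch of the XOR fuzzy-match trick: XOR two equal-length byte strings;
# positions that agree become NUL bytes, so the mismatch count is
# length - (number of zero bytes).

def xor_mismatches(word: bytes, window: bytes) -> int:
    x = bytes(a ^ b for a, b in zip(word, window))
    return len(word) - x.count(0)

def xor_scan(word: bytes, seq: bytes, fuzz: int = 2):
    """Slide the word over the sequence; report offsets within `fuzz`
    mismatches. Cost is O(len(seq) * len(word)) per word, which is why
    it loses to the automaton as the inputs grow."""
    return [off for off in range(len(seq) - len(word) + 1)
            if xor_mismatches(word, seq[off:off + len(word)]) <= fuzz]

print(xor_scan(b"ACGT", b"TTACTTGGAC", fuzz=1))  # → [2]
```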
| Word Size | String Size | Word Count | Xor | Trie | Scan | Construct | Xor − Trie |
|---|---|---|---|---|---|---|---|
| 10 | 1000 | 1 | 3.8472 | 2.1892 | 2.0378 | 0.1514 | 1.6579 |
| 10 | 1000 | 2 | 5.4613 | 2.4221 | 2.0971 | 0.3250 | 3.0392 |
| 10 | 1000 | 5 | 10.4136 | 3.1230 | 2.4561 | 0.6669 | 7.2906 |
| 10 | 1000 | 10 | 18.8692 | 3.8036 | 2.5979 | 1.2058 | 15.0656 |
| 10 | 10000 | 1 | 36.7216 | 19.7340 | 19.5802 | 0.1538 | 16.9876 |
| 10 | 10000 | 2 | 53.5071 | 20.2742 | 19.9636 | 0.3106 | 33.2329 |
| 10 | 10000 | 5 | 103.6078 | 23.0304 | 22.3537 | 0.6766 | 80.5774 |
| 10 | 10000 | 10 | 182.2624 | 25.6677 | 24.4078 | 1.2599 | 156.5947 |
| 10 | 100000 | 1 | 375.8999 | 198.2450 | 198.0712 | 0.1737 | 177.6549 |
| 10 | 100000 | 2 | 544.4644 | 205.7155 | 205.3858 | 0.3297 | 338.7488 |
| 10 | 100000 | 5 | 1027.3750 | 219.6740 | 218.9498 | 0.7242 | 807.7010 |
| 10 | 100000 | 10 | 1829.9289 | 246.2426 | 244.8240 | 1.4185 | 1583.6864 |
| 25 | 1000 | 1 | 3.8653 | 8.0640 | 2.0574 | 6.0066 | -4.1987 |
| 25 | 1000 | 2 | 5.6597 | 13.6144 | 2.1015 | 11.5129 | -7.9548 |
| 25 | 1000 | 5 | 11.8086 | 32.7390 | 2.3764 | 30.3625 | -20.9304 |
| 25 | 1000 | 10 | 19.6336 | 63.8928 | 2.6036 | 61.2892 | -44.2592 |
| 25 | 10000 | 1 | 38.3304 | 26.9714 | 19.8883 | 7.0832 | 11.3590 |
| 25 | 10000 | 2 | 56.5971 | 32.9522 | 21.3281 | 11.6241 | 23.6449 |
| 25 | 10000 | 5 | 108.0958 | 52.9627 | 23.1748 | 29.7879 | 55.1330 |
| 25 | 10000 | 10 | 196.6709 | 80.1240 | 24.9569 | 55.1671 | 116.5469 |
| 25 | 100000 | 1 | 393.9894 | 202.1637 | 196.5024 | 5.6614 | 191.8256 |
| 25 | 100000 | 2 | 576.6479 | 218.8988 | 205.9747 | 12.9241 | 357.7492 |
| 25 | 100000 | 5 | 1067.7821 | 254.2639 | 223.5472 | 30.7167 | 813.5182 |
| 25 | 100000 | 10 | 1959.1590 | 315.1286 | 249.9723 | 65.1563 | 1644.0305 |
```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use POSIX qw(strftime);
use Getopt::Long;
use vars qw/ $VERBOSE /;
$VERBOSE = 0;
exit( main(@ARGV) );

BEGIN {
    # all of this gunk is for timing and logging stuff
    my @time    = (time);
    my @comment = ('Start');

    sub mark_time {
        push @time, time;
        push @comment, join "", @_;
        print @_, sprintf " <%9.4f secs>\n", $time[-1] - $time[-2]
            if $VERBOSE;
        return $time[-1];
    }

    sub clear_time {
        push @time, time;
        push @comment, 'Finish';
        @comment = map { s/[\t\r\n]/ /g; $_ } @comment;
        printf STDERR "%%#%% %s %s %s\n",
            strftime( "%H:%M:%S", localtime( $time[$_] ) ),
            $_ ? sprintf " (%9.4f : %9.4f)",
                     $time[$_] - $time[ $_ - 1 ], $time[$_] - $time[0]
               : '###',
            ( $comment[$_] || '' )
            for 0 .. $#time;
        my $total = $time[-1] - $time[0];
        print "Total Elapsed Time: " . $total . " secs\n";
        @time    = (time);
        @comment = ( shift || 'Start' );
        return $total;
    }

    sub elapsed {
        my $time = shift;
        $time ||= $time[-1];
        print @_, sprintf " [%9.4f secs]\n", time - $time;
        return time;
    }

    END { clear_time(); }
}

sub search {
    my ( $node, $string, $idx, $maximum ) = @_;
    $idx = 0 unless defined $idx;
    my @match;
    my $matches = 0;
    $maximum ||= 50;
    #print "\nsearch($node,$string,$idx)\n" if $VERBOSE<0;
    for ( ; $idx < length($string) ; $idx++ ) {
        my $char = substr( $string, $idx, 1 );
        #print "$idx : $char : $node->{id} : ",
        #    ( $node->{value} ? $node->{value} : "''" ), "\n" if $VERBOSE<0;
        if ( $node->{value} ) {
            push @match,
                [ $idx - length( $node->{string} ), ${ $node->{value} },
                  $node->{fuzz}, $node ]
                if $matches++ < $maximum;
        }
        $node = $node->{$char} || $node->{ -$char };
        die $char, $idx unless $node;
    }
    if ( $node->{value} ) {
        push @match,
            [ $idx - length( $node->{string} ), ${ $node->{value} },
              $node->{fuzz}, $node ]
            if $matches++ < $maximum;
    }
    return $matches, @match;
}

BEGIN {
    my $IDX = 1;
    sub get_idx() {"$IDX"}

    # this builds the trie
    sub hash_trie_store {
        my ( $node, $string, $value, $fuzz ) = @_;
        $node->{id} ||= $IDX++;
        foreach my $char ( split //, $string ) {
            unless ( $node->{$char} ) {
                $node->{$char} = { id => $IDX++ };
            }
            $node = $node->{$char};
        }
        $node->{value}  = \$_[2];
        $node->{fuzz}   = $fuzz if defined $fuzz;
        $node->{string} = $string;
        #print "Store: $string:$value:$fuzz\n";
        $node->{id}
    }
}

# this makes fuzzy words from a string and then inserts them into the trie.
sub make_variants {
    my ( $string, $depth, $chars, $hash, $trie, $rdepth, $rstring ) = @_;
    my $ltrs = ref $chars ? $chars : [ split //, $chars ];
    my @char = split //, $string;
    $trie   ||= {};
    $hash   ||= {};
    $rdepth ||= 0;
    my $count = 0;
    my @follow;
    $rstring ||= $string;
    if ( !$rdepth ) {
        $hash->{$string} = hash_trie_store( $trie, $string, $rstring, $rdepth );
        $count++;
        #print "* $string, $rdepth\n";
    }
    foreach my $idx ( 0 .. $#char ) {
        foreach my $alt (@$ltrs) {
            next if $alt eq $char[$idx];
            local $char[$idx] = $alt;
            my $str = join "", @char;
            if ( !$hash->{$str} ) {
                $hash->{$str} =
                    hash_trie_store( $trie, $str, $rstring, $rdepth + 1 );
                $count++;
                push @follow, $str if $depth > 1;
                #print "$str, ", $rdepth + 1, "\n";
            }
        }
    }
    foreach my $str (@follow) {
        $count += make_variants( $str, $depth - 1, $ltrs, $hash, $trie,
                                 $rdepth + 1, $rstring );
    }
    return $count;
}

# this converts a trie into a "dfa" capable of doing the overlapping matching.
# note it doesn't handle strings of differing size correctly, so don't use it
# for that.
sub traverse {
    my ( $root, $chars, $str, $hash ) = @_;
    $hash ||= $root;
    $str = "" unless defined $str;
    my $ltrs = ref $chars ? $chars : [ split //, $chars ];
    foreach my $k (@$ltrs) {
        if ( $hash->{$k} ) {
            traverse( $root, $chars, $str . $k, $hash->{$k} );
        }
        else {
            if ( length($str) > 1 ) {
                my @chars = ( split( //, $str ), $k );
                while ( defined shift @chars ) {
                    my $csr = $root;
                    for my $c (@chars) {
                        last unless $csr = $csr->{$c} || $csr->{ -$c };
                    }
                    if ($csr) { $hash->{ -$k } = $csr; last }
                }
            }
            $hash->{ -$k } ||= $root;
        }
    }
    return;
}

# slightly modified version of BrowserUk's code; some changes were necessary
# because we allow for differing length words and also because Perl 5.6.2
# complained.
sub xor_fuzz {
    my ( $length, $FUZZY, $fuzzfile, $stringfile, $maximum ) = @_;
    use bytes;
    $FUZZY   ||= 2;
    $maximum ||= 50;
    mark_time "xor_fuzz loading '$fuzzfile'\n";
    open my $FUZ, '<', $fuzzfile or die "$fuzzfile : $!";
    my %fuz;
    while (<$FUZ>) { chomp; $fuz{$_} = ''; }
    close $FUZ;
    mark_time "Loaded ${ \scalar keys %fuz } $length-ers\n";
    open my $SEQ, '<', $stringfile or die "$stringfile : $!";
    my $totalLen   = 0;
    my $fuzzyComps = 0;
    while ( my $seq = <$SEQ> ) {
        my @matches;
        my $matches = 0;
        chomp $seq;
        $totalLen += length $seq;
        for my $offset ( 0 .. length($seq) - $length ) {
            my $ssref = \substr( $seq, $offset, $length );
            #printf STDERR "\rProcessing sequence %5d offset %05d", $., $offset;
            for my $fuz ( keys %fuz ) {
                $fuzzyComps++;
                my $m = $length - ( my $x = ( $fuz ^ $$ssref ) ) =~ tr[\0][\0];
                if ( $m <= $FUZZY ) {
                    push @matches, [ $offset, $fuz, $m ]
                        if $matches++ < $maximum;
                }
            }
        }
        mark_time sprintf "XOR #%04d: %04d > %s", $., $matches,
            @matches
            ? ": " . join( ", ",
                  map { sprintf "%06d:%s:%d", @{$_}[ 0 .. 2 ] } @matches )
            : "None.";
    }
    mark_time "\n\nProcessed $. sequences\nAverage length: ",
        $totalLen / $., "\n" . "Total fuzzy comparisons: ", $fuzzyComps, "\n";
    close $SEQ;
}

# utility function
sub write_file(&@) {
    my ( $sub, $file, $count, $override ) = @_;
    return if !$override and -e $file;
    open my $fh, ">", "$file.tmp" or die "Error writing to '$file.tmp':$!";
    print "Writing '$file'\n" if $VERBOSE;
    for ( 1 .. $count ) {
        print $fh $sub->(), "\n" or die "Failed to print to '$file'\n";
    }
    close $fh or die "Failed to close '$file'\n";
    rename "$file", "$file.bak" or die "Backup Rename failed:$!"
        if -e $file;
    rename "$file.tmp", "$file" or die "Tmp Rename failed:$!";
    mark_time "File Complete: '$file'\n";
}

sub main {
    $|++;
    my $Fuzz        = 2;
    my $Words       = 1;
    my $Word_Size   = 25;
    my $Strings     = 1000;
    my $String_Size = 1000;
    my $OverWrite   = 0;
    GetOptions(
        'fuzz=i'        => \$Fuzz,
        'words=i'       => \$Words,
        'strings=i'     => \$Strings,
        'word_size=i'   => \$Word_Size,
        'string_size=i' => \$String_Size,
        'over_write!'   => \$OverWrite,
        'verbose!'      => \$VERBOSE
    ) or die "Bad Options\n";
    my @Chars = qw(A C T G);
    my $Word_File = sprintf "Fuzz-Words-W%04d-S%04d-WC%04d-SC%04d.fuzz",
        $Word_Size, $String_Size, $Words, $Strings;
    my $String_File = sprintf "Fuzz-Strings-W%04d-S%04d-WC%04d-SC%04d.fuzz",
        $Word_Size, $String_Size, $Words, $Strings;
    rename $String_File, "$String_File.bak"
        if -e $String_File and !-e $Word_File;
    my @words;
    write_file {
        my $str = join "", map { $Chars[ rand @Chars ] } 1 .. $Word_Size;
        push @words, $str;
        $str;
    } $Word_File, $Words, $OverWrite;
    print "Getting strings\n" if $VERBOSE;
    write_file {
        my $str = $Chars[ rand @Chars ] x $String_Size;
        substr( $str, $_ - 1, 1 ) = $Chars[ rand @Chars ]
            for 1 .. $String_Size;
        for ( 1 .. $Word_Size ) {
            my $p = int rand( $String_Size - $Word_Size );
            my $w = $words[ rand @words ];
            substr( $str, $p, $Word_Size, $w );
        }
        $str;
    } $String_File, $Strings, $OverWrite;
    mark_time "Finished building strings\nWords: $Word_File\nStrings: $String_File\n";
    clear_time("Starting TRIE\n");
    my $construct_time;
    {    # Trie search
        @words = do {
            open my $ifh, "<", $Word_File
                or die "Can't read Word File '$Word_File':$!";
            map { chomp; $_ } <$ifh>;
        };
        my $time = mark_time "Got " . scalar(@words) . " to fuzzyify\n";
        print "Making fuzzy strings\n" if $VERBOSE;
        my ( %trie, %hash );
        foreach my $word (@words) {
            print length($word) . ":$word:" if $VERBOSE;
            my $count = make_variants( $word, $Fuzz, \@Chars, \%hash, \%trie );
            $time = elapsed( $time, $count ) if $VERBOSE;
        }
        @words = sort keys %hash;
        mark_time "Fuzzy Words:" . scalar(@words)
            . " [" . get_idx() . " nodes in tree]\n";
        #exit(0);
        print "Doing traverse...\n" if $VERBOSE;
        traverse( \%trie, \@Chars );
        mark_time "Finished Traverse";
        $construct_time = clear_time("Starting Trie Scan\n");
        open my $fh, "<", $String_File
            or die "Error reading stringfile '$String_File'\n";
        while (<$fh>) {
            chomp( my $string = $_ );
            my ( $count, @matches ) = search( \%trie, $string );
            mark_time sprintf "TRIE #%04d: %04d > %s", $., $count,
                @matches
                ? ": " . join( ", ",
                      map { sprintf "%06d:%s:%d", @{$_}[ 0 .. 2 ] } @matches )
                : "None.";
        }
        mark_time "Trie Done\n";
    }
    my $scan_time = clear_time("Starting xor_fuzz search\n");
    {    # Xor Search
        xor_fuzz( $Word_Size, $Fuzz, $Word_File, $String_File );
    }
    my $xor_total = clear_time("xor_fuzz search done.\n");
    print STDERR "**** WordSize: $Word_Size StringSize: $String_Size"
        . " WordCount: $Words StringCount: $Strings Xor: $xor_total Trie: "
        . ( $scan_time + $construct_time )
        . " ($scan_time + $construct_time)\n";
    print "**** WordSize: $Word_Size StringSize: $String_Size"
        . " WordCount: $Words StringCount: $Strings Xor: $xor_total Trie: "
        . ( $scan_time + $construct_time )
        . " ($scan_time + $construct_time)\n";
    0;
}
```
And here is how it is meant to be used. (Note that it writes a special log to STDERR, and that it prints only the first 50 matches for either solution, so as not to overflow memory with such redundant data.)
```
del *.fuzz
trie_xor.pl --word_size=10 --string_size=1000   --words=1  --over_write 2>fuzzlog-10-1000-01.fuzzlog
trie_xor.pl --word_size=10 --string_size=1000   --words=2  --over_write 2>fuzzlog-10-1000-02.fuzzlog
trie_xor.pl --word_size=10 --string_size=1000   --words=5  --over_write 2>fuzzlog-10-1000-05.fuzzlog
trie_xor.pl --word_size=10 --string_size=1000   --words=10 --over_write 2>fuzzlog-10-1000-10.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=1  --over_write 2>fuzzlog-10-10000-01.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=2  --over_write 2>fuzzlog-10-10000-02.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=5  --over_write 2>fuzzlog-10-10000-05.fuzzlog
trie_xor.pl --word_size=10 --string_size=10000  --words=10 --over_write 2>fuzzlog-10-10000-10.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=1  --over_write 2>fuzzlog-10-100000-01.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=2  --over_write 2>fuzzlog-10-100000-02.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=5  --over_write 2>fuzzlog-10-100000-05.fuzzlog
trie_xor.pl --word_size=10 --string_size=100000 --words=10 --over_write 2>fuzzlog-10-100000-10.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=1  --over_write 2>fuzzlog-25-1000-01.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=2  --over_write 2>fuzzlog-25-1000-02.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=5  --over_write 2>fuzzlog-25-1000-05.fuzzlog
trie_xor.pl --word_size=25 --string_size=1000   --words=10 --over_write 2>fuzzlog-25-1000-10.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=1  --over_write 2>fuzzlog-25-10000-01.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=2  --over_write 2>fuzzlog-25-10000-02.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=5  --over_write 2>fuzzlog-25-10000-05.fuzzlog
trie_xor.pl --word_size=25 --string_size=10000  --words=10 --over_write 2>fuzzlog-25-10000-10.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=1  --over_write 2>fuzzlog-25-100000-01.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=2  --over_write 2>fuzzlog-25-100000-02.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=5  --over_write 2>fuzzlog-25-100000-05.fuzzlog
trie_xor.pl --word_size=25 --string_size=100000 --words=10 --over_write 2>fuzzlog-25-100000-10.fuzzlog
```
A footnote: you mentioned a number of things that a DFA/Trie supposedly cannot do. I suggest that you were mostly wrong. Adding metadata to states in a DFA is relatively easy; for instance, accepting states can have associated data (in fact, in the implementation above it is the very presence of that data that marks a state as accepting).
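A minimal Python sketch of that point (the field names mirror the Perl nodes' `value` and `fuzz` slots, but the code is illustrative): the accepting state is simply a node carrying a payload, and the payload can hold arbitrary metadata, such as the original word and the edit distance of the stored variant.

```python
# Illustrative sketch: trie nodes are dicts; a node is accepting iff it
# carries a 'value' payload, and that payload can hold arbitrary metadata.

def insert(root, variant, original, fuzz):
    node = root
    for ch in variant:
        node = node.setdefault(ch, {})
    node['value'] = original  # presence of 'value' marks the state accepting
    node['fuzz'] = fuzz       # extra metadata: edit distance of this variant
    return root

root = {}
insert(root, "ACGT", "ACGT", 0)
insert(root, "ACCT", "ACGT", 1)  # one-substitution variant of the same word

# Walking the variant recovers both the original word and its distance:
node = root
for ch in "ACCT":
    node = node[ch]
print(node['value'], node['fuzz'])  # → ACGT 1
```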
In reply to Re^4: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Oh really? Code and Timings)
by demerphq
in thread Fuzzy Searching: Optimizing Algorithm Selection
by Itatsumaki