demerphq The numbers in your table are meaningless.

  1. Your solution is broken.

    To show this, I ran your test harness (slightly modified to accept filenames from the command line and disable the "only output the first 50 matches" code) on two small testfiles. The first was generated using this small script which generates all the variations of a specified length key; I used 10-char keys (406 variations) because your prototype code cannot handle 25-chars (2701 variations).

    #! perl -slw use strict; our $LEN ||= 10; my $s = 'A' x $LEN; print $s; for my $p1 ( 0 .. $LEN -1 ) { for my $c1 ( qw[ C G T ] ) { for my $p2 ( $p1 .. $LEN -1 ) { next if $p1 == $p2; for my $c2 ( qw[ C G T ] ) { my $t = $s; substr $t, $p1, 1, $c1; substr $t, $p2, 1, $c2; print $t; } } } } __END__ P:\test\demerphq>keyGen -LEN=10 >test.words P:\test\demerphq>wc -l test.words 406 test.words

    Here i've printed the first and last 10 to show the sequence:

    P:\test\demerphq>u:head test.words AAAAAAAAAA CCAAAAAAAA CGAAAAAAAA CTAAAAAAAA CACAAAAAAA CAGAAAAAAA CATAAAAAAA CAACAAAAAA CAAGAAAAAA CAATAAAAAA P:\test\demerphq>u:tail test.words AAAAAAATAT AAAAAAAACC AAAAAAAACG AAAAAAAACT AAAAAAAAGC AAAAAAAAGG AAAAAAAAGT AAAAAAAATC AAAAAAAATG AAAAAAAATT

    I used the following, 11-char single sequence for the test:

    P:\test\demerphq>type test.seq AAAAAAAAAAA

    The thing to note is that every one of the 406 keys in test.words will match that single 11-char sequence. Twice, once at offset 0, and again at offset 1.

    Now the run.

    [11:37:48.89] P:\test\demerphq>TrieXor.orig.pl test.words test.seq %#% 11:37:51 ### Start %#% 11:37:51 ( 0.0048 : 0.0048) Finish Total Elapsed Time: 0.00477910041809082 secs %#% 11:37:51 ### Starting TRIE %#% 11:37:51 ( 0.0029 : 0.0029) Got 406 to fuzzyify %#% 11:37:52 ( 1.0381 : 1.0411) Fuzzy Words:16906 [39678 nodes i +n tree] %#% 11:37:56 ( 4.7969 : 5.8380) Finised Traverse %#% 11:37:56 ( 0.0000 : 5.8380) Finish Total Elapsed Time: 5.83795619010925 secs %#% 11:37:56 ### Starting Trie Scan %#% 11:37:56 ( 0.0000 : 0.0000) TRIE #0001: 0002 > : 000000:AAAAAAAAAA:0, 000001:AAAAAAAAAA:0 %#% 11:37:56 ( 0.0000 : 0.0000) Trie Done %#% 11:37:56 ( 0.0313 : 0.0313) Finish Total Elapsed Time: 0.03125 secs %#% 11:37:56 ### Starting xor_fuzz search %#% 11:37:56 ( 0.0000 : 0.0000) xor_fuzz loading 'test.words' %#% 11:37:56 ( 0.0000 : 0.0000) Loaded 406 10-ers %#% 11:37:56 ( 0.0000 : 0.0000) XOR #0001: 0000 > : 000000:ACAAAAAAGA:2, 000000:GAAAATAAAA:2, 000000:CAACAAAAAA:2, 000000:AAAACAAAAC:2, 000000:AAACATAAAA:2, 000000:AACAATAAAA:2, 000000:AGAAAACAAA:2, 000000:ATGAAAAAAA:2, 000000:CAAAAATAAA:2, 000000:AAAAAGAATA:2, 000000:AATAAAAGAA:2, 000000:AAACAAAAAC:2, 000000:AAAAAAAGAT:2, 000000:AGAAAAAAAT:2, 000000:AAACAAACAA:2, 000000:AAAAAAAACG:2, 000000:AAAGAAACAA:2, 000000:ATACAAAAAA:2, 000000:AAAAACAACA:2, 000000:AAACAAAGAA:2, 000000:ACAAAGAAAA:2, 000000:AAACAAAAAT:2, 000000:TAAAAGAAAA:2, 000000:AAAAAATCAA:2, 000000:AAAAAATAAT:2, 000000:AAAACCAAAA:2, 000000:AAATAAAGAA:2, 000000:ATAAAAAACA:2, 000000:AAGAAATAAA:2, 000000:AAACCAAAAA:2, 000000:AAAAGGAAAA:2, 000000:AAGAAAGAAA:2, 000000:AAGTAAAAAA:2, 000000:AAAAAAGAAC:2, 000000:ACAAAAAATA:2, 000000:AACAAAATAA:2, 000000:AATGAAAAAA:2, 000000:ACTAAAAAAA:2, 000000:CAAAAAAAAC:2, 000000:AAAAAACGAA:2, 000000:AAATAAAAGA:2, 000000:GAAACAAAAA:2, 000000:AAAAGAAAGA:2, 000000:AAAATAAGAA:2, 000000:ACATAAAAAA:2, 000000:AAAGAAATAA:2, 000000:ATAAAAATAA:2, 000000:TAAAAAAAAT:2, 000000:AAAACATAAA:2, 000000:AAAAAAAAGG:2, 000000:AAAAAAAGAG:2, 000000:AAGAACAAAA:2, 000000:AAAAAAAATT:2, 000000:TAAAAAACAA:2, 000000:GAAATAAAAA:2, 000000:AAATCAAAAA:2, 000000:ACAAACAAAA:2, 000000:ACAATAAAAA:2, 000000:CAATAAAAAA:2, 000000:AAAAAAACCA:2, 000000:TAAAAAGAAA:2, 000000:ATAAAAAAAT:2, 000000:AAAAATTAAA:2, 000000:AGAAAAAATA:2, 000000:ACAACAAAAA:2, 000000:CAAAAAAACA:2, 000000:AAAAAAATGA:2, 000000:AAATAGAAAA:2, 000000:ACAAAAAAAC:2, 000000:AAAAACATAA:2, 000000:AAAAATAACA:2, 000000:GAAGAAAAAA:2, 000000:CAAAAAATAA:2, 000000:AAAAATAAAT:2, 000000:GACAAAAAAA:2, 000000:ATATAAAAAA:2, 000000:AAAAAACAAC:2, 000000:AAAAACACAA:2, 000000:AAAAACAGAA:2, 000000:TAAAAAAAGA:2, 000000:AAACAACAAA:2, 000000:TAAAAAAAAC:2, 000000:AACAAATAAA:2, 000000:AACCAAAAAA:2, 000000:AAAAAAGACA:2, 000000:AAAGAAAATA:2, 000000:CTAAAAAAAA:2, 000000:CAAAGAAAAA:2, 000000:AGTAAAAAAA:2, 000000:AAAAAAGAAG:2, 000000:CAAATAAAAA:2, 000000:ACAAAAACAA:2, 000000:AAAAACGAAA:2, 000000:AAAAAGCAAA:2, 000000:CAAAAGAAAA:2, 000000:AAGGAAAAAA:2, 000000:AAAAAAAATG:2, 000000:AAAATAACAA:2, 000000:AAAAAGAAAT:2, 000000:AATAAAGAAA:2, 000000:AACAAGAAAA:2, 000000:ATAAAGAAAA:2, 000000:AGAAAAAAGA:2, 000000:AAAAATAATA:2, 000000:AAAACACAAA:2, 000000:AGAGAAAAAA:2, 000000:AGACAAAAAA:2, 000000:AAAATAAAAC:2, 000000:AAAGAAAAGA:2, 000000:AAAAAAGATA:2, 000000:AAAAAAACGA:2, 000000:AAAGAATAAA:2, 000000:AAAAAAATAG:2, 000000:AAAATAATAA:2, 000000:TAGAAAAAAA:2, 000000:AACAAAAACA:2, 000000:GAAAAAAATA:2, 000000:AAAAAAAGAC:2, 000000:AAAATACAAA:2, 000000:GAAAAACAAA:2, 000000:AAAAAAATAT:2, 000000:AAAAGAGAAA:2, 000000:AAAGAAAAAC:2, 000000:AATAACAAAA:2, 000000:AAAAGAAAAT:2, 000000:AAATACAAAA:2, 000000:AAAAATAAAG:2, 000000:AAATGAAAAA:2, 000000:AAATAAAATA:2, 000000:AAAAAATGAA:2, 000000:AAAAAGAAAC:2, 000000:AAAAGAAAAC:2, 000000:AATAAAACAA:2, 000000:AAAAAAGAAT:2, 000000:AAAAAAAAGT:2, 000000:AAAAAGAGAA:2, 000000:AAAAACCAAA:2, 000000:CAAAAAAATA:2, 000000:TAAGAAAAAA:2, 000000:AAAAATAAGA:2, 000000:AAAAAACAAG:2, 000000:AAACGAAAAA:2, 000000:GAAAAAAAAT:2, 000000:AACATAAAAA:2, 000000:AAAGTAAAAA:2, 000000:CAGAAAAAAA:2, 000000:ACAAAAAGAA:2, 000000:AAAAAAAGCA:2, 000000:AAGACAAAAA:2, 000000:AAAGCAAAAA:2, 000000:ACAAAAATAA:2, 000000:CAAAACAAAA:2, 000000:AGAAAATAAA:2, 000000:AACAAAACAA:2, 000000:AAAAAAACAT:2, 000000:AGAAAAAGAA:2, 000000:ATAAAACAAA:2, 000000:AAAAATACAA:2, 000000:AAAAAAAGTA:2, 000000:TAAAACAAAA:2, 000000:ACAAAAAAAG:2, 000000:TAACAAAAAA:2, 000000:GAAAAAAAAC:2, 000000:GGAAAAAAAA:2, 000000:TAAAGAAAAA:2, 000000:AGAAGAAAAA:2, 000000:AAATTAAAAA:2, 000000:AAACAAAAGA:2, 000000:AAATAAAAAG:2, 000000:AATAAAAAGA:2, 000000:AAATAAAAAT:2, 000000:AAAGAACAAA:2, 000000:TATAAAAAAA:2, 000000:AAGAAAAAGA:2, 000000:AAAAAAAGGA:2, 000000:ATAAAAAATA:2, 000000:TAAAAACAAA:2, 000000:AACAGAAAAA:2, 000000:AAAAACAAAC:2, 000000:AAAAAAAATC:2, 000000:CAAAAAAAGA:2, 000000:AAAAAATAGA:2, 000000:AAAAAAATAC:2, 000000:GAGAAAAAAA:2, 000000:AAAAAGTAAA:2, 000000:AACAAAGAAA:2, 000000:AAACAAGAAA:2, 000000:ATAAAAGAAA:2, 000000:ACAAAACAAA:2, 000000:AAAAAAACAG:2, 000000:GAAAAATAAA:2, 000000:TAAAATAAAA:2, 000000:GAAAAAAGAA:2, 000000:ACAAAAAACA:2, 000000:AAAAGTAAAA:2, 000000:AATAAAAAAC:2, 000000:AAAAAACAGA:2, 000000:AAAGAAAACA:2, 000000:GAAAACAAAA:2, 000000:GAATAAAAAA:2, 000000:AGAAACAAAA:2, 000000:AAAAGAAAAG:2, 000000:AATTAAAAAA:2, 000000:GTAAAAAAAA:2, 000000:TAAAAATAAA:2, 000000:AAAATAAACA:2, 000000:TAAAAAAAAG:2, 000000:AAGAAAACAA:2, 000000:TAAAAAATAA:2, 000000:AGGAAAAAAA:2, 000000:CAAAAACAAA:2, 000000:AAAAGAAATA:2, 000000:ACAGAAAAAA:2, 000000:AAAGAAGAAA:2, 000000:AAAAACAATA:2, 000000:AAAGAAAGAA:2, 000000:AAAAAAACTA:2, 000000:AAGAATAAAA:2, 000000:AATAAGAAAA:2, 000000:AAAAATCAAA:2, 000000:TAAAAAAATA:2, 000000:AAAAAATAAG:2, 000000:AGAAAAGAAA:2, 000000:AATAAAAATA:2, 000000:AAATAAAACA:2, 000000:ATCAAAAAAA:2, 000000:ATAAAAAAAC:2, 000000:GAAAAAAAGA:2, 000000:ATAAGAAAAA:2, 000000:AACAAAAATA:2, 000000:AAAGGAAAAA:2, 000000:ACAAGAAAAA:2, 000000:CATAAAAAAA:2, 000000:AGAACAAAAA:2, 000000:AAGAAAAAAT:2, 000000:GAAAGAAAAA:2, 000000:AATAATAAAA:2, 000000:GAAAAAACAA:2, 000000:AAACACAAAA:2, 000000:GAAAAAATAA:2, 000000:AAAAACAAAT:2, 000000:AAACAAAACA:2, 000000:AATAAAAACA:2, 000000:AAAGACAAAA:2, 000000:AAAATAGAAA:2, 000000:AAAACAAGAA:2, 000000:AAAATTAAAA:2, 000000:AAAACAGAAA:2, 000000:AAAAACAAAG:2, 000000:ATAAACAAAA:2, 000000:AGCAAAAAAA:2, 000000:AACAACAAAA:2, 000000:AAAGAAAAAG:2, 000000:AAAAAACAAT:2, 000000:AAAAGAAACA:2, 000000:AAACTAAAAA:2, 000000:AACGAAAAAA:2, 000000:CGAAAAAAAA:2, 000000:AAATAATAAA:2, 000000:AATAAATAAA:2, 000000:AGAAAAACAA:2, 000000:AAGAAGAAAA:2, 000000:AATATAAAAA:2, 000000:AAACAGAAAA:2, 000000:GCAAAAAAAA:2, 000000:AAAAACAAGA:2, 000000:AGAATAAAAA:2, 000000:ATAAAAAAGA:2, 000000:AAATAAACAA:2, 000000:AAAAAAAACT:2, 000000:AAAAATAGAA:2, 000000:AAGAAACAAA:2, 000000:AAAAAACACA:2, 000000:AAAATAAAAT:2, 000000:CAAAAAAGAA:2, 000000:AAAATGAAAA:2, 000000:AGAAAAATAA:2, 000000:AAAGATAAAA:2, 000000:AAAAAATAAC:2, 000000:AAAACAAAGA:2, 000000:TCAAAAAAAA:2, 000000:AATAGAAAAA:2, 000000:AAACAAAAAG:2, 000000:AAAACAAAAT:2, 000000:AGAAAAAACA:2, 000000:ATAAAATAAA:2, 000000:ATAAAAAGAA:2, 000000:AACAAAAAAG:2, 000000:AAAAAGAACA:2, 000000:AAAAGCAAAA:2, 000000:AAAAAAGAGA:2, 000000:AAAAAAATCA:2, 000000:ATAGAAAAAA:2, 000000:AAAAAACTAA:2, 000000:AAAACAAATA:2, 000000:ACGAAAAAAA:2, 000000:AAGATAAAAA:2, 000000:AAAAAGACAA:2, 000000:AAAACTAAAA:2, 000000:AAGAAAAAAG:2, 000000:AATAAAAAAG:2, 000000:AAAAAAGGAA:2, 000000:ACACAAAAAA:2, 000000:AAAATAAATA:2, 000000:AAAATAAAGA:2, 000000:AGATAAAAAA:2, 000000:GAACAAAAAA:2, 000000:CAAAAAAAAG:2, 000000:ATAATAAAAA:2, 000000:AACAAACAAA:2, 000000:AAAAAACATA:2, 000000:TGAAAAAAAA:2, 000000:GAAAAGAAAA:2, 000000:AAAATCAAAA:2, 000000:AGAAAAAAAC:2, 000000:TAAACAAAAA:2, 000000:AACAAAAGAA:2, 000000:AAAAAGGAAA:2, 000000:AACACAAAAA:2, 000000:AAAAAATACA:2, 000000:AAAAATAAAC:2, 000000:AAATAAAAAC:2, 000000:AAAAAAGTAA:2, 000000:AATAAAAAAT:2, 000000:AAAACAATAA:2, 000000:AAAAAAGCAA:2, 000000:AGAAATAAAA:2, 000000:GAAAAAAAAG:2, 000000:AATACAAAAA:2, 000000:AATCAAAAAA:2, 000000:ACCAAAAAAA:2, 000000:CAAAAAGAAA:2, 000000:AAGAAAATAA:2, 000000:AATAAAATAA:2, 000000:ATAACAAAAA:2, 000000:AAAACGAAAA:2, 000000:GAAAAAGAAA:2, 000000:AAATAACAAA:2, 000000:ACAAAAAAAT:2, 000000:AAAAATATAA:2, 000000:AAATAAGAAA:2, 000000:CCAAAAAAAA:2, 000000:AACAAAAAAT:2, 000000:ATAAATAAAA:2, 000000:CAAGAAAAAA:2, 000000:AAACAAAATA:2, 000000:AAAAAAAACC:2, 000000:AAAAAGAAGA:2, 000000:AAAACAAAAG:2, 000000:ATAAAAAAAG:2, 000000:AAATAAATAA:2, 000000:TAATAAAAAA:2, 000000:CAAAATAAAA:2, 000000:AAAAAACCAA:2, 000000:TAAATAAAAA:2, 000000:AGAAAGAAAA:2, 000000:AAAGAAAAAT:2, 000000:AAAAAAACAC:2, 000000:AAAAAGATAA:2, 000000:AAGAAAAAAC:2, 000000:CAAACAAAAA:2, 000000:GAAAAAAACA:2, 000000:AAACAAATAA:2, 000000:AAGAAAAATA:2, 000000:AGAAAAAAAG:2, 000000:AAAAGACAAA:2, 000000:AAAAAGAAAG:2, 000000:AAAAAATTAA:2, 000000:AAAAAAAAGC:2, 000000:AAATATAAAA:2, 000000:AACAAAAAAC:2, 000000:AAAAAAATTA:2, 000000:AATAAACAAA:2, 000000:AAAGAGAAAA:2, 000000:AAAAAATATA:2, 000000:ACAAAAGAAA:2, 000000:AAGAAAAGAA:2, 000000:TAAAAAAGAA:2, 000000:GATAAAAAAA:2, 000000:AAAAGATAAA:2, 000000:ATTAAAAAAA:2, 000000:AAAAATGAAA:2, 000000:CAAAAAAAAT:2, 000000:AAGAGAAAAA:2, 000000:AAAATAAAAG:2, 000000:AAAAGAATAA:2, 000000:AAAAGAACAA:2, 000000:AACAAAAAGA:2, 000000:AACTAAAAAA:2, 000000:ACAAATAAAA:2, 000000:TTAAAAAAAA:2, 000000:AAAAACTAAA:2, 000000:AAGAAAAACA:2, 000000:AAGCAAAAAA:2, 000000:TACAAAAAAA:2, 000000:AAAAAAAAAA:0, 000000:TAAAAAAACA:2, 000000:AAAAGAAGAA:2, 000000:AAAATATAAA:2, 000000:ATAAAAACAA:2, 000000:ACAAAATAAA:2, 000000:AAAACAACAA:2, 000000:CAAAAAACAA:2, 000000:CACAAAAAAA:2, 000000:AAACAATAAA:2, 000000:AAAACAAACA:2, 000001:ACAAAAAAGA:2, 000001:GAAAATAAAA:2, 000001:CAACAAAAAA:2, 000001:AAAACAAAAC:2, 000001:AAACATAAAA:2, 000001:AACAATAAAA:2, 000001:AGAAAACAAA:2, 000001:ATGAAAAAAA:2, 000001:CAAAAATAAA:2, 000001:AAAAAGAATA:2, 000001:AATAAAAGAA:2, 000001:AAACAAAAAC:2, 000001:AAAAAAAGAT:2, 000001:AGAAAAAAAT:2, 000001:AAACAAACAA:2, 000001:AAAAAAAACG:2, 000001:AAAGAAACAA:2, 000001:ATACAAAAAA:2, 000001:AAAAACAACA:2, 000001:AAACAAAGAA:2, 000001:ACAAAGAAAA:2, 000001:AAACAAAAAT:2, 000001:TAAAAGAAAA:2, 000001:AAAAAATCAA:2, 000001:AAAAAATAAT:2, 000001:AAAACCAAAA:2, 000001:AAATAAAGAA:2, 000001:ATAAAAAACA:2, 000001:AAGAAATAAA:2, 000001:AAACCAAAAA:2, 000001:AAAAGGAAAA:2, 000001:AAGAAAGAAA:2, 000001:AAGTAAAAAA:2, 000001:AAAAAAGAAC:2, 000001:ACAAAAAATA:2, 000001:AACAAAATAA:2, 000001:AATGAAAAAA:2, 000001:ACTAAAAAAA:2, 000001:CAAAAAAAAC:2, 000001:AAAAAACGAA:2, 000001:AAATAAAAGA:2, 000001:GAAACAAAAA:2, 000001:AAAAGAAAGA:2, 000001:AAAATAAGAA:2, 000001:ACATAAAAAA:2, 000001:AAAGAAATAA:2, 000001:ATAAAAATAA:2, 000001:TAAAAAAAAT:2, 000001:AAAACATAAA:2, 000001:AAAAAAAAGG:2, 000001:AAAAAAAGAG:2, 000001:AAGAACAAAA:2, 000001:AAAAAAAATT:2, 000001:TAAAAAACAA:2, 000001:GAAATAAAAA:2, 000001:AAATCAAAAA:2, 000001:ACAAACAAAA:2, 000001:ACAATAAAAA:2, 000001:CAATAAAAAA:2, 000001:AAAAAAACCA:2, 000001:TAAAAAGAAA:2, 000001:ATAAAAAAAT:2, 000001:AAAAATTAAA:2, 000001:AGAAAAAATA:2, 000001:ACAACAAAAA:2, 000001:CAAAAAAACA:2, 000001:AAAAAAATGA:2, 000001:AAATAGAAAA:2, 000001:ACAAAAAAAC:2, 000001:AAAAACATAA:2, 000001:AAAAATAACA:2, 000001:GAAGAAAAAA:2, 000001:CAAAAAATAA:2, 000001:AAAAATAAAT:2, 000001:GACAAAAAAA:2, 000001:ATATAAAAAA:2, 000001:AAAAAACAAC:2, 000001:AAAAACACAA:2, 000001:AAAAACAGAA:2, 000001:TAAAAAAAGA:2, 000001:AAACAACAAA:2, 000001:TAAAAAAAAC:2, 000001:AACAAATAAA:2, 000001:AACCAAAAAA:2, 000001:AAAAAAGACA:2, 000001:AAAGAAAATA:2, 000001:CTAAAAAAAA:2, 000001:CAAAGAAAAA:2, 000001:AGTAAAAAAA:2, 000001:AAAAAAGAAG:2, 000001:CAAATAAAAA:2, 000001:ACAAAAACAA:2, 000001:AAAAACGAAA:2, 000001:AAAAAGCAAA:2, 000001:CAAAAGAAAA:2, 000001:AAGGAAAAAA:2, 000001:AAAAAAAATG:2, 000001:AAAATAACAA:2, 000001:AAAAAGAAAT:2, 000001:AATAAAGAAA:2, 000001:AACAAGAAAA:2, 000001:ATAAAGAAAA:2, 000001:AGAAAAAAGA:2, 000001:AAAAATAATA:2, 000001:AAAACACAAA:2, 000001:AGAGAAAAAA:2, 000001:AGACAAAAAA:2, 000001:AAAATAAAAC:2, 000001:AAAGAAAAGA:2, 000001:AAAAAAGATA:2, 000001:AAAAAAACGA:2, 000001:AAAGAATAAA:2, 000001:AAAAAAATAG:2, 000001:AAAATAATAA:2, 000001:TAGAAAAAAA:2, 000001:AACAAAAACA:2, 000001:GAAAAAAATA:2, 000001:AAAAAAAGAC:2, 000001:AAAATACAAA:2, 000001:GAAAAACAAA:2, 000001:AAAAAAATAT:2, 000001:AAAAGAGAAA:2, 000001:AAAGAAAAAC:2, 000001:AATAACAAAA:2, 000001:AAAAGAAAAT:2, 000001:AAATACAAAA:2, 000001:AAAAATAAAG:2, 000001:AAATGAAAAA:2, 000001:AAATAAAATA:2, 000001:AAAAAATGAA:2, 000001:AAAAAGAAAC:2, 000001:AAAAGAAAAC:2, 000001:AATAAAACAA:2, 000001:AAAAAAGAAT:2, 000001:AAAAAAAAGT:2, 000001:AAAAAGAGAA:2, 000001:AAAAACCAAA:2, 000001:CAAAAAAATA:2, 000001:TAAGAAAAAA:2, 000001:AAAAATAAGA:2, 000001:AAAAAACAAG:2, 000001:AAACGAAAAA:2, 000001:GAAAAAAAAT:2, 000001:AACATAAAAA:2, 000001:AAAGTAAAAA:2, 000001:CAGAAAAAAA:2, 000001:ACAAAAAGAA:2, 000001:AAAAAAAGCA:2, 000001:AAGACAAAAA:2, 000001:AAAGCAAAAA:2, 000001:ACAAAAATAA:2, 000001:CAAAACAAAA:2, 000001:AGAAAATAAA:2, 000001:AACAAAACAA:2, 000001:AAAAAAACAT:2, 000001:AGAAAAAGAA:2, 000001:ATAAAACAAA:2, 000001:AAAAATACAA:2, 000001:AAAAAAAGTA:2, 000001:TAAAACAAAA:2, 000001:ACAAAAAAAG:2, 000001:TAACAAAAAA:2, 000001:GAAAAAAAAC:2, 000001:GGAAAAAAAA:2, 000001:TAAAGAAAAA:2, 000001:AGAAGAAAAA:2, 000001:AAATTAAAAA:2, 000001:AAACAAAAGA:2, 000001:AAATAAAAAG:2, 000001:AATAAAAAGA:2, 000001:AAATAAAAAT:2, 000001:AAAGAACAAA:2, 000001:TATAAAAAAA:2, 000001:AAGAAAAAGA:2, 000001:AAAAAAAGGA:2, 000001:ATAAAAAATA:2, 000001:TAAAAACAAA:2, 000001:AACAGAAAAA:2, 000001:AAAAACAAAC:2, 000001:AAAAAAAATC:2, 000001:CAAAAAAAGA:2, 000001:AAAAAATAGA:2, 000001:AAAAAAATAC:2, 000001:GAGAAAAAAA:2, 000001:AAAAAGTAAA:2, 000001:AACAAAGAAA:2, 000001:AAACAAGAAA:2, 000001:ATAAAAGAAA:2, 000001:ACAAAACAAA:2, 000001:AAAAAAACAG:2, 000001:GAAAAATAAA:2, 000001:TAAAATAAAA:2, 000001:GAAAAAAGAA:2, 000001:ACAAAAAACA:2, 000001:AAAAGTAAAA:2, 000001:AATAAAAAAC:2, 000001:AAAAAACAGA:2, 000001:AAAGAAAACA:2, 000001:GAAAACAAAA:2, 000001:GAATAAAAAA:2, 000001:AGAAACAAAA:2, 000001:AAAAGAAAAG:2, 000001:AATTAAAAAA:2, 000001:GTAAAAAAAA:2, 000001:TAAAAATAAA:2, 000001:AAAATAAACA:2, 000001:TAAAAAAAAG:2, 000001:AAGAAAACAA:2, 000001:TAAAAAATAA:2, 000001:AGGAAAAAAA:2, 000001:CAAAAACAAA:2, 000001:AAAAGAAATA:2, 000001:ACAGAAAAAA:2, 000001:AAAGAAGAAA:2, 000001:AAAAACAATA:2, 000001:AAAGAAAGAA:2, 000001:AAAAAAACTA:2, 000001:AAGAATAAAA:2, 000001:AATAAGAAAA:2, 000001:AAAAATCAAA:2, 000001:TAAAAAAATA:2, 000001:AAAAAATAAG:2, 000001:AGAAAAGAAA:2, 000001:AATAAAAATA:2, 000001:AAATAAAACA:2, 000001:ATCAAAAAAA:2, 000001:ATAAAAAAAC:2, 000001:GAAAAAAAGA:2, 000001:ATAAGAAAAA:2, 000001:AACAAAAATA:2, 000001:AAAGGAAAAA:2, 000001:ACAAGAAAAA:2, 000001:CATAAAAAAA:2, 000001:AGAACAAAAA:2, 000001:AAGAAAAAAT:2, 000001:GAAAGAAAAA:2, 000001:AATAATAAAA:2, 000001:GAAAAAACAA:2, 000001:AAACACAAAA:2, 000001:GAAAAAATAA:2, 000001:AAAAACAAAT:2, 000001:AAACAAAACA:2, 000001:AATAAAAACA:2, 000001:AAAGACAAAA:2, 000001:AAAATAGAAA:2, 000001:AAAACAAGAA:2, 000001:AAAATTAAAA:2, 000001:AAAACAGAAA:2, 000001:AAAAACAAAG:2, 000001:ATAAACAAAA:2, 000001:AGCAAAAAAA:2, 000001:AACAACAAAA:2, 000001:AAAGAAAAAG:2, 000001:AAAAAACAAT:2, 000001:AAAAGAAACA:2, 000001:AAACTAAAAA:2, 000001:AACGAAAAAA:2, 000001:CGAAAAAAAA:2, 000001:AAATAATAAA:2, 000001:AATAAATAAA:2, 000001:AGAAAAACAA:2, 000001:AAGAAGAAAA:2, 000001:AATATAAAAA:2, 000001:AAACAGAAAA:2, 000001:GCAAAAAAAA:2, 000001:AAAAACAAGA:2, 000001:AGAATAAAAA:2, 000001:ATAAAAAAGA:2, 000001:AAATAAACAA:2, 000001:AAAAAAAACT:2, 000001:AAAAATAGAA:2, 000001:AAGAAACAAA:2, 000001:AAAAAACACA:2, 000001:AAAATAAAAT:2, 000001:CAAAAAAGAA:2, 000001:AAAATGAAAA:2, 000001:AGAAAAATAA:2, 000001:AAAGATAAAA:2, 000001:AAAAAATAAC:2, 000001:AAAACAAAGA:2, 000001:TCAAAAAAAA:2, 000001:AATAGAAAAA:2, 000001:AAACAAAAAG:2, 000001:AAAACAAAAT:2, 000001:AGAAAAAACA:2, 000001:ATAAAATAAA:2, 000001:ATAAAAAGAA:2, 000001:AACAAAAAAG:2, 000001:AAAAAGAACA:2, 000001:AAAAGCAAAA:2, 000001:AAAAAAGAGA:2, 000001:AAAAAAATCA:2, 000001:ATAGAAAAAA:2, 000001:AAAAAACTAA:2, 000001:AAAACAAATA:2, 000001:ACGAAAAAAA:2, 000001:AAGATAAAAA:2, 000001:AAAAAGACAA:2, 000001:AAAACTAAAA:2, 000001:AAGAAAAAAG:2, 000001:AATAAAAAAG:2, 000001:AAAAAAGGAA:2, 000001:ACACAAAAAA:2, 000001:AAAATAAATA:2, 000001:AAAATAAAGA:2, 000001:AGATAAAAAA:2, 000001:GAACAAAAAA:2, 000001:CAAAAAAAAG:2, 000001:ATAATAAAAA:2, 000001:AACAAACAAA:2, 000001:AAAAAACATA:2, 000001:TGAAAAAAAA:2, 000001:GAAAAGAAAA:2, 000001:AAAATCAAAA:2, 000001:AGAAAAAAAC:2, 000001:TAAACAAAAA:2, 000001:AACAAAAGAA:2, 000001:AAAAAGGAAA:2, 000001:AACACAAAAA:2, 000001:AAAAAATACA:2, 000001:AAAAATAAAC:2, 000001:AAATAAAAAC:2, 000001:AAAAAAGTAA:2, 000001:AATAAAAAAT:2, 000001:AAAACAATAA:2, 000001:AAAAAAGCAA:2, 000001:AGAAATAAAA:2, 000001:GAAAAAAAAG:2, 000001:AATACAAAAA:2, 000001:AATCAAAAAA:2, 000001:ACCAAAAAAA:2, 000001:CAAAAAGAAA:2, 000001:AAGAAAATAA:2, 000001:AATAAAATAA:2, 000001:ATAACAAAAA:2, 000001:AAAACGAAAA:2, 000001:GAAAAAGAAA:2, 000001:AAATAACAAA:2, 000001:ACAAAAAAAT:2, 000001:AAAAATATAA:2, 000001:AAATAAGAAA:2, 000001:CCAAAAAAAA:2, 000001:AACAAAAAAT:2, 000001:ATAAATAAAA:2, 000001:CAAGAAAAAA:2, 000001:AAACAAAATA:2, 000001:AAAAAAAACC:2, 000001:AAAAAGAAGA:2, 000001:AAAACAAAAG:2, 000001:ATAAAAAAAG:2, 000001:AAATAAATAA:2, 000001:TAATAAAAAA:2, 000001:CAAAATAAAA:2, 000001:AAAAAACCAA:2, 000001:TAAATAAAAA:2, 000001:AGAAAGAAAA:2, 000001:AAAGAAAAAT:2, 000001:AAAAAAACAC:2, 000001:AAAAAGATAA:2, 000001:AAGAAAAAAC:2, 000001:CAAACAAAAA:2, 000001:GAAAAAAACA:2, 000001:AAACAAATAA:2, 000001:AAGAAAAATA:2, 000001:AGAAAAAAAG:2, 000001:AAAAGACAAA:2, 000001:AAAAAGAAAG:2, 000001:AAAAAATTAA:2, 000001:AAAAAAAAGC:2, 000001:AAATATAAAA:2, 000001:AACAAAAAAC:2, 000001:AAAAAAATTA:2, 000001:AATAAACAAA:2, 000001:AAAGAGAAAA:2, 000001:AAAAAATATA:2, 000001:ACAAAAGAAA:2, 000001:AAGAAAAGAA:2, 000001:TAAAAAAGAA:2, 000001:GATAAAAAAA:2, 000001:AAAAGATAAA:2, 000001:ATTAAAAAAA:2, 000001:AAAAATGAAA:2, 000001:CAAAAAAAAT:2, 000001:AAGAGAAAAA:2, 000001:AAAATAAAAG:2, 000001:AAAAGAATAA:2, 000001:AAAAGAACAA:2, 000001:AACAAAAAGA:2, 000001:AACTAAAAAA:2, 000001:ACAAATAAAA:2, 000001:TTAAAAAAAA:2, 000001:AAAAACTAAA:2, 000001:AAGAAAAACA:2, 000001:AAGCAAAAAA:2, 000001:TACAAAAAAA:2, 000001:AAAAAAAAAA:0, 000001:TAAAAAAACA:2, 000001:AAAAGAAGAA:2, 000001:AAAATATAAA:2, 000001:ATAAAAACAA:2, 000001:ACAAAATAAA:2, 000001:AAAACAACAA:2, 000001:CAAAAAACAA:2, 000001:CACAAAAAAA:2, 000001:AAACAATAAA:2, 000001:AAAACAAACA:2 %#% 11:37:56 ( 0.0000 : 0.0000) Processed 1 sequences Average +length: 11 Total fuzzy comparisons: 812 %#% 11:37:56 ( 0.0000 : 0.0000) Finish Total Elapsed Time: 0 secs **** WordSize: 10 StringSize: 1000 WordCount: 1 StringCount: 1000 Xor +: 0 Trie: 5.86920619010925 (0.03125 + 5.83795619010925) **** WordSize: 10 StringSize: 1000 WordCount: 1 StringCount: 1000 Xor +: 0 Trie: 5.86920619010925 (0.03125 + 5.83795619010925) %#% 11:37:57 ### xor_fuzz search done. %#% 11:37:57 ( 0.0156 : 0.0156) Finish Total Elapsed Time: 0.015625 secs [11:37:57.14] P:\test\demerphq>

    The explaination of why your code only finds two matches, and attributes them to exact matches and mine finds those two + 810 more?

    Simple. Your Trie implementation will only report the first match it finds at any given position. If you delete the first key in test.words 'AAAAAAAAAA', then it will still report finding 'AAAAAAAAAA'! because every other key in that file can be considered a fuzzy match for 'AAAAAAAAAA'.

    What this means is that your code is only doing a fraction of the work that mine is doing (and that is required). In the case of the 10-char keys above, you code is only doing 1/406th (2.5%) of the required work. By the time you get to 25-char keys, this drops to 1/2701th (0.03%).

    For the explaination of those numbers, you need only go back and read my analysis , and make some attempt to understand those "...totally nonsensical equations and arguments supposedly proving how my approach just cant work, ..."

    So, for your "...but hey it outperforms yours..." claims. If you only do a tiny fraction of what is required--sure you'll do that quicker. Even that claim is bogus if you cannot build the datastructure required to achieve your "performance".

  2. Update: Section removed. I screwed up. 'B' is not a part of the alphabet 'ACGT' under test.
  3. There are myriad other things wrong with both your algorithm and your implementation of it.
    • Your home-brew timing code incorporates it's own runtime into the things under test.
    • Your logging code--in particular, the code that limits the number of matches output, for no good reason other than beautification--imposes a substantial overhead on both our algorithms.
      • By accumulating the matches in an array, instead of outputting them as the are found.
      • By spending time formatting (push @time, join, @_, sub sprintf ... join, map{ sprintf } @matches ) them into a single line, rather than just printing them!

      If you get around to performing your timings without that code in place you'll find that not only does it markedly alter the results, but that the impact of it, due to the differences in our algorithms (and nesting levels at which it is done), is disproportionate. I'll let you work out for yourself which algorithm is most affected.

    • Your protoype -v- production, Perl --v- C claims "...imagine what a String/Struct/Inline::C/Perl hybrid would do. Using probably less than 1/100th of the ram..." (since silently modified.) are also totally without basis.

      Even if you could achieve a 100:1 reduction in the size of the datastructure required to hold your MegaTrie/DFA (a more realistic figure would be 20:1), then you would still be (at 7GB), nearly an order of magnitude 3 1/2 times over budget for the 2GB/process RAM addressable by your average 32-bit processor.

All in all--eminently worthy of my "Utterly bollocked" private /msg.


I could gone on, but then this would seem like a hatchet job. It isn't. I've expended the best part of this last week on this analysis, on top of the time I put into that I had already done. That you so readily dismissed as "...totally nonsensical equations and arguments...".

This is the 5th revision of this document. Each time I've attempted to remove the vitriol from it. Reading it back now I see I still probably haven't achieved that as well as people will expect. Sorry, but every time I read this and this, it gets me so hot under the collar that it just comes out. So be it, whatever is left will stand. If this amount of effort is rewarded with downvotes that's fine by me.

Two final comments:

  1. It is a truism that "War always causes an arms race; technology always advances faster under war than under peace".

    And so it is. As a result of your pushing, and the penny clicking that an optimisation alluded to in the OP (Option #2), I have now improved the performance of my algorithm, especially on long sequences (strings) by almost an order of magnitude compared to the code I previously published. That still means that the OPs stated requirements will take a substantial amount of time, but then they are substantial requirements.

  2. I probably wouldn't have pursued that optimisation had you not so incensed me.

    Imagine what we could do if we worked together, instead of against each other? Of course, it seems to be human nature that we (man in general; and you and I specifically), cannot make this transition.

    I think that I see a way to combine the XOR method with a (series of small) Tries such that it would achieve an even faster solution whilst staying within the realms of reality as far as memory consumption is concerned. I've implemented a generic, pure Perl Trie that uses substantially less memory (though it is slightly slower than yours).

    If things pan out, you might hear more on that. In any case, the Trie implementation has some rather interesting other uses, so you might here more about that anyway.


Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re^5: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really, really!) by BrowserUk
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.