Well BrowserUk, I'm afraid the numbers just don't agree with you at all.
No. Your figures don't agree. Your very selective figures.
The OP presented the question in the form
The basic problem:
Matching a 25-letter string against a dictionary of about 30 thousand variable length sequences. I need to find all:
- exact
- fuzzy (1- and 2-letter mismatches)
within the dictionary.
Performance matters because I have a library of several hundred thousand 25-letter strings. And because I'll be repeating this analysis for all pairwise combinations of about 10 search and 20 target dictionaries.
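For context on the technique being benchmarked below: the Xor approach rests on Perl's bitwise string XOR, where identical bytes XOR to NUL, so counting the non-NUL bytes of the result gives the mismatch count directly. A minimal sketch under those terms (the helper names mismatches and fuzzy_hits are mine for illustration, not from the benchmark script):

```perl
use strict;
use warnings;

# Count the positions at which two equal-length strings differ.
# Matching bytes XOR to "\0"; tr/\0//c counts the non-NUL bytes.
sub mismatches {
    my ( $x, $y ) = @_;
    die "length mismatch" unless length($x) == length($y);
    return ( $x ^ $y ) =~ tr/\0//c;
}

# Slide a key across a sequence; report every offset where the key
# matches with at most $max_miss single-character mismatches.
sub fuzzy_hits {
    my ( $key, $seq, $max_miss ) = @_;
    my $klen = length $key;
    my @hits;
    for my $pos ( 0 .. length($seq) - $klen ) {
        my $m = mismatches( $key, substr( $seq, $pos, $klen ) );
        push @hits, [ $pos, $m ] if $m <= $max_miss;
    }
    return @hits;
}
```

With a 25-char key this covers the OP's exact (0 mismatches) and fuzzy (1- or 2-mismatch) cases by setting $max_miss accordingly.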
I've highlighted what I consider the significant figures in that quote (my opinion is as good as any other, since the OP hasn't returned to clarify matters).
Your table is very interesting. Particularly in what it doesn't show.
The first thing I noticed was that the biggest number in the OP's presentation of the problem was the one you chose to vary through the most limited range.
He specified, clearly, 100,000 25-char keys ("words" as you refer to them). You only varied this value through a range of 1, 2, 5 & 10. I wondered why.
The following is the output from your unmodified program, apart from the addition of one extra line of diagnostics which I'll post at the bottom.
[ 5:40:10.75] P:\test\demerphq>del *.fuzz     rem Ensure a clean plate

[ 5:40:16.21] P:\test\demerphq>dir
 Volume in drive P has no label.
 Volume Serial Number is BCCA-B4CC

 Directory of P:\test\demerphq

18/11/2004  05:40    <DIR>          .
18/11/2004  05:40    <DIR>          ..
18/11/2004  05:38             9,597 demerphq.pl
               1 File(s)          9,597 bytes
               2 Dir(s)  48,390,365,184 bytes free

[ 5:40:21.90] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=1 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 1 StringCount: 1000 Xor: 4.453125 Trie: 10.390625 (1.859375 + 8.53125)
**** perl.exe  416  0   12,320 K

[ 5:40:43.59] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=10 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 10 StringCount: 1000 Xor: 21.0625 Trie: 95.796875 (2.328125 + 93.46875)
**** perl.exe  3196  0   78,164 K

[ 5:42:51.90] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=100 2>&1 | find "****"
#### The above run self-terminated (crashed) because it was out of memory!!!
####

[ 5:49:20.62] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=20 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 20 StringCount: 1000 Xor: 41.453125 Trie: 195.71875 (2.5 + 193.21875)
**** perl.exe  2708  0  149,812 K

[ 5:53:46.56] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=30 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 30 StringCount: 1000 Xor: 59.59375 Trie: 293.75 (2.59375 + 291.15625)
**** perl.exe  3504  0  222,616 K

[ 6:00:00.59] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=40 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 40 StringCount: 1000 Xor: 79.265625 Trie: 404.96875 (2.734375 + 402.234375)
**** perl.exe  3412  0  293,504 K

[ 6:13:48.76] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=50 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 50 StringCount: 1000 Xor: 97.546875 Trie: 494.3125 (2.796875 + 491.515625)
**** perl.exe  3524  0  366,868 K
As you can see, I filtered the output to show just the headline numbers, and the extra line of diagnostics I added. I've also hand wrapped the output for the usual reasons.
The bottom line here is that if you have the resources, long term you are going to be better off with a technique like this than a technique like yours.
If you look at the third run I attempted above, you'll see that I annotated it as running out of memory. I did a little math based on the memory consumption figures output using the MS tasklist.exe utility.
That's pretty consistent, and linear, at > 70 MB / 10 keys.
So, calculating the RAM required to build the MegaTrie DFA that your algorithm depends on for its speed, at roughly 7 MB per key for the OP's 100,000 keys, comes out to something on the order of 700 GB.
As no $5000 system I am aware of is capable of addressing that volume of RAM, the question of whether the sequences are 1,000 or 10,000 chars long, and which algorithm would be faster if that were the OP's requirement, simply doesn't arise: your algorithm never gets going, because you cannot build your MegaTrie/DFA!
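The extrapolation can be checked from the peak working-set figures reported by tasklist in the runs above. A quick sketch (assuming the slope between the 10-key and 50-key runs is representative of the growth per key):

```perl
use strict;
use warnings;

# Peak working-set sizes (KB) reported by tasklist for each run above,
# keyed by the number of 25-char keys loaded into the trie.
my %peak_kb = (
    1  =>  12_320,
    10 =>  78_164,
    20 => 149_812,
    30 => 222_616,
    40 => 293_504,
    50 => 366_868,
);

# Slope between the 10-key and 50-key runs: memory growth per extra key.
my $kb_per_key = ( $peak_kb{50} - $peak_kb{10} ) / ( 50 - 10 );

# Extrapolate linearly to the OP's stated library of 100,000 keys.
my $keys   = 100_000;
my $est_gb = $kb_per_key * $keys / ( 1024 * 1024 );

printf "~%.1f MB per key, ~%.0f GB for %d keys\n",
    $kb_per_key / 1024, $est_gb, $keys;
# prints: ~7.0 MB per key, ~688 GB for 100000 keys
```

Roughly 7 MB per key, hence hundreds of gigabytes for the full key set; this is only a straight-line fit to the six measured points, but the measured growth is close enough to linear that the order of magnitude is hard to argue with.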
Update: I forgot to post the one line I changed in your program:
## This:
exit(main(@ARGV));

## became this:
main(@ARGV);
print "**** $_" for `tasklist /NH /FI "PID eq $$"`;
exit(0);
Type tasklist /? on your local XP system for an explanation of the parameters.
In reply to Re^5: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really!) by BrowserUk
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki