Well BrowserUk, I'm afraid the numbers just don't agree with you at all.
No. Your figures don't agree. Your very selective figures.
The OP presented the question in the form
The basic problem:
Matching a 25-letter string against a dictionary of about 30 thousand variable length sequences. I need to find all:
- exact
- fuzzy (1- and 2-letter mismatches)
within the dictionary.
Performance matters because I have a library of several hundred thousand 25-letter strings. And because I'll be repeating this analysis for all pairwise combinations of about 10 search and 20 target dictionaries.
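For context on the technique being benchmarked below: the Xor approach rests on Perl's bitwise string XOR, where identical bytes XOR to NUL, so counting the non-NUL bytes of the result gives the mismatch count directly. A minimal sketch under those terms (the helper names mismatches and fuzzy_hits are mine for illustration, not from the benchmark script):

```perl
use strict;
use warnings;

# Count the positions at which two equal-length strings differ.
# Matching bytes XOR to "\0"; tr/\0//c counts the non-NUL bytes.
sub mismatches {
    my ( $x, $y ) = @_;
    die "length mismatch" unless length($x) == length($y);
    return ( $x ^ $y ) =~ tr/\0//c;
}

# Slide a key across a sequence; report every offset where the key
# matches with at most $max_miss single-character mismatches.
sub fuzzy_hits {
    my ( $key, $seq, $max_miss ) = @_;
    my $klen = length $key;
    my @hits;
    for my $pos ( 0 .. length($seq) - $klen ) {
        my $m = mismatches( $key, substr( $seq, $pos, $klen ) );
        push @hits, [ $pos, $m ] if $m <= $max_miss;
    }
    return @hits;
}
```

With a 25-char key this covers the OP's exact (0 mismatches) and fuzzy (1- or 2-mismatch) cases by setting $max_miss accordingly.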
I've highlighted what I consider the significant figures in that quote (my opinion is as good as any other, since the OP hasn't returned to clarify matters).
Your table is very interesting. Particularly in what it doesn't show.
The first thing I noticed was that the biggest number in the OP's presentation of the problem was the one you chose to vary through the most limited range.
He specified, clearly, 100,000 25-char keys ("words" as you refer to them). You only varied this value through a range of 1, 2, 5 & 10. I wondered why.
The following is the output from your unmodified program, apart from the addition of one extra line of diagnostics which I'll post at the bottom.
[ 5:40:10.75] P:\test\demerphq>del *.fuzz     rem Ensure a clean plate

[ 5:40:16.21] P:\test\demerphq>dir
 Volume in drive P has no label.
 Volume Serial Number is BCCA-B4CC

 Directory of P:\test\demerphq

18/11/2004  05:40    <DIR>          .
18/11/2004  05:40    <DIR>          ..
18/11/2004  05:38             9,597 demerphq.pl
               1 File(s)          9,597 bytes
               2 Dir(s)  48,390,365,184 bytes free

[ 5:40:21.90] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=1 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 1 StringCount: 1000 Xor: 4.453125 Trie: 10.390625 (1.859375 + 8.53125)
**** perl.exe  416  0   12,320 K

[ 5:40:43.59] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=10 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 10 StringCount: 1000 Xor: 21.0625 Trie: 95.796875 (2.328125 + 93.46875)
**** perl.exe  3196  0   78,164 K

[ 5:42:51.90] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=100 2>&1 | find "****"
#### The above run self-terminated (crashed) because it was out of memory!!!
####

[ 5:49:20.62] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=20 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 20 StringCount: 1000 Xor: 41.453125 Trie: 195.71875 (2.5 + 193.21875)
**** perl.exe  2708  0  149,812 K

[ 5:53:46.56] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=30 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 30 StringCount: 1000 Xor: 59.59375 Trie: 293.75 (2.59375 + 291.15625)
**** perl.exe  3504  0  222,616 K

[ 6:00:00.59] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=40 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 40 StringCount: 1000 Xor: 79.265625 Trie: 404.96875 (2.734375 + 402.234375)
**** perl.exe  3412  0  293,504 K

[ 6:13:48.76] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=50 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 50 StringCount: 1000 Xor: 97.546875 Trie: 494.3125 (2.796875 + 491.515625)
**** perl.exe  3524  0  366,868 K
As you can see, I filtered the output to show just the headline numbers, and the extra line of diagnostics I added. I've also hand wrapped the output for the usual reasons.
The bottom line here is that if you have the resources, long term you are going to be better off with a technique like this than a technique like yours.
If you look at the third run I attempted above, you'll see that I annotated it as running out of memory. I did a little math based on the memory consumption figures output using the MS tasklist.exe utility.
That's pretty consistent, and linear, at > 70 MB / 10 keys.
So, calculating the RAM required to build the MegaTrie DFA that your algorithm depends on for its speed, at roughly 7 MB per key for the OP's 100,000 keys, comes out to something on the order of 700 GB.
As no $5000 system I am aware of is capable of addressing that volume of RAM, the question of whether the sequences are 1,000 or 10,000 chars long, and which algorithm would be faster if that were the OP's requirement, simply doesn't arise: your algorithm never gets going, because you cannot build your MegaTrie/DFA!
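The extrapolation can be checked from the peak working-set figures reported by tasklist in the runs above. A quick sketch (assuming the slope between the 10-key and 50-key runs is representative of the growth per key):

```perl
use strict;
use warnings;

# Peak working-set sizes (KB) reported by tasklist for each run above,
# keyed by the number of 25-char keys loaded into the trie.
my %peak_kb = (
    1  =>  12_320,
    10 =>  78_164,
    20 => 149_812,
    30 => 222_616,
    40 => 293_504,
    50 => 366_868,
);

# Slope between the 10-key and 50-key runs: memory growth per extra key.
my $kb_per_key = ( $peak_kb{50} - $peak_kb{10} ) / ( 50 - 10 );

# Extrapolate linearly to the OP's stated library of 100,000 keys.
my $keys   = 100_000;
my $est_gb = $kb_per_key * $keys / ( 1024 * 1024 );

printf "~%.1f MB per key, ~%.0f GB for %d keys\n",
    $kb_per_key / 1024, $est_gb, $keys;
# prints: ~7.0 MB per key, ~688 GB for 100000 keys
```

Roughly 7 MB per key, hence hundreds of gigabytes for the full key set; this is only a straight-line fit to the six measured points, but the measured growth is close enough to linear that the order of magnitude is hard to argue with.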
Update: I forgot to post the one line I changed in your program:
## This:
exit(main(@ARGV));

## became this:
main(@ARGV);
print "**** $_" for `tasklist /NH /FI "PID eq $$"`;
exit(0);
Type tasklist /? on your local XP system for an explanation of the parameters.
In reply to Re^5: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really!) by BrowserUk
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki