Algorithm Showdown: Fuzzy Matching

A while back someone asked about the best way to do "Fuzzy Matching" on a bulk scale. He wanted to find every possible "fuzzy match" of a set of fixed width strings of a restricted alphabet (ACGT) in a set of DNA data. By "fuzzy match" he meant a match where up to a certain number of chars in the matched string could be different. This node discusses what is involved setting up a reasonably rigourous comparison of the different solutions at hand and shows some comparison results of four solutions currently known to me. In replies we can cover each solution that comes up.

Now first off in order to even begin to compare a set of solutions to this problem we will need to define a common interface. We will use perl OO to make this as painless as possible. We will do this by defining a base class that represents the basic behaviour of a "fuzzy matcher". This class has the following methods:

my $obj=Fuzzy::Matcher->new($fuzz,$strlen,$words,$wordsize);: We need a way to create a new fuzzy matcher, and of course we will call it new() as convention demands. We specify that the interface should accept four possibly relevent parameters so that we can cater to as many specific implementations as possible with changing the constructor behavior
my $obj=Fuzz::Matcher->_init();: I find its useful when using inheritance to keep the actual constructor (where bless is called) seperate from a follow up init hook as it means you can avoid messing around with SUPER when you need to change the initialization behaviour of a subclass. In the baseclass this does nothing. We use a single '_' to denote that its "private" (or "friend") and not really meant to be called from outside of the class heirarchy. Its called in void context by new() immediately before returning the object to the user with any additional arguments that were provided to new() over those it is defined to accept. It should die on error.
my $obj=$Fuzz_Matcher->fuzz_store($str);: Tell the matcher it should match $str in a search. Returns nothing. Used to load the object with the strings being searched for.
my $obj=$Fuzzy_Matcher->prepare($str);: Its quite possible that a fuzzy matcher implementation could need to do some kind of pre-search preparation once it knows no more strings will be added to it. This method provides that hook. Before searching all users of the object should call this first. Returns nothing.
my $obj=$Fuzzy_Matcher->fuzz_search($str);: Search $str for any fuzzy matches that are present. It should return a list of triplets, $ofs, $diff, $str that represent each match. $ofs is the position of the matched string in $str. The number of characters different is returned in $diff and the string matched is returned in $str. It should return the empty list if there were no matches. The user is expected to call prepare() on the object before calling this method. If they have not done so the behaviour is undefined.

And with that simple definition (and even more useful, the code it represents) any different implementations that people come up with may be easily cross tested against each other. Isnt OO fun :-)

Next we need a tool to create tests. We will want to test with a variety of different combinations of input strings and search strings so we will pull out Getopt::Long and whip together "make_test.pl" that will produce a single file that represents one set of such tests. Naturally once we have a bunch of test files we need something to do with them. For that we put together "test_fuzz.pl". We want to do testing in a pristine memory condition so we use some tricks to allow test_fuzz.pl to spawn copies of itself for individual tests and then use the outer instance to aggregate the results. And naturally we will want some pretty tables of the results. :-)

We will want to control our testing in a few ways. One of the more important being how long we will allow tests to run for. If we are doing a large batch and we know that a given solution is intolerably slow at a given scale, then tests for a larger scale will be likewise. Thus we will want to provide a way to skip tests that take too long. The --limit parameter allows us to do this.

Once we put all of this together we can produce our comparisons. We generate a bunch of test files of different types and then let test_fuzz.pl rip.

Heres a table of the data (slightly munged for html purposes) we get back. The 'N/A' are because the test too longer than 100 seconds the previous try. (It resets when the wordcount changes.) "SC" stands for string count, "SL" for string length, "F" for Fuzz, "WC" for word count and "WL" for word length.

Test File	DFA	Ysth	Xor2	Xor
FT-SC0100-SL0001000-F02-WC0000001-WL025	0.0181	2.3485	0.0322	0.3562
FT-SC0100-SL0001000-F02-WC0000005-WL025	0.0225	2.3035	0.1211	0.9044
FT-SC0100-SL0001000-F02-WC0000010-WL025	0.0321	2.3261	0.2310	1.5358
FT-SC0100-SL0001000-F02-WC0000050-WL025	0.0918	2.3370	1.1244	6.7897
FT-SC0100-SL0001000-F02-WC0000100-WL025	0.1540	2.3811	2.3071	13.5992
FT-SC0100-SL0001000-F02-WC0000500-WL025	0.6198	2.4711	11.2617	64.8216
FT-SC0100-SL0001000-F02-WC0001000-WL025	1.1031	2.6095	23.1594	133.1065
FT-SC0100-SL0001000-F02-WC0005000-WL025	4.0511	2.8281	113.3197	N/A
FT-SC0100-SL0001000-F02-WC0010000-WL025	6.5496	3.1301	N/A	N/A
FT-SC0100-SL0001000-F02-WC0050000-WL025	15.7548	5.1598	N/A	N/A
FT-SC0100-SL0001000-F02-WC0100000-WL025	20.7235	7.5497	N/A	N/A
FT-SC0100-SL0001000-F02-WC0500000-WL025	67.3387	25.3479	N/A	N/A
FT-SC0100-SL0001000-F02-WC1000000-WL025	149.0815	48.0690	N/A	N/A
FT-SC0100-SL0010000-F02-WC0000001-WL025	0.1392	26.7417	0.2729	3.7779
FT-SC0100-SL0010000-F02-WC0000005-WL025	0.1883	26.8920	1.0554	9.2859
FT-SC0100-SL0010000-F02-WC0000010-WL025	0.3048	28.7853	2.0361	16.3141
FT-SC0100-SL0010000-F02-WC0000050-WL025	0.2431	27.1403	9.7107	69.3049
FT-SC0100-SL0010000-F02-WC0000100-WL025	0.3048	27.4148	19.4087	134.0691
FT-SC0100-SL0010000-F02-WC0000500-WL025	0.7760	28.6271	96.4305	N/A
FT-SC0100-SL0010000-F02-WC0001000-WL025	1.2827	29.1948	192.0604	N/A
FT-SC0100-SL0010000-F02-WC0005000-WL025	4.4403	30.6293	N/A	N/A
FT-SC0100-SL0010000-F02-WC0010000-WL025	7.2494	31.9665	N/A	N/A
FT-SC0100-SL0010000-F02-WC0050000-WL025	18.0771	41.9295	N/A	N/A
FT-SC0100-SL0010000-F02-WC0100000-WL025	25.1417	47.7157	N/A	N/A
FT-SC0100-SL0010000-F02-WC0500000-WL025	155.8075	N/A	N/A	N/A
FT-SC0100-SL0010000-F02-WC1000000-WL025	N/A	N/A	N/A	N/A
FT-SC0100-SL0100000-F02-WC0000001-WL025	1.0911	536.8813	2.9989	37.2198
FT-SC0100-SL0100000-F02-WC0000005-WL025	1.0271	N/A	11.4269	91.2324
FT-SC0100-SL0100000-F02-WC0000010-WL025	1.0351	N/A	21.8465	161.7554
FT-SC0100-SL0100000-F02-WC0000050-WL025	1.4329	N/A	105.2142	N/A
FT-SC0100-SL0100000-F02-WC0000100-WL025	1.5358	N/A	N/A	N/A
FT-SC0100-SL0100000-F02-WC0000500-WL025	2.0794	N/A	N/A	N/A
FT-SC0100-SL0100000-F02-WC0001000-WL025	2.7480	N/A	N/A	N/A
FT-SC0100-SL0100000-F02-WC0005000-WL025	8.1556	N/A	N/A	N/A
FT-SC0100-SL0100000-F02-WC0010000-WL025	13.3646	N/A	N/A	N/A
FT-SC0100-SL0100000-F02-WC0050000-WL025	39.0437	N/A	N/A	N/A
FT-SC0100-SL0100000-F02-WC0100000-WL025	69.8990	N/A	N/A	N/A
FT-SC0100-SL0100000-F02-WC0500000-WL025	1050.3425	N/A	N/A	N/A
FT-SC0100-SL0100000-F02-WC1000000-WL025	N/A	N/A	N/A	N/A

All the files involved will be attached via replies to keep the root a reasonable size. Discussion of the various approaches can follow those nodes.

Cheers

---
demerphq

Comment on Algorithm Showdown: Fuzzy Matching Download Code

Replies are listed 'Best First'.
Re: Algorithm Showdown: Fuzzy Matching by BrowserUk (Patriarch) on Dec 10, 2004 at 00:44 UTC
A few questions Why is this table in the root node? Why put the original, superceded Xor in that table and not the original Trie? Why the arbitrary 100 second cutoff? Why not show the build time for the DFA solution--which isn't required by any of the other solutions in the table. What is the procedure for updating the OP to include data from improved, corrected implementations? Where are the memory consumption statistics? Why is the test harness so horribly complicated? I have a file of things to search, and file of things to search for, a fuzziness number. `fsearch( \$haystacks, \$needles, $fuzz, \output );` [download] Or, I have this array of strings to search, and this array of keys to search for, at this level of fuzziness. `my @results = fsearch( \@haystacks, \@needles, $fuzz );` [download] This is inherently a functional interface. All the rest is cruft. Needless to say, I think the test methodology is flawed. 533 lines of code to do `use Benchmark qw[ cmpthese ]; cmpthese 1, \%tests;` [download] More importantly, the selective presentation of statistics in the root node of the "shootout", comparing old, first-cut code, written in an hour and long since improved and superceded; with code hand crafted, in C over a 10 days, using someone else brilliant, original idea, has devalued this exercise completely. Examine what is said, not who speaks. The end of an era!* "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen "Think for yourself!" - Abigail "Time is a poor substitute for thought"--theorbtwo "Efficiency is intelligent laziness." -David Dunham "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply] [d/l] [select]
Re^2: Algorithm Showdown: Fuzzy Matching by demerphq (Chancellor) on Dec 10, 2004 at 09:15 UTC
Why is this table in the root node? Seems like as good a place as any. Ill move it to a reply if you would prefer althought i dont see why. Why put the original, superceded Xor in that table and not the original Trie? I used your interpretation of the result requirements. Since by that interpretation my original _proof_of_concept_ implementation is not up to scratch I didnt see any point in including it. I could include whole hosts of solutions that dont meet the requirements but it wouldnt seem to add much to the debate. Also I used Xor.pm as the baseline implementation because of some comments you made in the other thread. Any proposed solution has to get the same results as Xor.pm or its probably broken. Thats one very nice advantage of a brute force algorithm, you can use them to test more sophisticated approaches. Why the arbitrary 100 second cutoff? Because with that time restriction we can see the relative performance of the different implementations without having to wait forever to complete a test run. What utility would have been added to this comparison by converting those N/A's into large numbers? Id say not that much. Feel free to run the test suite with a larger limit, or without a limit at all. I suspect youll get extremely bored of waiting for the Xor2? variants to complete on the larger test sets. I played aorund with 60 seconds, and also larger limits like 200 and it didnt seem to make much difference. 100 seconds seemed a reasonable limit. Why not show the build time for the DFA solution--which isn't required by any of the other solutions in the table. The time shows the total time, search time plus construct time to process a test file. Run the test suite if you want more precise times. Alternatively i can follow up with some additional data if you feel its necessary. And Ysth's algorithm has a prepare phase that is under some cirucumstances faster than the DFA but for others slower. Including the prepare time in the totals seemed the only fair way to do it, even though it disadvantages Ysth.pm and DFA.pm as they both can reuse their prepare time on later searches. For instance both objects could be provided with means to self serialize themselves and save most of the construct time. In order to avoid accusations of being unfair I didnt allow for such time saving in the comparison. Should I change things to allow for this? What is the procedure for updating the OP to include data from improved, corrected implementations? You can either reply or I can arrange to change the ownership of Xor.pm or Xor2.pm to you. Personally i think a reply might be better as itll allow people to see how the algorithm has evolved. But its up to you. Update:Doh. You mean the chart. Post your results and if it seems like it makes more sense to merge them into my chart than leave them as posted then i will. Just tell me what you want to do when you post the results. Also, if you come up with new variants maybe just give them new names so they can be independently tested against the rest. Where are the memory consumption statistics? Not aggregated into that chart. (Sorry I didnt think of it.) When the tests run each objects memory utilization is printed out. Please just run test_fuzz.pl as appropriate. Ill redo the test run today and change the code to include memory utilization in the output. On my machine Ysth loses that fight pretty much every time. Why is the test harness so horribly complicated? The harness is for doing bulk comparisons of different algorithms. Its also designed to ensure that the first string searched in each file is cross compared to the other result checks. Not only that but its designed to test in such a way that each test file/class combination is executed in its own process space to ensure that one algorithm cant contaminate the results of the next. A naive Benchmark for instance wouldnt do this. As an example, if DFA runs first and allocates 2MB for itself and then Xor runs Xor gets to reuse the memory used by DFA, saving itself a chunk of resource allocation costs. By executing each test run in its own process we dont have to worry about this problem. And IMO the test harness isnt that complicated. Must of the "crap" in there is for printing out nice charts of the results. Chop all of that out and test_fuzz.pl goes to half the size. Also its not like anybody directly involved in this showdown is a novice programmer so they can review tinker with and change the test suite as they like. For instance I was kinda hoping you would follow up with patches to make_test.pl so that it could be used to produce "high density return" test sets that you said would favour your algorithm. I could have done it myself but at a certain point i think its better to publish and wait for response than try to anticipate every possibile critique that can come up. This is inherently a functional interface. All the rest is cruft. I disagree. In order to meaningfully compare different implementations with different requirements its necessary to set a baseline where the most participants can play. Also IMO its much more likely that if this code gets released into the wild it would be done so as an OO module. For instance how useful is your proposed interface if you want to use data from a DB? Or generate it computationally? Id say not at all. Needless to say, I think the test methodology is flawed. 533 lines of code to do So would that actually test the algorithms? If I had used that then the orignal instance of Xor2 would have seemd to pass test, when it fact it failed test and had a bug. I mean if we just use benchmark for running our tests then my solution would be `fuzz_search { return }`. Id like to see you beat that :-) I think if you want to come up with a different test framework then do so. Please just post it here so we can review it. But if it doesnt actually test the return results then I dont think Ill consider it much of an improvement. More importantly, the selective presentation of statistics in the root node of the "shootout", comparing old, first-cut code, written in an hour and long since improved and superceded; with code hand crafted, in C over a 10 days, using someone else brilliant, original idea, has devalued this exercise completely. I dont see whats selective here. I have presented all of the working results I have, as well as all of the code involved. Please feel free to take as long as you like to come up with something better. And yes ysth made a brilliant observation, and yes i was happy to take advantage of it as it presented the perfect way to integrate my proposal (using a DFA as much as possible) and still meet your "every possible match at every possible offset" requirement. If presenting a comprehensive summary of four solutions to this problem "devalues" the exercise then I suspect we have different motives here. My motive was to find an optimal soltuion to the problem at hand not prove that any given proposal of mine was optimal. (Although I will admit to a certain satisfaction that it turns out a solution using a DFA appears to win this fight I would still have published regardless.) --- demerphq	[reply] [d/l]
A reply falls below the community's threshold of quality. You may see it by logging in.
Re: Algorithm Showdown: Fuzzy Matching (Files) by demerphq (Chancellor) on Dec 09, 2004 at 22:09 UTC
I'll use this as the root of the files section of this thread. --- demerphq	[reply]
Re^2: Algorithm Showdown: Fuzzy Matching (Matcher.pm) by demerphq (Chancellor) on Dec 09, 2004 at 22:12 UTC
This is the base class. It provides some hopefully useful default behaviour and provides the interface definition for any Fuzzy::Matcher implementations. Read more... (2 kB)	[reply] [d/l]
Re^3: Algorithm Showdown: Fuzzy Matching (Ysth.pm) by demerphq (Chancellor) on Dec 09, 2004 at 22:17 UTC
This is a solution by ysth. It exploits the fact that if a word of M characters is to be matched with N fuzz then if you were to divide the word into N+1 parts at least one of them would have to match exactly. He uses hashes and the regex engine to do his stuff quickly. This appears to scale reasonably well. Ill leave it to ysth to explain in more detail. Read more... (2 kB)	[reply] [d/l]
Re^3: Algorithm Showdown: Fuzzy Matching (Xor.pm) by demerphq (Chancellor) on Dec 09, 2004 at 22:14 UTC
This is one of the earlier proposals for solving the problem. It uses a brute for attack using Xor. Its quite simple. Read more... (1099 Bytes)	[reply] [d/l]
Re^3: Algorithm Showdown: Fuzzy Matching (DFA.pm) by demerphq (Chancellor) on Dec 09, 2004 at 22:28 UTC
This version is my implementation of ysths idea. Like him I split the input strings into Fuzz+1 parts and then use something like Xor to count the difference. But i use a DFA to find all of the exact matches. This is a hybrid Perl/Inline::C solution, with the storage and memory management done pretty much entirely in Perl, and the search routine done in Inline::C. It is slow to prepare() but searches are typically extremely quick. Like ysths solution on degenerate datasets it performs as well/badly as Xor.pm but these cases would seem to be unusual. The larger the ratio of WordChars/Fuzz is the faster this algorithm will perform (and the more memory it will use.) Updated:Patches applied. Fix searching strings with non alphabet chars in them. Make providing the alphabet optional. Fix error where a UV was being assigned a negative number. Minor C cleanup. Read more... (18 kB)	[reply] [d/l]
Re^4: Algorithm Showdown: Fuzzy Matching (DFA.pm) by demerphq (Chancellor) on Dec 11, 2004 at 11:36 UTC
Re^3: Algorithm Showdown: Fuzzy Matching (Xor2.pm) by demerphq (Chancellor) on Dec 09, 2004 at 22:20 UTC
This is an improved version of the Xor attack. I will leave it to BrowserUk to explain in more detail. Read more... (2 kB)	[reply] [d/l]
Re^2: Algorithm Showdown: Fuzzy Matching (Files) by demerphq (Chancellor) on Dec 09, 2004 at 22:31 UTC
And here is 'test_fuzz.pl' Read more... (18 kB)	[reply] [d/l]
Re^2: Algorithm Showdown: Fuzzy Matching (Files) by demerphq (Chancellor) on Dec 09, 2004 at 22:29 UTC
Here is make_test.pl Read more... (3 kB)	[reply] [d/l]
Re: Algorithm Showdown: (A litany of failures) by BrowserUk (Patriarch) on Dec 11, 2004 at 07:40 UTC
Okay. You, and a few others seem to think that my attempt to walk away from this implies concession and sour grapes. It doesn't. I should have known (myself) better than to think I could anyway. Here are a few (more) reasons that I reject this second round of "pissing contest": Failure 1 - a really simple test. My code works. Ysth's code works. Your's fails, but it does it really fast! Read more... (9 kB) Failure 2 -- use a slightly different alphabet. My code works and finds no matches. Ysth's code works and finds no matches. Your's produces fiftytwo thousand errors, really fast. Read more... (3 kB) Failure 3 -- verify the number of matches found is correct. For this I use the set of all 2701 possible, 25-char, 2-mismatch fuzzy matches for the subsequence 'A' x 25. By making the test sequence(s) all 'A's, then every key will match at every offset in the sequence, in every sequence. This allows me to verify that all possible positions are checked, all possible fuzzy matches are matched and allows a simple calculation to verify the total match count from an implementation. For an 'A'x1000-char sequence, and 25-char x 2-mismatch keys = (1000-25+1) * 2701 = 2636176 matches. Read more... (606 Bytes) Yep! Your code sure is quick, but look at the memory consumption! What happens if I push that sequence to 'A'x10,000 or 'A'x100,000? Or double the number of keys? Or increase the fuzz factor? Mine handles it--slowly. ysth's handles it--slowly. Yours, churns and churns, and fails-- really, really slowly. Eventually being terminated by the system due to lack of virtual memory. I have 1GB available. Read more... (667 Bytes) And I do hope that all those that disapproved of my attempt to walk away from this intensely frustrating waste of effort will notice that last test alone required nearly 5 hours of my time--at 99% cpu usage making doing anything else productive impossible. I wonder how many of those that have demonstrated their disapproval have even bothered to read the crufty code I'm having to pick apart and debunk? Unlike demerphq, I don't abandon a test after less than two minutes just cos it's inconvenient! Failure 4 -- mixed length fragments This is harder to demonstrate convincingly, and easily dismissed by reference back to the OP with it's description of 25-char sequences. However, every description I have found of the processes (Digestion and Shredding) that biochemists use in the Cloning, Mapping and Sequencing of DNA, produces large sets of fragments, of variable length. Every version of my algorithm published, and my latest, most effcient version which is one or two orders of magnitude more efficient than the best so far published, happily deals with every key having a different length. Indeed, the -KEYLEN=25 parameter shown in the examples above is internally overridden and only used on my versions for consistancy. As I stated elsewhere, I can make my code run much more quickly, by making the same assumption as both yours, and ysth's code (which he openly noted) make. I have avoided doing this because I want a generally applicable algorithm, whether for genome work or fuzzy name matching, or looking for postcodes or any other type of fuzzy match where the search keys can be any mixed bag of words. You mentioned somewhere that your algorithm could be adapted to do this, but as it relies heavily for at least part of its performance upon ysth's multi-indexing mechanism, and he has stated that his algorthm would be affected, I can only assume that your implementation would be also. The ethics of comparative studies Before any comparsions are made, and results are published, it is pretty standard practice to ensure that the algorithms under test: Accept the same inputs. Produce the same results. Make the same assuptions. Have the same limitations. Ie. If your going to fairly and accurately compare different algorithms, then you agree these criteria, or variancies from it before you published huge, dramatic sets of statistics. And when you do, you note those assumptions and variencies. I brought this to your attention well in advance of your act of "publish and be damned", and you chose to ignore it. So, you got damned. And given that we have been here before, I have no intention of withdrawing one syllable of that condemnation, even if it does go against my stated "don't be rude" philosophy. It is clear from the record of those posts, that your's was a knowing, deliberate act, and I called it as such. Summation So, let me sum up. You create an Inline-C implementation of a Trie, use it as a faster replacement for the standard hash in ysth's implementation of his brilliant, innovative algorithm, which itself relies on my xor technique to determine the fuzz factor. What part did your original algorithm play in this? Zero! Nada! Not - a - jot!. You then wrap it up in a crufty, over-engineered test harness that serves to guarantee that only idiot's like me are ever going to download, install and run it; to produce a big table of impressive looking statistics comparing this 2-week, 3-man extravagaza, with a peice of code knocked up on the spot, in under an hour, and use it "claim victory"? However, your Trie implementation--as used for those tests--is so domain specific in it's contruction, that it is severly limited in it's application. Any single change--an extra character in the alphabet (ACGTX); or lowercase letters; a fuzz factor higher than 4 in combination with any number of pathalogical testcases--causes it to variously: produce no output; or produce reams of errors and no matches; or exhaust all available memory before ever producing a single match. And the kicker. Your code does not yet produce the required information. As tested and lauded, your code outputs: The position in the sequence where the match was found. The the fuzz factor (as determined via the xor method). And the key-length sequence of characters from that position. But what is missing? The value or index position of the key that was matched. So now you have a file full of sub-sequences that match a key--which if you'd bothered to record the sequence number, as well as the offset, would be redundant anyway--all of which matched one of the (100,000 or 500,000 or 1,000,000) keys in the keys file. But which one? Sure, you can post process the two together to determine the information, but that's damn nearly as big a job as the original problem. And what will you use to fuzzy-match those fuzzy-matched sub-sequences back to the keys they matched? The xor method of course! Yes. If used within the very specific domain of inputs, your combination of ysth's multi-indexing, a fast Inline-C Trie implementation, and the xor method does result in a very fast solution to the original problem. But maybe your module should be named: `Fuzzy(not greater than 4)::Matcher(if you don't need mixed case)::DFA(+multi-index+xor)::charset(ACGT).pm` There is nothing wrong with cross-fertilisation. Indeed, I congratulated ysth on his innovation the moment I understood how clever it is. I'm in the processes of nicking a few ideas myself. It probably still won't (quite) match your performance on sparse datasets, but it will continue to handle the full range of characters, including unicode; continue to handle sets of variable length keys; continue to handle any degree of fuzz at no extra cost; continue to have constant (and low) memory consumption; and continue to produce all the required information at the first pass. But for you to post that table of meaningless drivel under the heading "Algorithm shootout", when 90% of it isn't even your algorithm and bears almost no relationship to your original "proof-of-concept" code; compare it against my off-the-cuff implementation--which still works!; Eggs with bananas! Gah! Words finally fail me. Nothing I want to add at this point is publishable. Addendum: And my detailed analysis of both your original algorithm, and this implementation of an entirely different one, has shown me nothing that would cause me to recind any of the fundamental points of my original analysis on the applicability of Tries this class of problem. Some of the projected numbers change, sometimes by an order of magnitude or two, but the underlying basic difficiencies remain. Without using ysth's innovation to allow you to avoid a brute force linear search of every sub-sequences, using a Trie would be a linear O(nm) process. And without the use of the xor algorithm to determine the fuzz factor, you would need to be storing this information in the Trie. But for the same reasons as you cannot related the fuzzy-matched subsequence back to the key that it matched, you also cannot determine the fuzz-factor without the xor method. This is as I identified, the simple reality that Tries work by uniquifying their keys. The process of producing all the fuzzy match possibility for each of the 100,000 keys will produce many duplicates and these will all be folded down to a single path in the Trie. Yes, you can associate a value with that path, but which value? An array of input-key/fuzz-factor-that-produced-this-path pairs? But then how do you determine which of those pairings was responsible for this match? And what will attaching all that extra information in the Trie do for your memory consumption? Boom after 50 keys like your "proof-of-concept"! Examine what is said, not who speaks. The end of an era! "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen "Think for yourself!" - Abigail "Time is a poor substitute for thought"--theorbtwo "Efficiency is intelligent laziness." -David Dunham "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply] [d/l] [select]
Re^2: Algorithm Showdown: (A litany of failures) by BrowserUk (Patriarch) on Dec 11, 2004 at 08:54 UTC
Here are the test scripts used in the parent post exercises. 408636-4.pl Read more... (2 kB) demerphq.pl Read more... (1326 Bytes) ysth.pl Read more... (2 kB) A simple script for automating mass testing of the above scripts, plus any other candidates over any number of tests. Not used for the exercises above, because it was written, while the above tests were running, as an antithisis of this and this, to show how simple it is to follow the K.I.S.S principle. It's trivially incomplete, but the essence of it is there and works. showdown2.pl Read more... (2 kB) Examine what is said, not who speaks. The end of an era! "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen "Think for yourself!" - Abigail "Time is a poor substitute for thought"--theorbtwo "Efficiency is intelligent laziness." -David Dunham "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply] [d/l] [select]
A reply falls below the community's threshold of quality. You may see it by logging in.
A reply falls below the community's threshold of quality. You may see it by logging in.

Failure 1 - a really simple test.

Failure 2 -- use a slightly different alphabet.

Failure 3 -- verify the number of matches found is correct.

Failure 4 -- mixed length fragments

The ethics of comparative studies

Summation