A while back someone asked about the best way to do "Fuzzy Matching" on a bulk scale. He wanted to find every possible "fuzzy match" of a set of fixed width strings of a restricted alphabet (ACGT) in a set of DNA data. By "fuzzy match" he meant a match where up to a certain number of chars in the matched string could be different. This node discusses what is involved setting up a reasonably rigourous comparison of the different solutions at hand and shows some comparison results of four solutions currently known to me. In replies we can cover each solution that comes up.
Now first off in order to even begin to compare a set of solutions to this problem we will need to define a common interface. We will use perl OO to make this as painless as possible. We will do this by defining a base class that represents the basic behaviour of a "fuzzy matcher". This class has the following methods:
And with that simple definition (and even more useful, the code it represents) any different implementations that people come up with may be easily cross tested against each other. Isnt OO fun :-)
Next we need a tool to create tests. We will want to test with a variety of different combinations of input strings and search strings so we will pull out Getopt::Long and whip together "make_test.pl" that will produce a single file that represents one set of such tests. Naturally once we have a bunch of test files we need something to do with them. For that we put together "test_fuzz.pl". We want to do testing in a pristine memory condition so we use some tricks to allow test_fuzz.pl to spawn copies of itself for individual tests and then use the outer instance to aggregate the results. And naturally we will want some pretty tables of the results. :-)
We will want to control our testing in a few ways. One of the more important being how long we will allow tests to run for. If we are doing a large batch and we know that a given solution is intolerably slow at a given scale, then tests for a larger scale will be likewise. Thus we will want to provide a way to skip tests that take too long. The --limit parameter allows us to do this.
Once we put all of this together we can produce our comparisons. We generate a bunch of test files of different types and then let test_fuzz.pl rip.
Heres a table of the data (slightly munged for html purposes) we get back. The 'N/A' are because the test too longer than 100 seconds the previous try. (It resets when the wordcount changes.) "SC" stands for string count, "SL" for string length, "F" for Fuzz, "WC" for word count and "WL" for word length.
| Test File | DFA | Ysth | Xor2 | Xor |
|---|---|---|---|---|
| FT-SC0100-SL0001000-F02-WC0000001-WL025 | 0.0181 | 2.3485 | 0.0322 | 0.3562 |
| FT-SC0100-SL0001000-F02-WC0000005-WL025 | 0.0225 | 2.3035 | 0.1211 | 0.9044 |
| FT-SC0100-SL0001000-F02-WC0000010-WL025 | 0.0321 | 2.3261 | 0.2310 | 1.5358 |
| FT-SC0100-SL0001000-F02-WC0000050-WL025 | 0.0918 | 2.3370 | 1.1244 | 6.7897 |
| FT-SC0100-SL0001000-F02-WC0000100-WL025 | 0.1540 | 2.3811 | 2.3071 | 13.5992 |
| FT-SC0100-SL0001000-F02-WC0000500-WL025 | 0.6198 | 2.4711 | 11.2617 | 64.8216 |
| FT-SC0100-SL0001000-F02-WC0001000-WL025 | 1.1031 | 2.6095 | 23.1594 | 133.1065 |
| FT-SC0100-SL0001000-F02-WC0005000-WL025 | 4.0511 | 2.8281 | 113.3197 | N/A |
| FT-SC0100-SL0001000-F02-WC0010000-WL025 | 6.5496 | 3.1301 | N/A | N/A |
| FT-SC0100-SL0001000-F02-WC0050000-WL025 | 15.7548 | 5.1598 | N/A | N/A |
| FT-SC0100-SL0001000-F02-WC0100000-WL025 | 20.7235 | 7.5497 | N/A | N/A |
| FT-SC0100-SL0001000-F02-WC0500000-WL025 | 67.3387 | 25.3479 | N/A | N/A |
| FT-SC0100-SL0001000-F02-WC1000000-WL025 | 149.0815 | 48.0690 | N/A | N/A |
| FT-SC0100-SL0010000-F02-WC0000001-WL025 | 0.1392 | 26.7417 | 0.2729 | 3.7779 |
| FT-SC0100-SL0010000-F02-WC0000005-WL025 | 0.1883 | 26.8920 | 1.0554 | 9.2859 |
| FT-SC0100-SL0010000-F02-WC0000010-WL025 | 0.3048 | 28.7853 | 2.0361 | 16.3141 |
| FT-SC0100-SL0010000-F02-WC0000050-WL025 | 0.2431 | 27.1403 | 9.7107 | 69.3049 |
| FT-SC0100-SL0010000-F02-WC0000100-WL025 | 0.3048 | 27.4148 | 19.4087 | 134.0691 |
| FT-SC0100-SL0010000-F02-WC0000500-WL025 | 0.7760 | 28.6271 | 96.4305 | N/A |
| FT-SC0100-SL0010000-F02-WC0001000-WL025 | 1.2827 | 29.1948 | 192.0604 | N/A |
| FT-SC0100-SL0010000-F02-WC0005000-WL025 | 4.4403 | 30.6293 | N/A | N/A |
| FT-SC0100-SL0010000-F02-WC0010000-WL025 | 7.2494 | 31.9665 | N/A | N/A |
| FT-SC0100-SL0010000-F02-WC0050000-WL025 | 18.0771 | 41.9295 | N/A | N/A |
| FT-SC0100-SL0010000-F02-WC0100000-WL025 | 25.1417 | 47.7157 | N/A | N/A |
| FT-SC0100-SL0010000-F02-WC0500000-WL025 | 155.8075 | N/A | N/A | N/A |
| FT-SC0100-SL0010000-F02-WC1000000-WL025 | N/A | N/A | N/A | N/A |
| FT-SC0100-SL0100000-F02-WC0000001-WL025 | 1.0911 | 536.8813 | 2.9989 | 37.2198 |
| FT-SC0100-SL0100000-F02-WC0000005-WL025 | 1.0271 | N/A | 11.4269 | 91.2324 |
| FT-SC0100-SL0100000-F02-WC0000010-WL025 | 1.0351 | N/A | 21.8465 | 161.7554 |
| FT-SC0100-SL0100000-F02-WC0000050-WL025 | 1.4329 | N/A | 105.2142 | N/A |
| FT-SC0100-SL0100000-F02-WC0000100-WL025 | 1.5358 | N/A | N/A | N/A |
| FT-SC0100-SL0100000-F02-WC0000500-WL025 | 2.0794 | N/A | N/A | N/A |
| FT-SC0100-SL0100000-F02-WC0001000-WL025 | 2.7480 | N/A | N/A | N/A |
| FT-SC0100-SL0100000-F02-WC0005000-WL025 | 8.1556 | N/A | N/A | N/A |
| FT-SC0100-SL0100000-F02-WC0010000-WL025 | 13.3646 | N/A | N/A | N/A |
| FT-SC0100-SL0100000-F02-WC0050000-WL025 | 39.0437 | N/A | N/A | N/A |
| FT-SC0100-SL0100000-F02-WC0100000-WL025 | 69.8990 | N/A | N/A | N/A |
| FT-SC0100-SL0100000-F02-WC0500000-WL025 | 1050.3425 | N/A | N/A | N/A |
| FT-SC0100-SL0100000-F02-WC1000000-WL025 | N/A | N/A | N/A | N/A |
All the files involved will be attached via replies to keep the root a reasonable size. Discussion of the various approaches can follow those nodes.
Cheers
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Algorithm Showdown: Fuzzy Matching
by BrowserUk (Patriarch) on Dec 10, 2004 at 00:44 UTC | |
by demerphq (Chancellor) on Dec 10, 2004 at 09:15 UTC | |
| |
|
Re: Algorithm Showdown: Fuzzy Matching (Files)
by demerphq (Chancellor) on Dec 09, 2004 at 22:09 UTC | |
by demerphq (Chancellor) on Dec 09, 2004 at 22:12 UTC | |
by demerphq (Chancellor) on Dec 09, 2004 at 22:17 UTC | |
by demerphq (Chancellor) on Dec 09, 2004 at 22:14 UTC | |
by demerphq (Chancellor) on Dec 09, 2004 at 22:28 UTC | |
by demerphq (Chancellor) on Dec 11, 2004 at 11:36 UTC | |
by demerphq (Chancellor) on Dec 09, 2004 at 22:20 UTC | |
by demerphq (Chancellor) on Dec 09, 2004 at 22:31 UTC | |
by demerphq (Chancellor) on Dec 09, 2004 at 22:29 UTC | |
|
Re: Algorithm Showdown: (A litany of failures)
by BrowserUk (Patriarch) on Dec 11, 2004 at 07:40 UTC | |
by BrowserUk (Patriarch) on Dec 11, 2004 at 08:54 UTC | |
| |
|