in reply to Re^2: Fuzzy Searching: Optimizing Algorithm Selection
in thread Fuzzy Searching: Optimizing Algorithm Selection
For 25-char (C) keys from a 4-char alphabet (A) with 0, 1 or 2 mismatches (M) that's:

  0-miss:                                        =    1
  1-miss: (CcM) * (A-1)^M = 25c1 * (4-1)^1 =  25 * 3 =   75
  2-miss: (CcM) * (A-1)^M = 25c2 * (4-1)^2 = 300 * 9 = 2700

Total 2776 keys. No problem building or with the performance of a Trie with 2776 keys.
Although the implementations currently on CPAN chew memory at a prodigious rate--a C implementation would be much less memory hungry, as well as faster.
To simplify the description of the problems, I'll start with using a 4-character key from an alphabet of ACGT.
Further, we'll take the OP's description and look for matches with 0, 1 or 2 mismatched characters.
For 4-char (C) keys from a 4-char alphabet (A) with 0, 1 or 2 mismatches (M) that's:

  0-miss:                                  =  1
  1-miss: (CcM) * (A-1)^M = 4c1 * (4-1)^1 = 4 * 3 = 12
  2-miss: (CcM) * (A-1)^M = 4c2 * (4-1)^2 = 6 * 9 = 54

Total 67 keys.
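As a quick sanity check on both of those totals, a throwaway sketch (the nCr and fuzzyKeyCount sub names are purely illustrative):

use strict;
use warnings;

# n-choose-r, built up multiplicatively.
sub nCr {
    my( $n, $r ) = @_;
    my $result = 1;
    $result *= ( $n - $_ ) / ( $_ + 1 ) for 0 .. $r - 1;
    return $result;
}

# Total keys = sum over M of (CcM) * (A-1)^M
sub fuzzyKeyCount {
    my( $keyLen, $alphaSize, $maxMiss ) = @_;
    my $total = 0;
    $total += nCr( $keyLen, $_ ) * ( $alphaSize - 1 ) ** $_ for 0 .. $maxMiss;
    return $total;
}

print fuzzyKeyCount(  4, 4, 2 ), "\n";    # 67
print fuzzyKeyCount( 25, 4, 2 ), "\n";    # 2776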
Picking one of the 256 possible keys, I'll use 'ACGT'.
The keys we need to put into our trie in order to fuzzy match 'ACGT' are:
## 0-miss (exact) match: 1 unique
ACGT

## 1-miss match: 4*4 = 16 - 4 duplicates of above = 12 unique
## With tags (°=exact; ¹=1-miss match; ²=2-miss match) showing where
## a key resulted from multiple sources.
v        v        v        v       ## The vs indicate the columns varying.
-------------------------------
ACGT°²  AAGT²   ACAT²   ACGA²
CCGT²   ACGT°²  ACCT²   ACGC²
GCGT²   AGGT²   ACGT°²  ACGG²
TCGT²   ATGT²   ACTT²   ACGT°²

## 2-miss match: 6*4*4 = 96 -
##   ( 6 duplicates of 0-miss + ( 3 each duplicates * 12 1-miss ) )
## With tags (°=exact; ¹=1-miss match; ²=2-miss match) showing where
## a key resulted from multiple sources.
## = 54 unique.
vv      v v     v  v     vv      v v      vv
--------------------------------------------
AAGT¹   ACAT¹   ACGA¹   AAAT    AAGA    ACAA
CAGT    CCAT    CCGA    ACAT¹²  ACGA¹²  ACCA
GAGT    GCAT    GCGA    AGAT    AGGA    ACGA¹²
TAGT    TCAT    TCGA    ATAT    ATGA    ACTA
ACGT°   ACCT¹   ACGC¹   AACT    AAGC    ACAC
CCGT¹   CCCT    CCGC    ACCT¹²  ACGC¹²  ACCC
GCGT¹   GCCT    GCGC    AGCT    AGGC    ACGC¹²
TCGT¹   TCCT    TCGC    ATCT    ATGC    ACTC
AGGT¹   ACGT°²  ACGG¹   AAGT¹²  AAGG    ACAG
CGGT    CCGT¹²  CCGG    ACGT°²  ACGG¹²  ACCG
GGGT    GCGT¹²  GCGG    AGGT¹²  AGGG    ACGG¹²
TGGT    TCGT¹²  TCGG    ATGT¹   ATGG    ACTG
ATGT¹²  ACTT¹   ACGT°²  AATT    AAGT¹²  ACAT¹²
CTGT    CCTT    CCGT¹²  ACTT¹²  ACGT°²  ACCT¹²
GTGT    GCTT    GCGT¹²  AGTT    AGGT¹²  ACGT°²
TTGT    TCTT    TCGT¹²  ATTT    ATGT¹²  ACTT¹²

## Giving us our 67 unique keys.
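A brute-force generator for one exact key might look something like this (a rough sketch; a hash stands in for the Trie and collapses the duplicates tagged above):

use strict;
use warnings;

my @alpha = qw( A C G T );
my $key   = 'ACGT';
my %keys  = ( $key => 1 );                      # 0-miss (exact)

# 1-miss: vary each position in turn.
for my $i ( 0 .. length( $key ) - 1 ) {
    for my $c (@alpha) {
        my $k = $key;
        substr( $k, $i, 1 ) = $c;
        $keys{ $k }++;
    }
}

# 2-miss: vary every pair of positions.
for my $i ( 0 .. length( $key ) - 2 ) {
    for my $j ( $i + 1 .. length( $key ) - 1 ) {
        for my $c1 (@alpha) {
            for my $c2 (@alpha) {
                my $k = $key;
                substr( $k, $i, 1 ) = $c1;
                substr( $k, $j, 1 ) = $c2;
                $keys{ $k }++;
            }
        }
    }
}

print scalar keys %keys, "\n";                  # 67 for 'ACGT'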
Now it's no problem to use a more intelligent generation routine to avoid generating most of the duplicates.
Indeed, noticing that the brute-force, 2-miss generation process produces all of the keys required by the 0- and 1-miss cases, it would be a simple process to skip those generation steps and simply de-dup the 2-miss set.
## 2-miss match generation with duplicates removed
## With tags (°=exact, ¹=1-miss match) showing where
## a key resulted from multiple sources.
vv      v v     v  v     vv      v v      vv
AAGT¹   ACAT¹   ACGA¹   AAAT    AAGA    ACAA
CAGT    CCAT    CCGA                    ACCA
GAGT    GCAT    GCGA    AGAT    AGGA
TAGT    TCAT    TCGA    ATAT    ATGA    ACTA
ACGT°   ACCT¹   ACGC¹   AACT    AAGC    ACAC
CCGT¹   CCCT    CCGC                    ACCC
GCGT¹   GCCT    GCGC    AGCT    AGGC
TCGT¹   TCCT    TCGC    ATCT    ATGC    ACTC
AGGT¹           ACGG¹           AAGG    ACAG
CGGT            CCGG                    ACCG
GGGT            GCGG            AGGG
TGGT            TCGG    ATGT¹   ATGG    ACTG
        ACTT¹           AATT
CTGT    CCTT
GTGT    GCTT            AGTT
TTGT    TCTT            ATTT
In fact, provided your Trie implementation doesn't die or bellyache when you attempt to add a duplicate key, it will automatically perform the de-duping process for you.
Once you start storing multiple sets of overlapping data in your Trie in order to make best use of its performance, when you get a match you can no longer distinguish whether that match occurred because it was an exact match, a 1-miss match or a 2-miss match.
So, even if you generate one Trie per key and apply them to each of the 976 25-char substrings in a 1000-char sequence one at a time (which is necessary using a Trie and has the beneficial side-effect of allowing you to know at what position within the sequence the match occurred--which is a requirement), you still don't know why it matched.
Even if you arrange for the value associated with the key to be an array of tags associated with the sources of the key, that will only tell you the possible sources of the match, not the source.
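For illustration, a sketch of what that amounts to (a hash of tags standing in for the array, just to keep them unique; the key set and tag strings are made up): any fuzzy key reachable from more than one exact key carries more than one tag, which is exactly why a hit only names the candidates.

use strict;
use warnings;

my @alpha = qw( A C G T );
my %sources;                 # fuzzy-key => { 'exact-key (N-miss)' => 1, ... }

# Just the 0- and 1-miss variants of two exact keys, to keep it small.
for my $exact (qw( ACGT AAGT )) {
    for my $i ( 0 .. length( $exact ) - 1 ) {
        for my $c (@alpha) {
            my $k = $exact;
            substr( $k, $i, 1 ) = $c;
            my $miss = $k eq $exact ? 0 : 1;
            $sources{ $k }{ "$exact ($miss-miss)" } = 1;
        }
    }
}

# E.g. 'AGGT' is a 1-miss variant of both 'ACGT' and 'AAGT', so a match
# on it cannot tell you which exact key was "really" found.
print "$_ => ", join( ', ', sort keys %{ $sources{ $_ } } ), "\n"
    for grep { keys %{ $sources{ $_ } } > 1 } sort keys %sources;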
So, whilst Tries are very good at doing fast lookups, and can be used for doing fuzzy lookups, the problem is that, like regex, they don't tell you what they matched.
With a regex, you can use the capture mechanism to discover this. Whilst this will slow the process somewhat, it is more than compensated for by the fact that the regex engine can be allowed to run through the entire string, locate every match and record where it found them (@- & @+), so the number of calls to the regex engine is one per sequence per fuzzy match.
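A sketch of that approach (the alternation here is hand-written; in practice it would be built by joining the generated fuzzy keys with '|'):

use strict;
use warnings;

my $seq   = 'TTACGTGGAAGTACATTT';
my $fuzzy = qr/ACGT|AAGT|ACAT/;     # would be: join '|', @generated_keys

# One pass of the regex engine per sequence per fuzzy set, using @- to
# record where each match started.
while ( $seq =~ /($fuzzy)/g ) {
    printf "matched '%s' at offset %d\n", $1, $-[0];
    pos( $seq ) = $-[0] + 1;        # step back one so overlapping hits aren't missed
}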
Now it is possible to combine the keys generated from more than one exact key into a single Trie. At least in theory, this would allow the 100,000 exact keys plus all their fuzzy matches to be combined into one MegaTrie. This approximates a DFA, by answering all 100,000 * 2700 (2-miss) "Does it match?" questions in one go for each 25-char substring, reducing the problem to 976 * 30,000 = 29,280,000 lookups (here defined as a transition from Perl into C and back). When compared with the 954,528,000,000,000 otherwise required by the 2-miss scenario, this sounds very inviting.
The problem is that all you then know is that one of the 100,000 exact matches, or one of their 270,000,000 possible fuzzy matches, did or did not match each of the 29,280,000 substrings. So, you get to know something (that the 25-char substring at position P of sequence S matched something)--very quickly--but not what (which of the 100,000), nor why (exact, 1-miss or 2-miss).
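Because the keys are all fixed-length (25 chars in the real problem), the MegaTrie lookup effectively degenerates to one set-membership test per substring; a hash makes a convenient stand-in for a sketch (the key set below is made up):

use strict;
use warnings;

# The "MegaTrie" reduced to a single exists() test per substring. It tells
# you *something* matched at $pos, but not which exact key it came from,
# nor whether it was an exact, 1-miss or 2-miss match.
my $KEYLEN = 4;                                           # 25 in the real problem
my %mega   = map { $_ => 1 } qw( ACGT AAGT ACAT TTGT );   # union of all fuzzy keys

my $seq = 'GGACGTTCAAGTAG';
for my $pos ( 0 .. length( $seq ) - $KEYLEN ) {
    print "something matched at offset $pos\n"
        if exists $mega{ substr( $seq, $pos, $KEYLEN ) };
}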
At this point, I thought that maybe the MegaTrie idea could be used as a way of pre-eliminating some of the nearly 30 million substrings, prior to using (say) the XOR method to derive the required information. However, because the (Mega)Trie needs to be applied individually to each substring--unlike a regex, for example--and a Trie lookup involves either recursive or iterative sub-calls, it is slower than just using the XOR alone.
That means that there is no saving to be had using a Trie--even for this limited purpose.
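For comparison, the essence of the XOR method referred to above (a sketch; in the real problem this sits inside the loop over every sequence): xor-ing two equal-length strings leaves "\0" wherever the characters agree, so counting the non-NUL bytes counts the mismatches directly--and, unlike the Trie, it tells you which key matched and with how many misses.

use strict;
use warnings;

my $KEYLEN = 4;                    # 25 in the real problem
my $key    = 'ACGT';
my $seq    = 'GGACCTTCAAGTAG';

for my $pos ( 0 .. length( $seq ) - $KEYLEN ) {
    my $xor  = $key ^ substr( $seq, $pos, $KEYLEN );
    my $miss = $xor =~ tr/\0//c;   # count the positions that differ
    print "offset $pos: $miss-miss match of '$key'\n" if $miss <= 2;
}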
The ultimate expression of the Trie idea would be to write a DFA. Essentially this consists of a MegaTrie-like data structure and a loop to apply it successively down the length of a string, substring-by-substring, reporting the position(s) of the match(es) found. Much like the regex engine does, but failing early and moving on rather than backtracking*.
*It's worth noting that you can also use the 'cut' operator (?>...) to eliminate backtracking in the regex engine.
This would reduce the number of lookups (again defined as a transition from Perl into C and back) to 30,000. You still wouldn't know what matched, but the speed-up would offset the cost of the further elimination required.
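To give a feel for it, a hand-waving sketch of such a scanning loop (in Perl rather than C, with a hash-of-hashes standing in for the generated state table, and a tiny made-up key set):

use strict;
use warnings;

my @keys = qw( ACGT AAGT ACAT TTGT );    # stand-in for the generated key set

# Build the state table: a trie as a hash-of-hashes, '*' marking terminal states.
my %table;
for my $key (@keys) {
    my $node = \%table;
    $node = $node->{ $_ } //= {} for split //, $key;
    $node->{'*'} = $key;
}

# Advance all live paths through the table together, one character at a
# time, dropping a path the moment it fails -- no backtracking, and each
# character of the sequence is read only once.
my $seq = 'GGACGTTCAAGTACATGG';
my @active;                              # [ current node, start offset ]

for my $pos ( 0 .. length( $seq ) - 1 ) {
    my $char = substr( $seq, $pos, 1 );
    push @active, [ \%table, $pos ];     # a new path starts at every offset
    @active = grep { defined( $_->[0] = $_->[0]{ $char } ) } @active;
    for my $path (@active) {
        my $hit = $path->[0]{'*'};
        print "'$hit' found at offset $path->[1]\n" if defined $hit;
    }
}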
The problems include:
This problem involves a huge number of states, and those states are going to vary from run to run. So the task is not writing a single DFA, but writing a DFA generator, which is an order of magnitude more complex to do.
Not only would it need to be a fairly efficiently coded DFA, it would also need to be highly optimised in its use of memory (notoriously difficult) so that the huge number of states would fit into memory concurrently.
The regex engine has had a lot of very skilled programmers working on it for a very long time. If the OP is up to this task, he is in the wrong job as a biochemist (assumption!).
The gulf between what is theoretically possible and what is practical is very wide.
Firstly, I'd done all the above work in order to verify my position anyway, and I thought someone might benefit from it.
Secondly, the next time someone calls you to question over something you have posted:
This is especially true of a private message between old sparring partners.
Even if it says "Utter bollocks", which is not rude, but a colloquialism for "Utter rubbish" or "Completely in error".
You might be in error.
If it was worth posting, it is always worth a second (cool) assessment before you defend your post to the hilt.
Even if they sound remarkably similar at first reading.
He may always have learnt from prior encounters and/or further learning in the interim.
Especially when those statistics are produced using code you cannot show, running on data that is completely unrelated to the problem at hand.
Especially when he has already published analysis, code, and verifiable statistics.
Especially when the code you are asking him to produce would have to run against a system that you cannot show him.
Especially when that system uses data that is completely unrelated to the problem under discussion.
Especially when the information that system produces is completely inadequate for solving the problem at hand.
There is nothing wrong with being human.
But, if your mistake could lead others into spending considerable time and effort on fruitless explorations, do make sure that you clearly identify it when you realise.
FWIW, most of these have been applicable to me in the past. I'm no saint (in the other sense of the word) as you well know.
Oh! And if you ask for clarifications of the problem, do give the other guy the chance to produce them before:
Thirdly, maybe a view through the "other guy's eyes" of yesterday's conflagration will persuade you to review your final act in our off-post communications?