in reply to Re^6: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really, really!)
in thread Fuzzy Searching: Optimizing Algorithm Selection
Okay. Let's assume for the moment that we've successfully pared down the memory consumption of the Trie to 1/20th of the prototype's usage. That means we can now build a Trie from 1000 keys instead of 50. For the 100,000 keys, you need 100 passes over the 30,000 sequences.
The performance of the Trie comes from trading the time consumed building it against the speed it lends to scanning the sequences, and amortising that build time across all the sequences scanned.
But now we have to build 100 Tries, raising that build overhead by 2 orders of magnitude. On top of that, we now have to scan all the sequences 100 times, multiplying the scan time by 2 orders of magnitude as well.
If you take a look at the numbers (in the program output), you'll see that building a Trie with a single 25-char word takes around 8.5 seconds; and for 50 x 25-char words, 491 seconds (9.82 secs/word). Based on those numbers*, you can expect building a Trie of 1000 x 25-char words to take at least 2 hrs 21 minutes (8.5 secs * 1000). Building 100 of those will take at least 10 days--and likely much longer, since the per-word cost evidently grows with the size of the Trie. And that's before we start scanning the 30,000 sequences 100 times each.
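Those projections are just straight-line arithmetic from the measured timings; a quick sanity check (same numbers as above):

```perl
use strict;
use warnings;

# Straight-line extrapolation from the measured timings above.
my $secs_per_word  = 8.5;     # build time for a 1-word Trie
my $words_per_trie = 1000;    # keys per Trie after the 20x memory saving
my $tries          = 100;     # 100,000 keys / 1000 keys per Trie

my $one_trie_secs = $secs_per_word * $words_per_trie;    # 8500 s ~ 2 h 21 m
my $total_days    = $one_trie_secs * $tries / 86_400;    # ~9.8 days

printf "one Trie: %d secs; %d Tries: %.1f days\n",
    $one_trie_secs, $tries, $total_days;
```

And that's the optimistic, linear estimate; the measured jump from 8.5 to 9.82 secs/word between 1 and 50 keys says the real figure is worse.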
*It is possible that this build-time might be reduced by moving to a C implementation. It's also possible that it would not. Most of the time taken building the Trie comes not from pointer chasing, or anything else that Perl is significantly slower than C at. Most of it stems from memory allocation and management, and Perl is actually very good at this. Indeed, the fastest Perl builds are those that use Perl's built-in malloc/free routines in preference to those in your local C library. Perl does this well--and reliably. Any savings that might come from implementing the memory management in C would come at the expense of reliability, and at considerable development cost. Doing memory management in C is hard. Perl's routines have had many years of evolution; it would be the height of arrogance to think that you are going to do better on your 1st, or 2nd, or even 5th attempt.
"identifying the particular strings that match at a given offset, if this is indeed necessary, is much easier and can be a separate step."
Now you're faced with the task of comparing each of the matching substrings against each of the 100,000 keys, and then determining whether it is a 0-, 1-, or 2-mismatch for each of them.
And how will you make the fuzzy-level determination? Fall back to the Xor method?
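For reference, the Xor method is the bitwise string trick: XORing two equal-length strings leaves "\0" wherever the characters agree, so counting the non-NUL bytes counts the mismatches. A minimal sketch (the sub name is mine):

```perl
use strict;
use warnings;

# Count mismatched positions between two equal-length strings.
# Identical characters XOR to "\0"; tr/\0//c counts the bytes
# that are *not* NUL, i.e. the positions that differ.
sub mismatches {
    my( $x, $y ) = @_;
    return ( $x ^ $y ) =~ tr/\0//c;
}

print mismatches( 'ACGTA', 'ACCTT' ), "\n";    # 2
```

One string op and one tr/// per comparison, all done in C inside the interpreter--which is why it beats character-by-character comparison in Perl by such a margin.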
Whether it is important to know what level of mismatch each of the matched substrings matched with is an open question that only the OP can answer. I think it likely is, but I'm guessing.
However, knowing which of the 100,000 keys the substring matched against is important. As I said before, I'm not a Biochemist or a Genetic Engineer, but I find it extremely unlikely that knowing only that the 25 base-pairs at offset:3129 of sequence:29,001 matched (possibly exactly; possibly with one mismatch; possibly with two mismatches) against some one of these 100,000 x 25-base-pair 'bits' will be of any great use in solving the original problem.
This is the same conclusion I reached when I originally looked at using Tries for this problem in Re^3: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.). To quote myself:
MegaTrie
Now it is possible to combine the keys generated from more than one exact key into a single Trie. At least in theory, this would allow the 100,000 exact keys plus all their fuzzy matches to be combined into one MegaTrie. This approximates a DFA, solving all 100,000 * 2700 (2-miss) "Does it match?" questions in one go for each 25-char substring, and reducing the problem to 976 * 30,000 = 29,280,000 lookups (here defined as a transition from Perl into C and back). When compared with the 954,528,000,000,000 operations otherwise required by the 2-miss scenario, this sounds very inviting.
The problem is that all you then know is that one of the 100,000 exact matches, or one of their 270,000,000 possible fuzzy matches, did or did not match each of the 29,280,000 substrings. So you get to know something (that the 25-char substring at position P of sequence S matched something)--very quickly--but not what (which of the 100,000), nor why (exact, 1-miss, or 2-miss).
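The counts quoted above follow directly from the problem sizes (the 976 substrings per sequence assumes 1000-char sequences, as in the prototype):

```perl
use strict;
use warnings;

my $keys          = 100_000;
my $fuzzy_per_key = 2_700;    # C(25,2) * 3^2 two-miss variants per 25-char key
my $subs_per_seq  = 976;      # 1000 - 25 + 1 windows per 1000-char sequence
my $sequences     = 30_000;

my $fuzzy_variants   = $keys * $fuzzy_per_key;      # 270,000,000 folded into the MegaTrie
my $megatrie_lookups = $subs_per_seq * $sequences;  # 29,280,000 lookups total

printf "fuzzy variants: %d; lookups: %d\n", $fuzzy_variants, $megatrie_lookups;
```

Which is exactly the trade described: a tiny number of lookups, but each lookup collapses 270 million possible matches into a single yes/no.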