in reply to Re^2: Fuzzy Searching: Optimizing Algorithm Selection
in thread Fuzzy Searching: Optimizing Algorithm Selection
For 25-char (C) keys from a 4-char alphabet (A) with 0, 1 or 2 mismatches (M) that's:

  0-miss:                                        =    1
  1-miss: (CcM) * (A-1)^M = 25c1 * (4-1)^1 =  25 * 3 =   75
  2-miss: (CcM) * (A-1)^M = 25c2 * (4-1)^2 = 300 * 9 = 2700

Total 2776 keys. No problem building or with the performance of a Trie with 2776 keys.
Although the implementations currently on CPAN chew memory at a prodigious rate--a C implementation would be much less memory hungry, as well as faster.
To simplify the description of the problems, I'll start with using a 4-character key from an alphabet of ACGT.
Further, we'll take the OP's description and look for matches with 0, 1 or 2 mismatched characters.
For 4-char (C) keys from a 4-char alphabet (A) with 0, 1 or 2 mismatches (M) that's:

  0-miss:                                  =  1
  1-miss: (CcM) * (A-1)^M = 4c1 * (4-1)^1 = 4 * 3 = 12
  2-miss: (CcM) * (A-1)^M = 4c2 * (4-1)^2 = 6 * 9 = 54

Total 67 keys.
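As a quick sanity check on both of those totals, a throwaway sketch (the nCr and fuzzyKeyCount sub names are purely illustrative):

use strict;
use warnings;

# n-choose-r, built up multiplicatively.
sub nCr {
    my( $n, $r ) = @_;
    my $result = 1;
    $result *= ( $n - $_ ) / ( $_ + 1 ) for 0 .. $r - 1;
    return $result;
}

# Total keys = sum over M of (CcM) * (A-1)^M
sub fuzzyKeyCount {
    my( $keyLen, $alphaSize, $maxMiss ) = @_;
    my $total = 0;
    $total += nCr( $keyLen, $_ ) * ( $alphaSize - 1 ) ** $_ for 0 .. $maxMiss;
    return $total;
}

print fuzzyKeyCount(  4, 4, 2 ), "\n";    # 67
print fuzzyKeyCount( 25, 4, 2 ), "\n";    # 2776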
Picking one of the 256 possible keys, I'll use 'ACGT'.
The keys we need to put into our trie in order to fuzzy match 'ACGT' are:
## 0-miss (exact) match: 1 unique
ACGT

## 1-miss match: 4*4 = 16 - 4 duplicates of above = 12 unique
## With tags (°=exact; ¹=1-miss match; ²=2-miss match) showing where
## a key resulted from multiple sources.
v        v        v        v       ## The vs indicate the columns varying.
-------------------------------
ACGT°²  AAGT²   ACAT²   ACGA²
CCGT²   ACGT°²  ACCT²   ACGC²
GCGT²   AGGT²   ACGT°²  ACGG²
TCGT²   ATGT²   ACTT²   ACGT°²

## 2-miss match: 6*4*4 = 96 -
##   ( 6 duplicates of 0-miss + ( 3 each duplicates * 12 1-miss ) )
## With tags (°=exact; ¹=1-miss match; ²=2-miss match) showing where
## a key resulted from multiple sources.
## = 54 unique.
vv      v v     v  v     vv      v v      vv
--------------------------------------------
AAGT¹   ACAT¹   ACGA¹   AAAT    AAGA    ACAA
CAGT    CCAT    CCGA    ACAT¹²  ACGA¹²  ACCA
GAGT    GCAT    GCGA    AGAT    AGGA    ACGA¹²
TAGT    TCAT    TCGA    ATAT    ATGA    ACTA
ACGT°   ACCT¹   ACGC¹   AACT    AAGC    ACAC
CCGT¹   CCCT    CCGC    ACCT¹²  ACGC¹²  ACCC
GCGT¹   GCCT    GCGC    AGCT    AGGC    ACGC¹²
TCGT¹   TCCT    TCGC    ATCT    ATGC    ACTC
AGGT¹   ACGT°²  ACGG¹   AAGT¹²  AAGG    ACAG
CGGT    CCGT¹²  CCGG    ACGT°²  ACGG¹²  ACCG
GGGT    GCGT¹²  GCGG    AGGT¹²  AGGG    ACGG¹²
TGGT    TCGT¹²  TCGG    ATGT¹   ATGG    ACTG
ATGT¹²  ACTT¹   ACGT°²  AATT    AAGT¹²  ACAT¹²
CTGT    CCTT    CCGT¹²  ACTT¹²  ACGT°²  ACCT¹²
GTGT    GCTT    GCGT¹²  AGTT    AGGT¹²  ACGT°²
TTGT    TCTT    TCGT¹²  ATTT    ATGT¹²  ACTT¹²

## Giving us our 67 unique keys.
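A brute-force generator for one exact key might look something like this (a rough sketch; a hash stands in for the Trie and collapses the duplicates tagged above):

use strict;
use warnings;

my @alpha = qw( A C G T );
my $key   = 'ACGT';
my %keys  = ( $key => 1 );                      # 0-miss (exact)

# 1-miss: vary each position in turn.
for my $i ( 0 .. length( $key ) - 1 ) {
    for my $c (@alpha) {
        my $k = $key;
        substr( $k, $i, 1 ) = $c;
        $keys{ $k }++;
    }
}

# 2-miss: vary every pair of positions.
for my $i ( 0 .. length( $key ) - 2 ) {
    for my $j ( $i + 1 .. length( $key ) - 1 ) {
        for my $c1 (@alpha) {
            for my $c2 (@alpha) {
                my $k = $key;
                substr( $k, $i, 1 ) = $c1;
                substr( $k, $j, 1 ) = $c2;
                $keys{ $k }++;
            }
        }
    }
}

print scalar keys %keys, "\n";                  # 67 for 'ACGT'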
Now it's no problem to use a more intelligent generation routine to avoid generating most of the duplicates.
Indeed, noticing that the brute-force, 2-miss generation process produces all of the keys required by the 0- and 1-miss cases, it would be a simple process to skip those generation steps and simply de-dup the 2-miss set.
## 2-miss match generation with duplicates removed
## With tags (°=exact, ¹=1-miss match) showing where
## a key resulted from multiple sources.
vv      v v     v  v     vv      v v      vv
AAGT¹   ACAT¹   ACGA¹   AAAT    AAGA    ACAA
CAGT    CCAT    CCGA                    ACCA
GAGT    GCAT    GCGA    AGAT    AGGA
TAGT    TCAT    TCGA    ATAT    ATGA    ACTA
ACGT°   ACCT¹   ACGC¹   AACT    AAGC    ACAC
CCGT¹   CCCT    CCGC                    ACCC
GCGT¹   GCCT    GCGC    AGCT    AGGC
TCGT¹   TCCT    TCGC    ATCT    ATGC    ACTC
AGGT¹           ACGG¹           AAGG    ACAG
CGGT            CCGG                    ACCG
GGGT            GCGG            AGGG
TGGT            TCGG    ATGT¹   ATGG    ACTG
        ACTT¹           AATT
CTGT    CCTT
GTGT    GCTT            AGTT
TTGT    TCTT            ATTT
In fact, provided your Trie implementation doesn't die or bellyache when you attempt to add a duplicate key, it will automatically perform the de-duping process for you.
Once you start storing multiple sets of overlapping data in your Trie in order to make best use of its performance, when you get a match you can no longer distinguish whether that match occurred because it was an exact match, a 1-miss match or a 2-miss match.
So, even if you generate one Trie per key and apply them to each of the 976 25-char substrings in a 1000-char sequence one at a time (which is necessary using a Trie and has the beneficial side-effect of allowing you to know at what position within the sequence the match occurred--which is a requirement), you still don't know why it matched.
Even if you arrange for the value associated with the key to be an array of tags associated with the sources of the key, that will only tell you the possible sources of the match, not the source.
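For illustration, a sketch of what that amounts to (a hash of tags standing in for the array, just to keep them unique; the key set and tag strings are made up): any fuzzy key reachable from more than one exact key carries more than one tag, which is exactly why a hit only names the candidates.

use strict;
use warnings;

my @alpha = qw( A C G T );
my %sources;                 # fuzzy-key => { 'exact-key (N-miss)' => 1, ... }

# Just the 0- and 1-miss variants of two exact keys, to keep it small.
for my $exact (qw( ACGT AAGT )) {
    for my $i ( 0 .. length( $exact ) - 1 ) {
        for my $c (@alpha) {
            my $k = $exact;
            substr( $k, $i, 1 ) = $c;
            my $miss = $k eq $exact ? 0 : 1;
            $sources{ $k }{ "$exact ($miss-miss)" } = 1;
        }
    }
}

# E.g. 'AGGT' is a 1-miss variant of both 'ACGT' and 'AAGT', so a match
# on it cannot tell you which exact key was "really" found.
print "$_ => ", join( ', ', sort keys %{ $sources{ $_ } } ), "\n"
    for grep { keys %{ $sources{ $_ } } > 1 } sort keys %sources;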
So, whilst Tries are very good at doing fast lookups, and can be used for doing fuzzy lookups, the problem is that, like regex, they don't tell you what they matched.
With a regex, you can use the capture mechanism to discover this. Whilst this will slow the process somewhat, it is more than compensated for by the fact that the regex engine can be allowed to run through the entire string, locate every match and record where it found them (@- & @+), so the number of calls to the regex engine is one per sequence per fuzzy match.
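A sketch of that approach (the alternation here is hand-written; in practice it would be built by joining the generated fuzzy keys with '|'):

use strict;
use warnings;

my $seq   = 'TTACGTGGAAGTACATTT';
my $fuzzy = qr/ACGT|AAGT|ACAT/;     # would be: join '|', @generated_keys

# One pass of the regex engine per sequence per fuzzy set, using @- to
# record where each match started.
while ( $seq =~ /($fuzzy)/g ) {
    printf "matched '%s' at offset %d\n", $1, $-[0];
    pos( $seq ) = $-[0] + 1;        # step back one so overlapping hits aren't missed
}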
Now it is possible to combine the keys generated from more than one exact key into a single Trie. At least in theory, this would allow the 100,000 exact keys plus all their fuzzy matches to be combined into one MegaTrie. This approximates a DFA, by answering all 100,000 * 2700 (2-miss) "Does it match?" questions in one go for each 25-char substring, reducing the problem to 976 * 30,000 = 29,280,000 lookups (here defined as a transition from Perl into C and back). When compared with the 954,528,000,000,000 otherwise required by the 2-miss scenario, this sounds very inviting.
The problem is that all you then know is that one of the 100,000 exact matches, or one of their 270,000,000 possible fuzzy matches, did or did not match each of the 29,280,000 substrings. So, you get to know something (that the 25-char substring at position P of sequence S matched something)--very quickly--but not what (which of the 100,000), nor why (exact, 1-miss or 2-miss).
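Because the keys are all fixed-length (25 chars in the real problem), the MegaTrie lookup effectively degenerates to one set-membership test per substring; a hash makes a convenient stand-in for a sketch (the key set below is made up):

use strict;
use warnings;

# The "MegaTrie" reduced to a single exists() test per substring. It tells
# you *something* matched at $pos, but not which exact key it came from,
# nor whether it was an exact, 1-miss or 2-miss match.
my $KEYLEN = 4;                                           # 25 in the real problem
my %mega   = map { $_ => 1 } qw( ACGT AAGT ACAT TTGT );   # union of all fuzzy keys

my $seq = 'GGACGTTCAAGTAG';
for my $pos ( 0 .. length( $seq ) - $KEYLEN ) {
    print "something matched at offset $pos\n"
        if exists $mega{ substr( $seq, $pos, $KEYLEN ) };
}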
At this point, I thought that maybe the MegaTrie idea could be used as a way of pre-eliminating some of the nearly 30 million substrings, prior to using (say) the XOR method to derive the required information. However, because the (Mega)Trie needs to be applied individually to each substring--unlike a regex, for example--and a Trie lookup involves either recursive or iterative sub-calls, it is slower than just using the XOR alone.
That means that there is no saving to be had using a Trie--even for this limited purpose.
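For comparison, the essence of the XOR method referred to above (a sketch; in the real problem this sits inside the loop over every sequence): xor-ing two equal-length strings leaves "\0" wherever the characters agree, so counting the non-NUL bytes counts the mismatches directly--and, unlike the Trie, it tells you which key matched and with how many misses.

use strict;
use warnings;

my $KEYLEN = 4;                    # 25 in the real problem
my $key    = 'ACGT';
my $seq    = 'GGACCTTCAAGTAG';

for my $pos ( 0 .. length( $seq ) - $KEYLEN ) {
    my $xor  = $key ^ substr( $seq, $pos, $KEYLEN );
    my $miss = $xor =~ tr/\0//c;   # count the positions that differ
    print "offset $pos: $miss-miss match of '$key'\n" if $miss <= 2;
}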
The ultimate expression of the Trie idea would be to write a DFA. Essentially this consists of a MegaTrie-like data structure and a loop to apply it successively down the length of a string, substring-by-substring, reporting the position(s) of the match(es) found. Much like the regex engine does, but failing early and moving on rather than backtracking*.
*It's worth noting that you can also use the 'cut' operator (?>...) to eliminate backtracking in the regex engine.
This would reduce the number of lookups (again defined as a transition from Perl into C and back) to 30,000. You still wouldn't know what matched, but the speed-up would offset the cost of the further elimination required.
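To give a feel for it, a hand-waving sketch of such a scanning loop (in Perl rather than C, with a hash-of-hashes standing in for the generated state table, and a tiny made-up key set):

use strict;
use warnings;

my @keys = qw( ACGT AAGT ACAT TTGT );    # stand-in for the generated key set

# Build the state table: a trie as a hash-of-hashes, '*' marking terminal states.
my %table;
for my $key (@keys) {
    my $node = \%table;
    $node = $node->{ $_ } //= {} for split //, $key;
    $node->{'*'} = $key;
}

# Advance all live paths through the table together, one character at a
# time, dropping a path the moment it fails -- no backtracking, and each
# character of the sequence is read only once.
my $seq = 'GGACGTTCAAGTACATGG';
my @active;                              # [ current node, start offset ]

for my $pos ( 0 .. length( $seq ) - 1 ) {
    my $char = substr( $seq, $pos, 1 );
    push @active, [ \%table, $pos ];     # a new path starts at every offset
    @active = grep { defined( $_->[0] = $_->[0]{ $char } ) } @active;
    for my $path (@active) {
        my $hit = $path->[0]{'*'};
        print "'$hit' found at offset $path->[1]\n" if defined $hit;
    }
}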
The problems include:
This problem involves a huge number of states, and those states are going to vary from run to run. So the task is not writing a single DFA, but writing a DFA generator, which is an order of magnitude more complex to do.
Not only would it need to be a fairly efficiently coded DFA, it would also need to be highly optimised in its use of memory (notoriously difficult) so that the huge number of states would fit into memory concurrently.
The regex engine has had a lot of very skilled programmers working on it for a very long time. If the OP is up to this task, he is in the wrong job as a biochemist (assumption!).
The gulf between what is theoretically possible and what is practical is very wide.
Firstly, I'd done all the above work in order to verify my position anyway, and I thought someone might benefit from it.
Secondly, the next time someone calls you to question over something you have posted:
This is especially true of a private message between old sparring partners.
Even if it says "Utter bollocks", which is not rude, but a colloquialism for "Utter rubbish" or "Completely in error".
You might be in error.
If it was worth posting, it is always worth a second (cool) assessment before you defend your post to the hilt.
Even if they sound remarkably similar at first reading.
He may always have learnt from prior encounters and/or further learning in the interim.
Especially when those statistics are produced using code you cannot show, running on data that is completely unrelated to the problem at hand.
Especially when he has already published analysis, code, and verifiable statistics.
Especially when the code you are asking him to produce would have to run against a system that you cannot show him.
Especially when that system uses data that is completely unrelated to the problem under discussion.
Especially when the information that system produces is completely inadequate for solving the problem at hand.
There is nothing wrong with being human.
But, if your mistake could lead others into spending considerable time and effort on fruitless explorations, do make sure that you clearly identify it when you realise.
FWIW, most of these have been applicable to me in the past. I'm no saint (in the other sense of the word) as you well know.
Oh! And if you ask for clarifications of the problem, do give the other guy the chance to produce them before:
Thirdly, maybe a view through the "other guy's eyes" of yesterday's conflagration will persuade you to review your final act in our off-post communications?