Fuzzy String matching with index?

mir192a has asked for the wisdom of the Perl Monks concerning the following question:

Hey - I'm working on a bioinformatics app. I need to do basically a fuzzy string match, and I need to determine the index at which the match was predicted (the length of said "match" would be cool too, but that is secondary).

I have tried String::Approx, String::Compare, agrep, egrep, of course regexs, String::Similarity, Text::Levenshtein, Algorithm::HowSimilar, et al. So far the best results have come by using agrep to do the initial fuzzy match (with the -# parm for # of allowed mismatches) and then using String::Approx aslice with the minimal distance option to determine the index and length. This is far from ideal.

The issues that I'm seeing with this strategy (besides the lack of cool factor):
1. Levenshtein inherently works for real strings (like the ones people use) not strings of RNA. For example, there are lots of "ACGG"s in my matched string, and the first one found probably isn't the one the fuzzy match found.

2. String::Approx plainly admits to have unreliable reports of both index and length - that is true.

3. The allowed mismatches parameters in String::Approx don't seem to be honored in all circumstances.

4. Speed - I will have to repeat this fuzzy match logic hundreds of thousands of times in a trial. Neither file from which I pull these strings will fit in memory.

5. Anchoring - I'd like to be able to anchor the match to the middle...the characters near the square brackets in string 3 are the most important.

Example: I want to know if string 1 kind of matches string 2. I'd really like to say "allow exactly 2 substitutions" or "tell me about stuff that matches 92%", etc. I'm agrepping (or amatching, both using Levenshtein) for string 1 in string 2. If nothing is found, I slice off a char from each end of string 2 and try the agrep/amatch again. I'm currently finding a "match" when string 2 has been reduced to string 3. At this time my reported index is 7, and length is 8. That is a good result - but it would be much better if the length was right every time, it was faster, and I could have finer control over the allowed edit distance.

string 1 - AATCCCTGGATTGAAGCGCaa
string 2 - 252: rs17347089 [Homo sapiens] TAATAAAAACTATATTCCTGGAATT[C/T]^GTTCTAAGTAAAAAAGTTTTCAAATT
string 3 - GGATTGA

If anyone knows of a better way to do this, I would greatly appreciate any advise at all. Thank you - Mandi

20060906 Janitored by Corion: Added code tags, as per Writeup Formatting Tips

Comment on Fuzzy String matching with index? Select or Download Code

Replies are listed 'Best First'.
Re: Fuzzy String matching with index? by BrowserUk (Patriarch) on Sep 05, 2006 at 04:01 UTC
This subject has come up before. My pure perl approach is explained in Re: Fuzzy Searching: Optimizing Algorithm Selection and there is a greatly improved performance version at Re^2: Fuzzy Searching: Optimizing Algorithm ( A few improvements).. Also of note is ysth & demerphq's compiled XS trie approach the code for which can be found in this Algorithm Showdown: Fuzzy Matching. Be warned: Both these threads contain ultimately futile, heated technical debate. I still have, several further iterations of performance improvement of my code kicking around. If you are seriously interested and want to try and take it forward, I could look them out. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^2: Fuzzy String matching with index? by ysth (Canon) on Sep 05, 2006 at 05:32 UTC
Both these threads contain ultimately futile, heated technical debate. Not ultimately futile, in that they ultimately lead demerphq to work on adding trie optimization to perl's regex engine. :)	[reply]
Re^3: Fuzzy String matching with index? by BrowserUk (Patriarch) on Sep 05, 2006 at 06:03 UTC
When is 5.10 likely to shrug off is mythical status and become manifest? Seems I first heard about all the good stuff that was gonna come in 5.10, defined-OR etc. way back in 2002. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^4: Fuzzy String matching with index? by Corion (Patriarch) on Sep 05, 2006 at 06:52 UTC
Re: Fuzzy String matching with index? by ysth (Canon) on Sep 05, 2006 at 03:40 UTC
A problem very like this came up in Fuzzy Searching: Optimizing Algorithm Selection. Discussion continued in Algorithm Showdown: Fuzzy Matching.	[reply]
Re: Fuzzy String matching with index? by iankorf (Acolyte) on Sep 06, 2006 at 23:27 UTC
Inexact string matching is still an active field of research both on the algorithmic and statistical fronts. Some of the more popular algorithms in bioinformatics index the strings and use heuristics to speed up the search. There are also optimizations tied to particular hardware features such as vector units and FPGAs. Most of the programs (eg. BLAST, BLAT, CrossMatch, Exonerate, FASTA, etc.) are written in C with a few exceptions (eg. PatternHunter uses Java). The concept of an edit distance is useful if you assume that all symbols are equally probable and substitutable, and there are no context dependencies. Biological sequences are not strings, however, and if one is looking for something biological, it often behooves one to use a stochastic grammar rather than a regular one. This is why hidden Markov models have become so popular. In protein space, HMMER is a popular software package (also written in C). Not many people use HMMs in nucleotide space, but they should. These books might help "Biological Sequence Analysis: A Probabilistic Approach": Durbin, Eddy, Krogh, and Mitchison, Cambridge University Press "BLAST": Korf, Yandell, and Bedell, O'Reilly & Associates "Algorithms on Strings, Trees, and Sequences": Gusfield, Cambridge University Press	[reply]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks