(Ovid) Re: Fuzzy Strings

Well, that depends upon how you define "best match." The first step in tackling any problem is defining it clearly. By best match, do you mean the most letters in common? If so, you could potentially takes all strings and break their letters into a hash with the value being the occurence of each letter and do a foreach loop over the keys and keep a count of the differences. Of course, this doesn't take into account the "order" of the letters. It would find "notnilC" as matching "Clinton."

If you do take the hash approach, you might want to consider letter frequency. Since the letter 's' is more frequent than the letter 'q', does that mean that 'said' is a closer match to 'laid' than 'qaid'? (yes, that's a word)

Also, do you know that with String::Approx that you can adjust the number of "edits"? For example, for a word with only two characters of difference, you can specify:

my @catches = amatch("plugh", ['2'], @inputs);
[download]

You could set the number of edits to 1 and if that doesn't return a list to examine, just keep increasing the number of edits until you get something.

You may also want to check out Text::Soundex which will encode words into four character strings that represents what they "sound" like. Then, you can compare the shorter strings. I don't know how reliable this is and it's only for the English language.

A final option to consider is Text::Metaphone, which does phonetic encoding of words. You could then check to see if words sound the same (yeah, I know, this is a longshot). I do not know if this is for languages other than English.

Since you have a "fuzzy" problem, there is going to be no simple solution to this problem and you will have quite a time working with this, I'm sure. However, it might make a nifty module for CPAN, when finished.

Cheers,
Ovid

Comment on (Ovid) Re: Fuzzy Strings Download Code