in reply to Fuzzy Strings
If you do take the hash approach, you might want to consider letter frequency. Since the letter 's' is more frequent than the letter 'q', does that mean that 'said' is a closer match to 'laid' than 'qaid'? (yes, that's a word)
Also, do you know that with String::Approx that you can adjust the number of "edits"? For example, for a word with only two characters of difference, you can specify:
You could set the number of edits to 1 and if that doesn't return a list to examine, just keep increasing the number of edits until you get something.my @catches = amatch("plugh", ['2'], @inputs);
You may also want to check out Text::Soundex which will encode words into four character strings that represents what they "sound" like. Then, you can compare the shorter strings. I don't know how reliable this is and it's only for the English language.
A final option to consider is Text::Metaphone, which does phonetic encoding of words. You could then check to see if words sound the same (yeah, I know, this is a longshot). I do not know if this is for languages other than English.
Since you have a "fuzzy" problem, there is going to be no simple solution to this problem and you will have quite a time working with this, I'm sure. However, it might make a nifty module for CPAN, when finished.
Cheers,
Ovid
|
---|