Try either String::Approx or String::Similarity. The two modules differ in their approach: the former returns all matches based on an "error" or "fuzziness" parameter, while the latter returns a similarity factor for two strings. Both can be tailored to your needs.
| [reply] |
No, that won't work because the modules are too general. "dhc15-E" and "Sony camcoder 15-e" should be a match, but something like "dhc16-E" should not match, as that will be a camcorder of a different type. But the modules you mention have no knowledge of what the strings mean, and will consider "dhc15-E" and "dhc16-E" quite similar, as they differ by only one character.
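To make the objection concrete, here is a minimal sketch of the effect. It uses Python's standard `difflib` as a stand-in for the Perl modules named above (an assumption for illustration only; their scoring will differ in detail), and the helper name `similarity` is hypothetical.

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score based purely on shared characters."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Same product, written two very different ways.
same_product = similarity("dhc15-E", "Sony camcoder 15-e")

# Different product, but only one character apart.
different_product = similarity("dhc15-E", "dhc16-E")

print(f"same product:      {same_product:.2f}")
print(f"different product: {different_product:.2f}")
```

A purely character-based score ranks the wrong pair higher, which is the problem described here.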
| [reply] |
This looks like you need to put some custom logic in it. From your example it looks like the substring "dhc" might be substituted with "Sony camcorder". Maybe you can try to use a number of such mappings to get a canonical form. I can also imagine that dashes and spacing may differ, so strip all non-alphanumeric characters before comparing.
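A rough sketch of that idea, written in Python for illustration. The alias table is hypothetical: the "dhc" entry comes from the example above, and the "camcoder" entry is an assumed spelling fix so that the misspelled variant normalizes too. The function name `canonical` is also an assumption.

```python
import re

# Hypothetical alias table: shorthand or misspelling -> canonical wording.
ALIASES = {
    "dhc": "sony camcorder",
    "camcoder": "camcorder",
}

def canonical(name: str) -> str:
    """Lowercase, expand known aliases, then drop everything but letters and digits."""
    text = name.lower()
    for short, full in ALIASES.items():
        # \b keeps the alias from matching in the middle of another word.
        text = re.sub(r"\b" + re.escape(short), full, text)
    return re.sub(r"[^a-z0-9]", "", text)

print(canonical("dhc15-E") == canonical("Sony camcoder 15-e"))
print(canonical("dhc15-E") == canonical("dhc16-E"))
```

Comparison is then exact equality on the canonical form, so a one-character difference in the model number still counts as a different product.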
| [reply] |
Perhaps you need a mixture of the two suggested approaches. When one of the string approximation packages finds the best possible match among the matches already known, you assign it as a best-guess match. You also add this match to a list for human review, pairing each matched string with its canonical product name. Once a human reviewer agrees that a match is good, it goes into the hash of known matches. You will never get 100%, as some very different products may be given the same name (e.g. an F15 could be an aircraft or a sunscreen).
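The workflow might be sketched like this, in Python with the standard `difflib` standing in for the approximation packages. The names `known`, `review_queue`, `match` and `approve`, the seed entry, and the 0.6 cutoff are all assumptions for illustration.

```python
import difflib

# Reviewed matches: lowercased variant -> canonical product name (seed entry is made up).
known = {"sony camcorder 15-e": "Sony Camcorder 15-E"}

# Best-guess matches waiting for a human: (original string, guessed canonical name).
review_queue = []

def match(name, cutoff=0.6):
    """Return the canonical name for `name`, guessing fuzzily if it is not yet known."""
    key = name.lower()
    if key in known:
        return known[key]
    candidates = difflib.get_close_matches(key, list(known), n=1, cutoff=cutoff)
    if not candidates:
        return None
    guess = known[candidates[0]]
    review_queue.append((name, guess))  # a guess only, until a reviewer confirms it
    return guess

def approve(name, canonical_name):
    """Called once a reviewer agrees: the variant becomes a known match."""
    known[name.lower()] = canonical_name

guess = match("Sony camcoder 15-e")   # not known yet, so it is guessed and queued
approve("Sony camcoder 15-e", guess)  # reviewer confirms
```

After approval the same string is answered from the hash directly and is not queued again.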
Cheers, R. | [reply] |
Do you know all the variations in advance? If you do, I would suggest a lookup table (hash). The keys would be all the variations you would expect, and the value would be whatever is in your array.
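For instance, sketched in Python with a dict playing the role of the hash. The entries reuse the strings from earlier in the thread; the names `VARIANTS` and `lookup` are hypothetical.

```python
# Every expected variation maps to the entry it should resolve to.
VARIANTS = {
    "dhc15-E": "Sony camcoder 15-e",
    "Sony camcoder 15-e": "Sony camcoder 15-e",
}

def lookup(name):
    """Return the mapped entry, or None when the variation was not anticipated."""
    return VARIANTS.get(name)

print(lookup("dhc15-E"))
print(lookup("dhc16-E"))
```

Anything not listed simply fails to match, so a near miss such as "dhc16-E" is never confused with "dhc15-E".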
| [reply] |