in reply to Some kind of fuzzy logic.
First, it helps to strip the noise from the company names, such as Inc, Co, Corp, GMBH, LTD, etc.
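A rough sketch of that cleanup step, in Python rather than Perl. The noise list and the function name `strip_noise` are placeholders of my own, not part of any library, and the list would need extending for real data.

```python
import re

# Hypothetical noise list; extend it for your own data.
NOISE_WORDS = {"inc", "co", "corp", "gmbh", "ltd"}

def strip_noise(name):
    """Lowercase a company name, drop punctuation, and remove noise tokens."""
    # Replace anything that is not a word character or whitespace with a space.
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in NOISE_WORDS)

print(strip_noise("Acme Widgets, Inc."))
```

Note that this removes a noise token wherever it appears, not only at the end of the name, which may or may not be what you want.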
String::Approx finds a distance by looking at insertions, deletions, and substitutions needed to transform one string to another.
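To illustrate the kind of distance being described, here is a plain dynamic-programming edit distance in Python. This is a generic textbook sketch, not String::Approx's own code or API; the function name `edit_distance` is my own.

```python
def edit_distance(a, b):
    """Count the insertions, deletions, and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i chars of a into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
```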
A different approach, which worked better for me, was to list all the substrings of length n in each string. I called these n-tuples (they are more commonly known as n-grams). For each name in one list, I computed the percentage overlap between its set of n-tuples and the n-tuple set of each name in the other list. The best values for the tuple length n were three or four.
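A minimal Python sketch of the n-tuple comparison. The post does not say exactly how the percentage was computed, so this assumes shared tuples divided by all distinct tuples (the Jaccard measure); the function names are my own.

```python
def ngrams(s, n=3):
    """Return the set of all substrings of length n in s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def overlap(a, b, n=3):
    """Fraction of n-tuples shared by a and b (assumed here: shared / all distinct)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0  # a string shorter than n has no n-tuples to compare
    return len(ga & gb) / len(ga | gb)

print(overlap("acme widgets", "acme widget"))
```

Because the tuples are kept in sets, repeated substrings count once and word order matters less than it does for an edit distance.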
Very close matches could be completely automated this way. For matches that were not so close, I finished the matching task manually. I made a web user interface that had a selection list of the match candidates ranked by the closeness of the match. The closeness was determined by the percentage of n-tuples that matched. I selected the best match for each entry from amongst these top-ranked match candidates.
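The split between automatic and manual matching could be sketched like this in Python. The `AUTO_ACCEPT` threshold of 0.8, the shortlist size, and all function names are assumptions of mine for illustration; the post does not give the cutoff that was actually used, and the similarity measure is again an assumed shared-over-distinct ratio.

```python
def ngrams(s, n=3):
    """Return the set of all substrings of length n in s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b, n=3):
    """Shared n-tuples divided by all distinct n-tuples (an assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def rank_candidates(name, candidates, n=3):
    """Return (score, candidate) pairs, best match first."""
    scored = [(similarity(name, c, n), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

AUTO_ACCEPT = 0.8  # hypothetical cutoff; tune it against your own data

def match(name, candidates, top=5):
    """Return (accepted_match, shortlist_for_manual_review)."""
    ranked = rank_candidates(name, candidates)
    if ranked and ranked[0][0] >= AUTO_ACCEPT:
        return ranked[0][1], []   # close enough to accept without review
    return None, ranked[:top]     # otherwise hand the top candidates to a person

print(match("acme widgets", ["acme widget", "apex gadgets", "zenith"]))
```

The shortlist returned in the second case is what would feed the ranked selection list in the user interface.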
Re: Approximate matching of company names
by the_0ne (Pilgrim) on Oct 19, 2003 at 18:36 UTC
by toma (Vicar) on Oct 30, 2003 at 04:26 UTC
by jelevin (Initiate) on Jul 02, 2011 at 17:45 UTC
by Corion (Patriarch) on Jul 02, 2011 at 18:10 UTC