jakeeboy has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem that I have yet to solve with String::Approx. We have 2 or 3 (if not more) databases with Physician and/or Group Names. Of course no one can adhere to a common way of entering the data, so now I'm left with matching up the 2-3 different ways the data is entered. I've tried and tried...and tried to get all the names to match but haven't found a way that works 95% of the time. Does anyone know something that could match these samples:

SD Physicians Med Grp-NC
San Diego Physicians Medical Group-North Coastal
SD Phys Med Grp-NC

And this is just one sample. Each Physician/Group Name could be spelled many different ways and I can't figure out how to match them up. I've tried compressing the string to remove all spaces and punctuation, then sorting the letters and checking whether at least 90% of the letters are in both. Then there is the problem that some databases/spreadsheets store a name as First Middle Last, Title and others as Last, First Middle, Title. If anyone has any ideas please let me know. Thanks

Replies are listed 'Best First'.
Re: Matching Names
by Sifmole (Chaplain) on May 16, 2001 at 21:50 UTC
    It seems there are three basic "deviations" you have to deal with:
    • Abbreviations -- San Diego -> SD
    • Ordering -- First Middle Last -> Last, First Middle
    • Misspellings
    Looks like a tall order to do without any feedback loop.

    I think I would choose to create a system to develop a "synonym" table. The automatic portion would check the "synonym" table to see if it has already been told to recognize a pair of selections as the same. After that check I might use (although I have never actually used) String::Approx with a reasonable deviation allowance. If String::Approx returned a "true" then I would probably feed the result into a group to be vetted by an actual human to confirm the match. This confirmation would then be fed into the "synonym" table to update it.

    It might even be worthwhile to create an "antonym" (sp?) table which holds previously proposed matches that were declined during human confirmation, so the same pair is never offered for vetting twice.

    Finally, all items which do not come up with a match are presented to the user to locate matches for them. If matches are noted by the user, then these matches are added to the "synonym" table as well.

    If this system is going to be run only once, then this process of confirm, update could be done interactively as each match is found. The hope would be that as more are processed, fewer are found that need to be vetted. This probably only makes sense if the data set is extremely large.
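A minimal sketch of the lookup-then-vet loop described above, not a tested implementation. The table layout, the sub names and the tolerance value are assumptions made for illustration, and a plain Levenshtein edit distance stands in for String::Approx so the example has no CPAN dependency.

```perl
use strict;
use warnings;
use List::Util qw(min max);

my %synonym;   # pair key => 1 : pairs a human confirmed as the same entity
my %antonym;   # pair key => 1 : pairs a human declined

# Order-independent key for a pair of names (hypothetical helper).
sub pair_key { return join "\0", sort @_ }

# Plain Levenshtein edit distance, used here in place of String::Approx.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Returns 'same', 'different', 'review' (needs a human) or 'unmatched'.
# $tolerance is the allowed edit distance as a fraction of the longer name.
sub classify {
    my ($x, $y, $tolerance) = @_;
    my $key = pair_key($x, $y);
    return 'same'      if $synonym{$key};
    return 'different' if $antonym{$key};
    my $longest = max(length($x), length($y)) || 1;
    my $ratio   = edit_distance(lc $x, lc $y) / $longest;
    return $ratio <= $tolerance ? 'review' : 'unmatched';
}

# Record the human's verdict so the pair is never asked about again.
sub record_verdict {
    my ($x, $y, $is_same) = @_;
    my $table = $is_same ? \%synonym : \%antonym;
    $table->{ pair_key($x, $y) } = 1;
    return;
}
```

Anything classified as 'review' would go to the human queue, and anything 'unmatched' would be presented for manual matching, with both verdicts stored through record_verdict. In a real run the two hashes would have to be saved to a file or database between sessions.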

    The "synonym"/"antonym" tables and the matching could be extended to work on "chunks" of the string. For instance, we could break "SD Phys Med Grp-NC" into pieces based on the spaces and dash. We could detect SD, Phys, Med, Grp, NC as abbreviations because they are not words in a dictionary. The synonym table could tell us that SD could be "San Diego" or "San Dimas", Phys could be "Physical" or "Physician" or "Physicians"... etc. Each combination of the possible expansions could then be tested for matches against the known names, as well as against the "synonym" table again.
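The chunk idea above can be sketched roughly as follows. The contents of %expansions and the sub names are hypothetical, and the dictionary check for spotting abbreviations is left out: any chunk missing from the table is simply kept as it is.

```perl
use strict;
use warnings;

# Hypothetical abbreviation table; each chunk maps to its possible expansions.
my %expansions = (
    SD   => ['San Diego', 'San Dimas'],
    Phys => ['Physical', 'Physician', 'Physicians'],
    Med  => ['Medical'],
    Grp  => ['Group'],
    NC   => ['North Coastal'],
);

# Split a name on spaces and dashes, then build every combination of
# expansions. Chunks that are not in the table are kept unchanged.
sub candidate_expansions {
    my ($name) = @_;
    my @candidates = ('');
    for my $chunk (grep { length } split /[\s-]+/, $name) {
        my @options = @{ $expansions{$chunk} || [$chunk] };
        my @next;
        for my $prefix (@candidates) {
            push @next, length($prefix) ? "$prefix $_" : $_ for @options;
        }
        @candidates = @next;
    }
    return @candidates;
}

# Reduce a name to lowercase letters and digits so that spacing and
# punctuation differences do not matter when comparing.
sub squash {
    my ($s) = @_;
    $s = lc $s;
    $s =~ s/[^a-z0-9]//g;
    return $s;
}

my $known = 'San Diego Physicians Medical Group-North Coastal';
my @hits  = grep { squash($_) eq squash($known) }
            candidate_expansions('SD Phys Med Grp-NC');
print "$_\n" for @hits;
```

The number of candidates is the product of the option counts, so a long name with many ambiguous chunks can grow quickly. Comparing against an index of squashed known names would keep the test itself cheap.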

    Just some random musings on this.... It touches on some aspects of natural language recognition which is, to me at least, an interesting area.

Re: Matching Names
by Albannach (Monsignor) on May 16, 2001 at 21:54 UTC
    A couple of thoughts come to mind, but of course the level of effort will depend on how important it is to get this correct, and how important to do it fast:

    • You could reduce each name to a list of initials with something like tr/A-Z//cd and compare them, but I suspect you will get a lot of false matches that way.
    • Do the organizations have some other unique identifier? Corporation numbers, physician licence numbers or something? Names are usually just a fallback when ID numbers are not available.
    • You could break each name into words, use Text::Soundex or similar (Text::Metaphone or Text::DoubleMetaphone) on the words, and somehow combine the results to compare the phrases.
    • I often have a somewhat similar problem with street names, and I end up using a simple hash to define "synonyms" but that always requires some manual intervention to define the hash, (or if I'm lucky I can use lat-longs to guess at synonyms). You may need to have someone sit in front of an interactive script to verify whether two similar-looking names are in fact the same, and then store them in a config file for later use in an automated script. The automated script should output anything that isn't a certain match as exceptions to be manually verified later.
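As a small illustration of the first suggestion in the list above, here is tr/A-Z//cd applied to the three sample names from the question. The sub name is made up for the example, and the fourth name is invented to show the kind of false match the list warns about.

```perl
use strict;
use warnings;

# Keep only the capital letters of a name: tr/A-Z//cd deletes every
# character that is NOT in A-Z.
sub initials {
    my ($name) = @_;
    (my $caps = $name) =~ tr/A-Z//cd;
    return $caps;
}

print initials('San Diego Physicians Medical Group-North Coastal'), "\n"; # SDPMGNC
print initials('SD Physicians Med Grp-NC'), "\n";                         # SDPMGNC
print initials('SD Phys Med Grp-NC'), "\n";                               # SDPMGNC

# An invented, unrelated name that reduces to the same initials:
print initials('San Dimas Pediatric Medical Group-North County'), "\n";   # SDPMGNC
```

The initials only line up when every source capitalizes consistently, so this is probably best used to shortlist candidates for a stricter comparison rather than as the match itself.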

    On the first/last name problem, you will have to somehow determine the order used in each source (you must be able to identify the fields, right?) and do the appropriate swapping of fields as you read. This will be annoying if you have a lot of sources and a lot of variations, but I don't think you'll be able to automatically detect which fields hold what with any certainty.
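A rough sketch of the per-source swapping, assuming only the two layouts mentioned in the question ("First Middle Last, Title" and "Last, First Middle, Title"). The format labels, the sub name and the sample name are invented for the example, and the format has to be supplied for each source because it is not detected.

```perl
use strict;
use warnings;

# Normalize a person's name to "First Middle Last, Title".
# $format is 'first_last' or 'last_first' (hypothetical labels) and must
# be known for the source the record came from.
sub normalize_person {
    my ($raw, $format) = @_;
    my @parts = split /\s*,\s*/, $raw;
    s/^\s+|\s+$//g for @parts;          # trim leftover outer whitespace
    return '' unless @parts;

    my ($name, $title);
    if ($format eq 'last_first') {
        my ($last, $rest);
        ($last, $rest, $title) = @parts;
        $name = defined($rest) ? "$rest $last" : $last;
    }
    else {
        ($name, $title) = @parts;
    }
    $name =~ s/\s+/ /g;                 # collapse repeated spaces
    return (defined($title) && length($title)) ? "$name, $title" : $name;
}

print normalize_person('Smith, John Q, MD', 'last_first'), "\n";
print normalize_person('John Q Smith, MD',  'first_last'), "\n";
```

Both calls print the same normalized string, which can then be fed to whatever fuzzy comparison is used for the rest of the matching. Names that contain extra commas (suffixes such as "Jr.") would need special handling that this sketch does not attempt.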

    --
    I'd like to be able to assign to an luser

Re: Matching Names
by chipmunk (Parson) on May 16, 2001 at 23:10 UTC
    You may find the module Lingua::EN::MatchNames useful, although I believe it's intended for matching names of people. I recall that The Perl Journal had an article on this module a few issues back. (As I'm writing this, the TPJ website does not yet have the old articles back online.)
Re: Matching Names
by traveler (Parson) on May 16, 2001 at 22:03 UTC
    You might look into this node. It implements the Soundex algorithm as described by Knuth, which helps match names with different spellings. While it clearly is not a whole solution, it might be a start for dealing with misspelled names.
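For reference, a compact sketch of Soundex applied word by word. This is not the code from the node mentioned above. It is the simple variant in which H and W are treated like vowels, and the sub names and the misspelled sample are made up for the example.

```perl
use strict;
use warnings;

# Simplified Soundex: first letter plus three digits.
sub soundex {
    my ($word) = @_;
    (my $s = uc $word) =~ tr/A-Z//cd;   # letters only, uppercased
    return '' unless length $s;
    my $first = substr($s, 0, 1);
    # Map letters to digit classes; vowels plus H, W, Y become 0.
    $s =~ tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
    $s =~ tr/0-6//s;                    # collapse runs of the same code
    $s = substr($s, 1);                 # drop the code of the first letter
    $s =~ tr/0//d;                      # drop the uncoded letters
    return substr($first . $s . '000', 0, 4);
}

# One Soundex code per word, joined into a key for the whole phrase.
sub phrase_key {
    my ($name) = @_;
    return join ' ', map { soundex($_) }
                     grep { /[A-Za-z]/ } split /[\s-]+/, $name;
}

print phrase_key('San Diego Physicians'), "\n";   # S500 D200 P225
print phrase_key('San Deigo Phisicians'), "\n";   # S500 D200 P225
print phrase_key('SD Phys'), "\n";                # S300 P200
```

As the last line shows, Soundex absorbs misspellings but does nothing for abbreviations, so it would have to be combined with some form of abbreviation expansion.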
Re: Matching Names
by tune (Curate) on May 16, 2001 at 21:43 UTC
    It seems really difficult, since you don't know what to expect.
    Try to expand the abbreviated names, e.g. SD => San Diego, or conversely abbreviate the full names, e.g. San Diego => SD, so that every source ends up in one common form.

    Hope it's worth a bit

    --
    tune