A couple of thoughts come to mind, but of course the level
of effort will depend on how important it is to get this correct, and how
important it is to do it fast:
- You could reduce each name to a list of initials with
something like tr/A-Z//cd and compare them, but I
suspect you will get a lot of false matches that way.
- Do the organizations have some other unique identifier? Corporation numbers, physician licence
numbers or something? Names are usually just a fallback when ID numbers are not available.
- You could break each name into words and use Text::Soundex or similar
(Text::Metaphone or Text::DoubleMetaphone) on the words, and somehow combine the results to compare the phrases.
- I often have a somewhat similar problem with street names, and I end up using a simple hash to define "synonyms", but that
always requires some manual intervention to define the hash (or, if I'm lucky, I can use lat-longs to guess at synonyms). You
may need to have someone sit in front of an interactive script to verify whether two similar-looking names are in fact the same, and then
store them in a config file for later use in an automated script. The automated script should output anything that isn't
a certain match as exceptions to be manually verified later.
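To make the suggestions above concrete, here is a rough sketch, not a tested solution. It assumes Text::Soundex is installed (it is on CPAN if your perl does not bundle it). The sub names, the %synonym entry and the three result labels are made up for illustration.

```perl
use strict;
use warnings;
use Text::Soundex;    # provides soundex()

# Initials: delete everything except capital letters.
sub initials {
    my ($name) = @_;
    (my $caps = $name) =~ tr/A-Z//cd;
    return $caps;
}

# Phrase key: one Soundex code per word, joined in word order.
sub phrase_key {
    my ($name) = @_;
    my @words = grep { length } split /[^A-Za-z]+/, $name;
    return join '-', map { soundex($_) } @words;
}

# Hand-maintained synonyms (hypothetical entry), consulted before anything fuzzy.
my %synonym = (
    'Intl Business Machines' => 'International Business Machines',
);

# Returns 'match', 'review' (queue for a human) or 'differ'.
sub compare_names {
    my ($x, $y) = map { exists $synonym{$_} ? $synonym{$_} : $_ } @_;
    return 'match'  if $x eq $y;
    return 'review' if phrase_key($x) eq phrase_key($y);
    return 'differ';
}
```

Only exact or synonym-table hits count as certain here. A phonetic hit such as 'Smith Clinic' against 'Smyth Clinic' comes back as 'review', so it lands in the exceptions pile instead of being merged silently.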
On the first/last name problem, you will have to somehow
determine the order used in each search (you must be able to
identify the fields, right?) and do the appropriate swapping of
fields as you read. This will be annoying if you have a lot of
sources and a lot of variations, but I don't think you'll be able
to automatically detect which fields hold what with any certainty.
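The swapping could be driven by a small per-source table, something like the sketch below. The source names ('registry', 'billing') and column positions are invented; you would fill them in by hand for each real source.

```perl
use strict;
use warnings;

# Hypothetical per-source layouts: which column holds which name part.
my %layout = (
    registry => { last  => 0, first => 1 },    # e.g. "Doe", "Jane"
    billing  => { first => 0, last  => 1 },    # e.g. "Jane", "Doe"
);

# Return a canonical "Last, First" string for one record.
sub canonical_name {
    my ($source, @fields) = @_;
    my $map = $layout{$source}
        or die "No field layout defined for source '$source'\n";
    return join ', ', @fields[ @{$map}{qw(last first)} ];
}
```

Dying on an unknown source is deliberate: a source with no declared layout gets noticed immediately instead of being compared in the wrong order.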
--
I'd like to be able to assign to an luser