in reply to Matching Names
I think I would choose to build a system around a "synonym" table. The automatic portion would first check the "synonym" table to see whether it has already been told to treat a pair of names as the same. After that check I might use (although I have never actually used it) String::Approx with a reasonable deviation allowance. If String::Approx reported a match, I would probably put the pair into a queue to be vetted by an actual human. The confirmation would then be fed back into the "synonym" table to update it.
It might even be worthwhile to create an "antonym" table which holds automatically proposed matches that were declined by the human reviewer, so that the same pair is not put up for vetting again.
Finally, all items which do not come up with a match are presented to the user to locate matches for them. If matches are noted by the user, then these matches are added to the "synonym" table as well.
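The triage described above could be sketched roughly as follows. This is in Python rather than Perl, and `difflib.SequenceMatcher` stands in for String::Approx, so the similarity score and the 0.8 threshold are assumptions, not String::Approx behaviour. The names `pair_key`, `triage` and `record_decision` are made up for illustration, and the two tables are plain in-memory sets.

```python
from difflib import SequenceMatcher


def pair_key(a, b):
    """Order-independent, case-insensitive key for a pair of names."""
    return frozenset((a.lower(), b.lower()))


def triage(candidate, known_names, synonyms, antonyms, threshold=0.8):
    """Classify a candidate as ('match', name), ('vet', name) or ('unmatched', None)."""
    best = None
    for name in known_names:
        key = pair_key(candidate, name)
        if key in synonyms:
            # A human already confirmed this pair: accept without asking again.
            return ("match", name)
        if key in antonyms:
            # A human already declined this pair: never propose it again.
            continue
        # Stand-in for the fuzzy check (assumption: ratio >= threshold is "close enough").
        score = SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, name)
    if best is not None:
        return ("vet", best[1])      # goes to the human vetting queue
    return ("unmatched", None)       # user has to locate a match by hand


def record_decision(candidate, name, accepted, synonyms, antonyms):
    """Feed the human's answer back into the synonym or antonym table."""
    (synonyms if accepted else antonyms).add(pair_key(candidate, name))
```

A candidate that comes back as "vet" is shown to a person; whichever way they decide, `record_decision` stores the answer, so the next run either matches automatically or stays quiet about that pair.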
If this system is going to be run only once, then this confirm-and-update process could be done interactively as each match is found. The hope would be that as more names are processed, fewer are found that need to be vetted. This probably only makes sense if the data set is extremely large.
The "synonym"/"antonym" tables and the matching could be extended to work on "chunks" of the string. For instance, we could break "SD Phys Med Grp-NC" into pieces at the spaces and the dash. We could detect SD, Phys, Med, Grp and NC as abbreviations because they are not words in a dictionary. The synonym table could tell us that SD could be "San Diego" or "San Dimas", that Phys could be "Physical", "Physician" or "Physicians", and so on. Each combination of the possible expansions could then be tested for matches against the known names, as well as against the "synonym" table again.
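As a rough illustration of the chunk idea, here is a minimal Python sketch. The `EXPANSIONS` table is hypothetical: the entries for SD and Phys come from the example above, while "Medical" and "Group" are guesses added for illustration. The sketch skips the dictionary check and simply leaves any chunk without an entry (NC here) unchanged.

```python
import re
from itertools import product

# Hypothetical chunk-level synonym table, keyed by lower-cased abbreviation.
EXPANSIONS = {
    "sd": ["San Diego", "San Dimas"],
    "phys": ["Physical", "Physician", "Physicians"],
    "med": ["Medical"],   # assumed expansion
    "grp": ["Group"],     # assumed expansion
}


def split_chunks(name):
    """Break a name into pieces at runs of whitespace and dashes."""
    return [chunk for chunk in re.split(r"[\s\-]+", name) if chunk]


def expand(name, expansions=EXPANSIONS):
    """Return every combination of the possible expansions of each chunk."""
    options = [expansions.get(chunk.lower(), [chunk]) for chunk in split_chunks(name)]
    return [" ".join(combo) for combo in product(*options)]


candidates = expand("SD Phys Med Grp-NC")
```

Each string in `candidates` can then go through the same synonym check and fuzzy match as any other name. The number of combinations is the product of the option counts per chunk, so a long name with many ambiguous chunks would need some pruning.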
Just some random musings on this.... It touches on some aspects of natural language recognition which is, to me at least, an interesting area.