in reply to Merge/Purge address data

Processing duplicate, manually entered, free-form records is hard to do by hand, and darn near impossible by machine.

It's hard enough parsing names and streets; sometimes even figuring out what city someone is in can be challenging.

I would begin by parsing out the street name and number, and the person's last name. Use those to generate a similarity index for people at the same address, and flag any pair whose score exceeds some value specified on the command line. Then you can examine those pairs personally and make a decision.
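Here is a rough sketch of that idea in Python, not a finished tool. Everything in it is an assumption for illustration: the "Name, number street" record layout, the helper names (parse_record, flag_candidates), the --threshold and --input options, and the choice of difflib's SequenceMatcher as the similarity measure. Real free-form data will need a much smarter parser.

```python
import argparse
import re
from difflib import SequenceMatcher
from itertools import combinations

# Made-up sample data, used only when no --input file is given.
SAMPLE = [
    "Thomas T. Legrady, 12 Main Street",
    "Tom Legrady, 12 Main St",
    "Alice Smith, 99 Oak Avenue",
]


def parse_record(line):
    """Naively split an assumed 'Name, 123 Street Name' record into parts."""
    name, _, address = line.partition(",")
    words = name.split()
    last = words[-1].lower() if words else ""  # breaks on suffixes like "Jr."
    m = re.match(r"\s*(\d+)\s+(.*)", address)
    if m:
        number, street = m.group(1), m.group(2).strip().lower()
    else:
        number, street = "", address.strip().lower()
    return {"raw": line.strip(), "last": last, "number": number, "street": street}


def similarity(a, b):
    """Similarity in [0, 1] from difflib; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def flag_candidates(lines, threshold):
    """Return (score, record_a, record_b) for pairs worth a human look."""
    records = [parse_record(line) for line in lines if line.strip()]
    flagged = []
    for r1, r2 in combinations(records, 2):
        if r1["number"] != r2["number"]:
            continue  # different house numbers: treat as different addresses
        # Average the street and last-name similarities into one index.
        score = (similarity(r1["street"], r2["street"])
                 + similarity(r1["last"], r2["last"])) / 2
        if score >= threshold:
            flagged.append((score, r1["raw"], r2["raw"]))
    return sorted(flagged, reverse=True)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Flag possible duplicates")
    parser.add_argument("--threshold", type=float, default=0.8)
    parser.add_argument("--input", help="file with one record per line")
    args = parser.parse_args(argv)
    if args.input:
        with open(args.input) as fh:
            lines = fh.readlines()
    else:
        lines = SAMPLE
    for score, a, b in flag_candidates(lines, args.threshold):
        print(f"{score:.2f}  {a}  <->  {b}")


if __name__ == "__main__":
    main()
```

The script only flags pairs; it never merges anything. The decision stays with whoever reviews the list, for the reason in the next paragraph.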

Mind you, my father was Thomas T. Legrady, and so am I. I occasionally get people who think I've been dead for ten years, but I insist on continuing to breathe.

--
TTTATCGGTCGTTATATAGATGTTTGCA

Replies are listed 'Best First'.
Re: Re: Merge/Purge address data
by qq (Hermit) on Nov 11, 2003 at 21:37 UTC

    And the IRS has three times confused me with my father, who shares the same name (and, yes, we long ago shared the same address). He is not a US citizen, however. My theory is that, confronted with two similar records, one with a SSN and one without, they assume the one without is just missing it, and merge them.

    To the OP - this is a very difficult problem, IMHO. So you need to think about what you can do if you have a low confidence in your match - is the information still useful?

    There may also be many times when you have two records that appear to match, but in reality refer to very different beings. I guess it all depends on your data.

    There are various tools to canonicalize an address, but I don't know of any free ones that aren't for personal use only. See http://www.cedar.buffalo.edu/adserv.html