cleverett has asked for the wisdom of the Perl Monks concerning the following question:

I want to write a script to automatically merge name and address data from one database table into another.

I'd like my script to know that the two addresses in the right column below are very likely the same address, that the two names in the left column are likely the same name, and to automatically merge

+-----------------------+------------------------+
| Dr. Alan Balfour,     | 123 North Main, Apt. D |
| A. Balfour, MD, FACPS | 123 N. Main #D         |
+-----------------------+------------------------+

into a single record, especially when both records hold other information that's very similar.

I can think of 3 different approaches:

  1. use Bayes' Theorem (kind of like SpamAssassin)
  2. Fuzzy Logic
  3. Homebrew Heuristics

Personally, I think that the Bayes approach would probably serve best, but I'm looking for all the clue I can get before I start serious hacking on this.

Re: Merge/Purge address data
by liz (Monsignor) on Nov 11, 2003 at 09:12 UTC
    Before you start doing this, I think you also need to consider the country from which the address info comes. Your examples are typically American. Addresses are done differently within the EU, and still differently in India, Japan and China.

    One example I think might trip you up is the title Drs., which in the Netherlands means "Person working on his first doctorate thesis" and in Germany means "Person having completed multiple doctorate theses".

    Ok, maybe the example is a bit artificial, but it should give you an indication of what you're getting yourself into.

    Good luck! And make it a CPAN module so that others may benefit!

    Liz

    Update:
    I just thought of a typical address in the Netherlands:

    Admiralengracht t.o. 281

    The "t.o." is Dutch for "tegenover", which means "opposite". It indicates the address of a houseboat, opposite the house numbered "281" on the "Admiralengracht".

      And you should see what they do in Thailand!

      Not to mention how a naive person from a developing country might write any type of address ...

      I believe I need to limit the problem space to US addresses, rather than solve the problem for the global address space ...

      But as far as I can tell, a large enough training set would enable a Bayesian or fuzzy solution to distinguish 'twixt Dutch and German 'Drs.'

      PS. How uncanny! My younger sister has a linguistics doctorate, her first language is German, her second Dutch!

Re: Merge/Purge address data
by castaway (Parson) on Nov 11, 2003 at 11:17 UTC
    I think the best way is to use something that works for each part specifically.. (e.g. something for a name, surname, address, zip etc.) and which you can teach which parts should be regarded as equivalent.. (e.g. MD = Dr., A. can equal Alan, etc.)

    As someone else said, comparing bits which aren't likely to change is a good start: the house number, the surname, the zip code..

    Good luck with this. BTW, my company sells a pretty expensive product that does this sort of thing to merge telephone subscriber data.. :) Maybe I can find out our approach.. (I don't work in that dept.. but.. :)

    C.

      so:

      1. parse names and addresses into more discrete chunks
      2. consider some columns more relevant than others

      Boy, 2 looks like a fuzzy logic thing to me ... maybe some kind of scoring system ...

Re: Merge/Purge address data
by EvdB (Deacon) on Nov 11, 2003 at 10:30 UTC
    I would suggest that you identify which bits of the data are most likely to vary if the addresses are different. For example, if the addresses have zip codes attached and those differ, the addresses differ too.

    Basically I would only get cosy with Bayesian filters et al. after wiping out similar addresses with identical zip codes or the like (a sketch of that kind of pre-filter follows).
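
    A minimal sketch of that kind of pre-filter; the records and field names here are made up for illustration:

    use strict;
    use warnings;

    # Bucket records by zip code so only records sharing a zip are ever
    # compared; the expensive fuzzy matching then runs inside each bucket
    # instead of across the whole table.
    my @records = (
        { id => 1, zip => '10001', street => '123 North Main, Apt. D' },
        { id => 2, zip => '10001', street => '123 N. Main #D'         },
        { id => 3, zip => '94103', street => '55 Elm Street'          },
    );

    my %by_zip;
    push @{ $by_zip{ $_->{zip} } }, $_ for @records;

    for my $zip (sort keys %by_zip) {
        my @bucket = @{ $by_zip{$zip} };
        next if @bucket < 2;    # nothing to compare against
        print "zip $zip has ", scalar @bucket, " candidate duplicates\n";
    }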

    --tidiness is the memory loss of environmental mnemonics

      It's very unpredictable what we get.

      OTOH, we can get reams of seemingly unrelated data, such as IP addresses, that might help figure out what's happening.

      And it's difficult to distinguish and search for "similar" addresses. I thought Bayesian filters or fuzzy logic might provide a means to that end.

      The address data are left by users via web form.

      Depending on whether they clean out their cookies, use a different computer, or the phase of the moon, they may or may not end up creating a duplicate address record.

      Even having the same email wouldn't necessarily make 2 records the same person. In our database, for example, we have 50 husband-wife pairs sharing the same email address. With one pair, the records differ in only two respects: a different first name and a different CV.

      After thinking about it most of the night, what I think I need to accomplish is:

      1. develop a way of measuring how similar two strings are to each other, where 0 means nothing in common and 1 means identity (see the sketch after this list)
      2. tweak the above a bit for different columns ... 123 Main Street and 123 Elm Street are absolutely different addresses
      3. look at what the level of similarity each column in 2 different rows says about how similar the records are as a whole
      4. develop a metric for measuring the quality of data so that I can select "Alan F. Balfour" over "A Balfour" when I merge the data
      5. figure out how to cut down the number of candidates my algorithm attempts to match a particular row against

      The last one looks tricky to me; I might be able to live without it.
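
      A minimal sketch of step 1, assuming Text::Levenshtein from CPAN; normalizing by the longer string's length is my own choice, not something the module does:

      use strict;
      use warnings;
      use Text::Levenshtein qw(distance);
      use List::Util qw(max);

      # Normalized similarity: 1 for identical strings, approaching 0 for
      # strings with nothing in common (step 1 above).
      sub similarity {
          my ($x, $y) = @_;
          my $len = max(length $x, length $y);
          return 1 unless $len;    # two empty strings are identical
          return 1 - distance($x, $y) / $len;
      }

      print similarity('123 North Main', '123 N. Main'), "\n";    # ~0.71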

        Depending on how many addresses you have, and whether this is a one-off, you might be well off doing 4 yourself. You could write a script that does 1, 2 and 3 and then interactively asks you whether to merge and what to merge.

        In fact if you did this then you could add bits of code to do 4 as you went along. This way you will see the problems that are cropping up and will have a good idea of what is required to fix them.

        Added: As I understand it, Bayesian filters learn from experience. Maybe you could get the filter to look at your decisions above and learn from them. There's potential to wander off into AI and expert systems here.

        --tidiness is the memory loss of environmental mnemonics

Re: Merge/Purge address data
by exussum0 (Vicar) on Nov 11, 2003 at 13:47 UTC
    I offer no concrete solution, but I will explain why you can't do Bayes. Bayes' theorem is a classification method. You classify a subset as having some quality and then ask the question, "does this unclassified text belong in that subset?". It does that with statistics on the reuse of words across the subset and the chance they will be seen again.

    i.e. if 80% of a predetermined set of "smart essays" contained the word "oxymoron" or "phlegm", then if a new essay came along containing the word "phlegm", there's a high chance it would be smart and a very low chance it wouldn't be.

    So, if you have a "high usage" of smart words, you'd have a "smart essay". SpamAssassin (w/ Bayes) works the same way. Anytime it finds spam, classified by you or the system (which figures it is spam by other methods), it analyzes it and says, "this mail has these words in it; adjust how common the words are in my prior knowledge and increase the relative percentages, so that if these words appear, it's prolly a spam message". Towards the opposite end of things, if a message isn't spam, it lowers the percentages of those words appearing in non-spam.

    It's why Bayes filtering requires training. If you get a lot of mail about Visual Basic that you didn't solicit, but have marked as spam, those words would increase the chance that a new message is classed as spam. But if I were a Visual Basic programmer, that'd be stupid, since I'd expect a lot of mail about Visual Basic and may be interested. So "basic" might have a high percentage for you and a low one for me. Thus, the training process.

    So you see, classifying how common one thing is to another requires a sample as an example. Having variations of people's names and classifying them as "yeah, this is bob smith" or "no, this isn't" would require prior analysis of examples of various combinations. For instance, I sign my name in one of 3 ways. If a 4th variation came up, my signif other would say, "I've seen how he commonly does it.. this 4th way doesn't have the qualities of the other 3, so it most likely isn't his."

    So unless you feel like sitting and building various variations by hand, then verifying them later to train your filter for various addresses (yes, Dr. Alan, Dr Alan Belfour and Dr Alan Belfour MD are the same person) and generating a small database per person on who is and who isn't said person, don't do Bayes. It doesn't work on arbitrary data w/o training on that type of data. Your types change, as each person is a type w/ a specific address. Plus, you'd create a database for each "is this vs. is this not" comparison. For small n people, it'd be great.

    You might get away w/ some other homebrew. I.e. stripping all abbreviations, adding common ways of representing various parts of the name to a sorted array, and seeing which array it is most similar to. Using soundex to compare the parts would work (a sketch follows). It's not Bayes, since you haven't done any statistical analysis to train it (haven't trained the statistical analysis?). All in all, this can be a very expensive op, since humans like to type freeform, and freeform typing is always hard to analyze. "William gates" "bill gates" "gates, bill" "mr bill gates ceo" ....
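
    For what it's worth, a minimal sketch of the soundex part, using the standard Text::Soundex module; the sample names are made up:

    use strict;
    use warnings;
    use Text::Soundex;

    # Soundex collapses similar-sounding surnames into one short code,
    # so spelling variants can be bucketed under the same key.
    for my $name (qw(Balfour Balfor Gates)) {
        printf "%-8s => %s\n", $name, soundex($name);
    }
    # Balfour and Balfor both map to B416; Gates maps to G320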


    Play that funky music white boy..

      Good explanation of Bayes. Certainly clarified things for me ...

      I was actually thinking of doing Bayes on different ways that the data could get mangled.

      I think I have an idea of what might work:

      1. Compute the "distance" from one string to another in terms of 3 different operations.
        • inserting a character
        • deleting a character
        • swapping substrings
        Having it run from 0 being as far apart as possible to 1 being identical strings would be ideal.
      2. Manually figure out a point scoring system so that for each column in the two records I add and subtract points from an overall score based on a function of the distance separating the two strings.
      3. Call the records duplicates when the overall score goes over a predefined threshold.

      So, superficially at least, it looks like SpamAssassin ... (a rough sketch of steps 2 and 3 follows)
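
      A rough sketch of steps 2 and 3; the column names, weights and threshold are made up and would need hand-tuning, and similarity() is any 0-to-1 string metric like the one sketched earlier in the thread:

      use strict;
      use warnings;

      my %weight    = (last_name => 4, street => 3, zip => 5, first_name => 2);
      my $THRESHOLD = 7;

      sub duplicate_score {
          my ($rec_a, $rec_b, $sim) = @_;
          my $score = 0;
          for my $col (keys %weight) {
              # rescale 0..1 similarity to -1..1 so dissimilar columns
              # subtract points from the overall score (step 2)
              $score += $weight{$col}
                      * (2 * $sim->($rec_a->{$col} // '', $rec_b->{$col} // '') - 1);
          }
          return $score;
      }

      # call the records duplicates when the score clears the threshold (step 3):
      # merge($rec_a, $rec_b) if duplicate_score($rec_a, $rec_b, \&similarity) >= $THRESHOLD;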

        1. Compute the "distance" from one string to another in terms of 3 different operations.
          • inserting a character
          • deleting a character
          • swapping substrings

        Are you talking about Levenshtein distance? From the documentation of Text::Levenshtein:

        The Levenshtein edit distance is a measure of the degree of proximity between two strings. This distance is the number of substitutions, deletions or insertions ("edits") needed to transform one string into the other one (and vice versa). When two strings have distance 0, they are the same.

        There is also a Text::LevenshteinXS. I have not used either of these modules myself, though, so I can't say anything more about their qualities.

        Perhaps this might save you some work. Good luck!

        pernod
        --
        Mischief. Mayhem. Soap.

Re: Merge/Purge address data
by JamesNC (Chaplain) on Nov 11, 2003 at 11:52 UTC
    I think you need to rethink how you are storing your data in the database. (If it is relational, I mean.) If you stored a unique_id for Dr. Balfour, this would be a trivial issue of joining several tables on his id. And, while you might be somewhat successful at doing some complex merge, typos, data-entry and human errors will further corrupt your best algorithm (i.e. how will you verify the data you distill? Hence the old adage about garbage in, garbage out). I think the much simpler answer is to add keys to your existing data and normalize it so that you can use the power of a relational database. That is the whole reason they were invented, IMHO. Cheers, JamesNC

      Heh,

      No stranger to third normal form am I. I think you miss my point.

      Information coming off a web form where a strict "log in with a password to use the site" policy is NOT in effect will contain tons of blank records and near duplicates. But making users do the extra work of selecting a user name and password raises a barrier, which means we get a lot less data.

      The solution I have in place:

      1. gathers data from a web form into one set of tables
      2. cleans up that data
      3. merges cleaned data into another table for other work

      I'm looking to automate steps 2 and 3.

        Depending on what your site is offering, you may want to give users the option to select a "code" that they can enter to automatically fill in their name and address.

        This of course assumes the user has some reason to fill in address information with some attempt at accuracy.

        Just some thoughts on how dageek might try it: create some standardized wording. Split on whitespace; replace MD with DR in name fields (you will build a list of others). In the address fields, replace NORTH with N., SOUTH and SO with S., and so on.

        Now compare for exact matches of words and numbers.

        Store the "standardized" name and address in the main database for future compares. (A sketch of such a pass is below.)
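
        A minimal sketch of that kind of pass; the substitution tables below are illustrative, nowhere near complete:

        use strict;
        use warnings;

        my %name_subs = ( 'MD' => 'DR', 'DOCTOR' => 'DR' );
        my %addr_subs = ( 'NORTH' => 'N', 'SOUTH' => 'S', 'SO' => 'S',
                          'STREET' => 'ST', 'APT' => '#' );

        # Uppercase, split on whitespace, strip punctuation, then apply
        # the substitution table so variants compare word for word.
        sub standardize {
            my ($text, $subs) = @_;
            my @words = split /\s+/, uc $text;
            s/[.,]//g for @words;
            return join ' ', map { $subs->{$_} // $_ } @words;
        }

        print standardize('123 North Main, Apt. D', \%addr_subs), "\n"; # 123 N MAIN # D
        print standardize('A. Balfour, MD',         \%name_subs), "\n"; # A BALFOUR DR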

        You have a good piece of work ahead of you no matter what method you choose - Good Luck!

        dageek

Re: Merge/Purge address data
by gwhite (Friar) on Nov 11, 2003 at 15:07 UTC

    I used to do this kind of stuff back in the day on mainframes. You need an intermediate step, where you break all the lines into individual elements. Assuming the majority are US addresses, the name line would become (prefix, first, middle, last, suffix, postfix), the street line would get broken down into (box, street_name, direction, type, unit_number, unit_type), and the city line into (city, state, zip, zip+4), so your examples above become (for the street line):

    123D, main, north, street,,,
    123, main, north, street, D, apt,
    123, main, north, street, D,,

    I added "street" to the type; the types will be street, avenue, court, circle, way, etc. Also, N., N and other direction abbreviations would need to be standardized. This applies to the prefix, suffix and postfix portions of the name as well. Once all that is done, sort by zip, box, street_name, direction, type, unit_number, unit_type, and last (name), as in the sketch below. Then it is pretty easy to score matches as loosely or tightly as you want. The USPS has addressing standards and guidelines which will help you clean up the list. The results will match your effort in rooting out the abbreviations.
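
    A minimal sketch of that sort, assuming the records are already broken out into fields (the field names and sample data are made up):

    use strict;
    use warnings;

    my @records = (
        { zip => '98101', box => '123', street_name => 'main', direction => 'n',
          type => 'st', unit_number => 'd', unit_type => 'apt', last => 'balfour' },
        { zip => '98101', box => '123', street_name => 'main', direction => 'n',
          type => 'st', unit_number => 'd', unit_type => '',    last => 'balfour' },
    );

    # Composite sort key: near-duplicates land next to each other, so the
    # scoring pass only has to compare neighbours, not every pair of rows.
    my @key_fields = qw(zip box street_name direction type unit_number unit_type last);
    my $key = sub { my ($r) = @_; join "\0", map { lc($r->{$_} // '') } @key_fields };

    my @sorted = sort { $key->($a) cmp $key->($b) } @records;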

    g_White
      the street line would get broken down(box, street_name, direction, type, unit_number, unit_type)
      Be careful here. In Seattle (and I think DC and a few other areas), the street "123rd Street NE" is perpendicular to "NE 123rd Street". That is, the "NE" modifier moves after or before the street name to show whether it's a north-south street or an east-west street. Yes, you can live at the intersection of those two streets!

      And, my former business address was 0333 SW Flower St, to distinguish it from 333 SW Flower St. The "0" in front essentially means "negative", so the "03" block is east of the "02" block, then "01" block then "1" block, "2" block, and the "3" block where the other address was, about six blocks west. So you can't just rip leading 0's either.

      So, be careful not to oversimplify.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        Realistically, for how many addresses does that happen? 500-1000 in each city? Keep in mind that we are looking for duplicates, so all those exceptions would have to occur together with a match on the same last name (at least, that would be a strong match point in my calculation), and the worst that could happen is that you don't send your pretty newsletter or mail order catalog to one of the two individuals.

        In this instance, program for the obvious and let the exceptions fall where they may. It is not worth the programming time or effort to handle the 100,000 exception addresses among the millions of US addresses.

        g_White
Re: Merge/Purge address data
by Anonymous Monk on Nov 11, 2003 at 13:12 UTC
    The ultimate solution sounds quite complex; you may find it helpful to use one of these CPAN modules from http://search.cpan.org/~kimryan/ in processing your data as part of a bigger solution (a usage sketch follows the list):
    Lingua::EN::AddressParse
    Lingua::EN::NameParse
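
    A usage sketch from memory of the Lingua::EN::AddressParse POD; check the current docs before relying on it, as constructor options and method names may differ across versions:

    use strict;
    use warnings;
    use Lingua::EN::AddressParse;

    my $parser = Lingua::EN::AddressParse->new(country => 'US');

    # parse() returns a true error flag when it can't handle the input
    my $error = $parser->parse('123 North Main St Apt D Anytown NY 10001');
    unless ($error) {
        my %comps = $parser->components;    # e.g. street, suburb, post_code
        print "$_ => $comps{$_}\n" for sort keys %comps;
    }
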
Re: Merge/Purge address data
by TomDLux (Vicar) on Nov 11, 2003 at 18:06 UTC

    Processing duplicate, manually entered, free-form records is hard to do by hand, and darn near impossible by machine.

    It's hard enough parsing names and streets; sometimes even figuring out what city someone is in can be challenging.

    I would begin by parsing out the street name and number, and the person's last name. Use those to generate a similarity index for people at the same address, and flag those with a match exceeding some command-line-specified value. Then you can examine those personally and make a decision.

    Mind you, my father was Thomas T. Legrady, and so am I. I occasionally get people who think I've been dead for ten years, but I insist on continuing to breathe.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      And the IRS has three times confused me with my father, who shares the same name (and, yes, we long ago shared the same address). He is not a US citizen, however. My theory is that, confronted with two similar records, one with a SSN and one without, they assume the one without is just missing it, and merge them.

      To the OP - this is a very difficult problem, IMHO. So you need to think about what you can do if you have a low confidence in your match - is the information still useful?

      There may also be many times when you have two datasets that appear to match, but in reality refer to very different beings. I guess it all depends on your data.

      There are various tools to canonicalize an address, but I don't know of any free ones that aren't for personal use only. See http://www.cedar.buffalo.edu/adserv.html