in reply to Re: Merge/Purge address data
in thread Merge/Purge address data

What we get is very unpredictable.

OTOH, we can get reams of seemingly unrelated data, such as IP addresses, that might help figure out what's happening.

It is also difficult to distinguish and search for "similar" addresses. I thought Bayesian filters or fuzzy logic might provide a means to that end.

The address data are left by users via web form.

Depending on whether they clean out their cookies, use a different computer, or the phase of the moon, they may or may not end up creating a duplicate address record.

Even having the same email address wouldn't necessarily make two records the same person. In our database, for example, we have 50 husband-wife pairs sharing the same email address. For one pair, the records differ in only two respects: a different first name and a different CV.

After thinking about it most of the night, what I think I need to accomplish is:

  1. develop a way of measuring how similar two strings are to each other, where 0 means nothing in common and 1 means identity.
  2. tweak the above a bit for different columns ... 123 Main Street and 123 Elm Street are absolutely different addresses, even though most of the characters match
  3. look at what the level of similarity of each column in two different rows says about how similar the records are as a whole
  4. develop a metric for measuring the quality of data so that I can select "Alan F. Balfour" over "A Balfour" when I merge the data
  5. figure out how to cut down the number of candidates my algorithm attempts to match a particular row against

The last looks tricky to me; I might be able to live without it.
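
For what it's worth, steps 1 to 3 could be sketched roughly like this in Python. This is only a sketch under assumptions: the column names (`first_name`, `last_name`, `street`, `email`), the weights, and the list of generic street words are all made up for illustration, and `difflib.SequenceMatcher` from the standard library stands in for whatever similarity measure ends up being used.

```python
from difflib import SequenceMatcher

# Hypothetical column weights and street words; tune both against real data.
WEIGHTS = {"first_name": 2.0, "last_name": 3.0, "street": 3.0, "email": 1.0}
STREET_WORDS = {"street", "st", "road", "rd", "avenue", "ave"}


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def string_similarity(a, b):
    """Step 1: a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def street_similarity(a, b):
    """Step 2: compare house numbers exactly and ignore generic words such as
    'Street', so '123 Main Street' and '123 Elm Street' do not look alike."""
    def split(address):
        tokens = normalize(address).replace(".", "").replace(",", "").split()
        numbers = [t for t in tokens if t.isdigit()]
        names = [t for t in tokens if not t.isdigit() and t not in STREET_WORDS]
        return numbers, " ".join(names)

    numbers_a, name_a = split(a)
    numbers_b, name_b = split(b)
    if numbers_a != numbers_b:
        return 0.0  # different house numbers: treat as different addresses
    return string_similarity(name_a, name_b)


def record_similarity(rec_a, rec_b, weights=WEIGHTS):
    """Step 3: weighted average over the columns that both records fill in."""
    total = weight_sum = 0.0
    for column, weight in weights.items():
        a, b = rec_a.get(column, ""), rec_b.get(column, "")
        if not a.strip() or not b.strip():
            continue  # a missing value is no evidence either way
        compare = street_similarity if column == "street" else string_similarity
        total += weight * compare(a, b)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

With weights like these, a husband-wife pair sharing last name, street and email still scores below 1.0 because the first names differ; where to put the "probably the same person" threshold is something only the real data can tell you.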

Re: Re: Re: Merge/Purge address data
by EvdB (Deacon) on Nov 11, 2003 at 11:27 UTC
    Depending on how many addresses you have and whether this is a one-off, you might be well off doing 4 yourself. You could write a script that would do 1, 2 and 3 and then interactively ask you whether to merge and what to merge.

    In fact if you did this then you could add bits of code to do 4 as you went along. This way you will see the problems that are cropping up and will have a good idea of what is required to fix them.
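
    A rough Python sketch of how the "do 4 by hand, then automate it bit by bit" idea might look. Everything here is an assumption for illustration: `quality` is a made-up heuristic (more full words, then more tokens, then more characters), and `choose` is a hypothetical hook where the interactive prompt would go.

```python
import re


def quality(value):
    """Heuristic completeness score; higher tuples compare as 'better'.
    Counts full words first, so 'Alan F. Balfour' beats 'A Balfour'."""
    tokens = re.findall(r"\w+", value)
    full_words = sum(1 for t in tokens if len(t) > 1)
    return (full_words, len(tokens), sum(len(t) for t in tokens))


def merge_records(rec_a, rec_b, choose=None):
    """Merge two records column by column.

    choose(column, a, b) is an optional callback (for example an interactive
    prompt) that returns the value to keep, or None to fall back on quality().
    """
    merged = {}
    for column in sorted(set(rec_a) | set(rec_b)):
        a, b = rec_a.get(column, ""), rec_b.get(column, "")
        picked = choose(column, a, b) if choose and a != b else None
        # On a tie, max() keeps the value from rec_a.
        merged[column] = picked if picked is not None else max([a, b], key=quality)
    return merged
```

    Starting with a `choose` that asks about every differing column, and moving cases into `quality` as patterns emerge, would follow the approach described above.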

    Added: As I understand it, Bayesian filters learn from experience. Maybe you could get the filter to look at your decisions above and learn from them. There is potential to wander off into AI and expert systems here.

    --tidiness is the memory loss of environmental mnemonics

      I could cache the results of 5. When doing a match, the first likely match would lead to its known dupes.
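
      Something like the following Python sketch is what I have in mind, with loud caveats: the blocking key (first three letters of a hypothetical `last_name` column) is just a guess at something cheap, and the cache is a plain union-find over record ids, so one confirmed match leads to every dupe already known for it.

```python
from collections import defaultdict


def blocking_key(record):
    """Hypothetical cheap key; only records sharing it get compared in full."""
    return record.get("last_name", "").strip().lower()[:3]


def build_blocks(records):
    """Group record ids by blocking key. records maps id -> dict of columns."""
    blocks = defaultdict(list)
    for rid, record in records.items():
        blocks[blocking_key(record)].append(rid)
    return blocks


def candidates(rid, records, blocks):
    """Step 5: the reduced list of ids worth matching against record rid."""
    return [other for other in blocks[blocking_key(records[rid])] if other != rid]


class DupeCache:
    """Union-find over record ids that remembers confirmed duplicates."""

    def __init__(self):
        self.parent = {}

    def find(self, rid):
        self.parent.setdefault(rid, rid)
        while self.parent[rid] != rid:
            self.parent[rid] = self.parent[self.parent[rid]]  # path halving
            rid = self.parent[rid]
        return rid

    def union(self, a, b):
        """Record that a and b were confirmed to be the same entity."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def known_dupes(self, rid):
        """All other ids already known to be duplicates of rid."""
        root = self.find(rid)
        return sorted(r for r in list(self.parent) if r != rid and self.find(r) == root)
```

      The obvious weakness is that a typo in the first three letters of the last name puts a record in the wrong block, so a second key on some other column would probably be needed.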

      EvdB said:
      Added: As I understand it, Bayesian filters learn from experience. Maybe you could get the filter to look at your decisions above and learn from them. There is potential to wander off into AI and expert systems here.

      That wouldn't be so bad ...

      EvdB said:
      Depending on how many addresses you have and whether this is a one-off, you might be well off doing 4 yourself. You could write a script that would do 1, 2 and 3 and then interactively ask you whether to merge and what to merge.

      Didn't catch that at first. Actually, I'd want to run it daily ... no hourly ... make that as often as I can.