What we get is very unpredictable.

OTOH, we can get reams of seemingly unrelated data, such as IP addresses, that might help figure out what's happening.

It is also difficult to distinguish and search for "similar" addresses. I thought Bayesian filters or fuzzy logic might provide a means to that end.

The address data are entered by users via a web form.

Depending on whether they clean out their cookies, use a different computer, or on the phase of the moon, they may or may not end up creating a duplicate address record.

Even having the same email wouldn't necessarily make 2 records the same person. In our database, for example, we have 50 husband-wife pairs sharing the same email address. With one pair, their records differ in only two respects: a different first name and a different CV.

After thinking about it most of the night, what I think I need to accomplish is:

  1. develop a way of measuring how similar two strings are to each other, where 0 means nothing in common and 1 means identical.
  2. tweak the above a bit for different columns ... 123 Main Street and 123 Elm Street are absolutely different addresses, even though most of their characters match
  3. work out what the level of similarity of each column across 2 different rows says about how similar the records are as a whole
  4. develop a metric for measuring the quality of data so that I can select "Alan F. Balfour" over "A Balfour" when I merge the data
  5. figure out how to cut down the number of candidates my algorithm attempts to match a particular row against
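
To make steps 1 through 4 concrete, here is a minimal sketch in Python, using only the standard `difflib` module. The column names (`first_name`, `last_name`, `street`, `email`), the weights, and the list of street suffix words are all assumptions made up for illustration, and the quality score is a deliberately crude "more characters filled in is better" heuristic rather than a tested metric.

```python
from difflib import SequenceMatcher

# Hypothetical list of generic street words to ignore when comparing names.
STREET_WORDS = {"street", "st", "road", "rd", "avenue", "ave"}

# Hypothetical per-column weights; email is weighted low because
# spouses may share one address.
WEIGHTS = {"first_name": 3, "last_name": 3, "street": 3, "email": 1}


def similarity(a, b):
    """Step 1: score in [0, 1]; 0 = nothing in common, 1 = identical."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def street_similarity(a, b):
    """Step 2: compare house numbers exactly, then only the street name."""
    ta = a.lower().replace(".", "").split()
    tb = b.lower().replace(".", "").split()
    if [t for t in ta if t.isdigit()] != [t for t in tb if t.isdigit()]:
        return 0.0  # different house numbers: treat as different addresses
    name_a = " ".join(t for t in ta if not t.isdigit() and t not in STREET_WORDS)
    name_b = " ".join(t for t in tb if not t.isdigit() and t not in STREET_WORDS)
    return similarity(name_a, name_b)


def record_similarity(r1, r2):
    """Step 3: weighted average of the per-column similarities."""
    total = 0.0
    for col, weight in WEIGHTS.items():
        compare = street_similarity if col == "street" else similarity
        total += weight * compare(r1.get(col, ""), r2.get(col, ""))
    return total / sum(WEIGHTS.values())


def quality(record):
    """Step 4: crude completeness score, the total length of all fields."""
    return sum(len(str(value).strip()) for value in record.values())


def better_record(r1, r2):
    """Pick the more complete of two records judged to be duplicates."""
    return r1 if quality(r1) >= quality(r2) else r2
```

With the plain string measure, "123 Main Street" and "123 Elm Street" look very alike because they share the house number and the word "Street"; the column-specific version compares only "main" against "elm" and scores them much lower. Where to put the cut-off between "duplicate" and "different person" would have to be tuned against real data.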

The last looks tricky to me; I might be able to live without it.
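
One common way to approach that last step is "blocking": group the rows by a cheap key and only run the expensive comparison on rows that land in the same group. The sketch below is in Python and assumes each row is a dict with hypothetical `last_name` and `zip` fields; the choice of key (first letter of the last name plus postal code) is only an example.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record):
    """Cheap grouping key: first letter of the last name plus postal code."""
    last = record.get("last_name", "").strip().lower()
    zip_code = record.get("zip", "").strip()
    return (last[:1], zip_code)


def candidate_pairs(records):
    """Yield index pairs (i, j) of rows that share a block key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    for members in blocks.values():
        # Only rows inside the same block are ever compared to each other.
        yield from combinations(members, 2)
```

The trade-off is that a typo in whatever field feeds the key hides a duplicate pair completely, so the key should be built from the fields that are least likely to be mistyped, or several different keys can be tried in separate passes.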


In reply to Re: Re: Merge/Purge address data by cleverett
in thread Merge/Purge address data by cleverett
