It's very unpredictable what we get.
OTOH, we can get reams of seemingly unrelated data, such as IP addresses, which might help figure out what's happening.
And it's difficult to distinguish and search for "similar"
addresses. I thought Bayesian filters or fuzzy logic might provide a means to that end.
The address data are left by users via a web form.
Depending on whether they clean out their cookies, use a different computer, or the phase of the moon, they may or may not end up creating a duplicate address record.
Even having the same email wouldn't necessarily make 2 records the same person. In our database, we have 50 husband-wife pairs sharing the same email address, for
example. With one pair, their records differ in only
two respects: a different first name and a different CV.
After thinking about it most of the night, what I think I need to accomplish is:
- develop a way of measuring how similar two strings are to each other, where 0 means nothing in common and 1 means identity.
- tweak the above a bit for different columns ... 123 Main Street and 123 Elm Street are absolutely different addresses
- work out what the similarity of each column in 2 different rows says about how similar the records are as a whole
- develop a metric for measuring the quality of data so that I can select "Alan F. Balfour" over "A Balfour" when I merge the data
- figure out how to cut down the number of candidates my algorithm attempts to match a particular row against
The last one looks tricky to me, but I might be able to live without it.
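To make the first three points concrete, here's a rough sketch in Python of what I have in mind, using the stdlib's difflib for the base 0-to-1 score. The column names, weights, and the token-wise address tweak are all placeholders I'd need to tune against known duplicates, not a finished design:

```python
# Sketch: 0..1 string similarity plus per-column weighting.
# Column names and weights are illustrative assumptions, not real schema.
from difflib import SequenceMatcher


def similarity(a, b):
    """Return 0.0 (nothing in common) .. 1.0 (identical), case-insensitive."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def street_similarity(a, b):
    """Address tweak: compare token by token, so one different word
    ("Main" vs "Elm") drags the score down more than raw character
    overlap would."""
    ta, tb = a.lower().split(), b.lower().split()
    n = max(len(ta), len(tb))
    if n == 0:
        return 1.0
    return sum(1.0 for x, y in zip(ta, tb) if x == y) / n


# Hypothetical weights; the real values need tuning against known dupes.
WEIGHTS = {"first_name": 1.0, "last_name": 2.0, "email": 3.0, "street": 2.0}
SCORERS = {"street": street_similarity}


def record_similarity(row_a, row_b):
    """Weighted average of column similarities for two row dicts."""
    total = score = 0.0
    for col, w in WEIGHTS.items():
        fn = SCORERS.get(col, similarity)
        score += w * fn(row_a.get(col, ""), row_b.get(col, ""))
        total += w
    return score / total
```

With this, "123 Main Street" vs "123 Elm Street" scores 2/3 on the street column (two of three tokens match), noticeably lower than the raw character-level ratio would give.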
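For the candidate-reduction point, the standard trick seems to be "blocking": only compare rows that share a cheap key, so you never look at most of the n*(n-1)/2 pairs. A minimal sketch, assuming a surname-prefix key (the key choice is a guess and would need experimenting):

```python
# Blocking sketch: group rows by a cheap key and only compare within groups.
# The key function here (surname prefix) is an illustrative assumption.
from collections import defaultdict


def block_key(row):
    """Crude blocking key: first three letters of the surname, lowercased.
    Rows with different keys are never compared at all."""
    return row.get("last_name", "").lower()[:3]


def candidate_pairs(rows):
    """Yield only within-block pairs instead of every possible pair."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[block_key(row)].append(row)
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                yield group[i], group[j]
```

The risk, of course, is that a typo in the blocking column ("Balfour" vs "Ballfour" survives, "Balfour" vs "Palfour" doesn't) hides a true duplicate, which is part of why this step looks tricky to me.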