Are there any other practical algorithms I can use to find potential dupes in a large data set?
With any of the purely algorithmic string-similarity measures, you will have to decide, by some means, upon a 'breakpoint' above (or below) which your code will accept or reject a pair of items as duplicates.
And wherever you set that breakpoint, it will produce some false positives and some false negatives. This is inevitable.
For example, depending upon how you perform the similarity tests -- e.g. element by element, or entire address string against entire address string -- in my home town these three addresses: "11, New Street, Town, POST CODE"; "11, New St, Town, POST CODE"; and "11, New Road, Town, POST CODE"; might all be algorithmically judged very likely to be duplicates. Two of them are; one is not. Worse, depending on the method, the false match may be judged more similar than the true duplicate.
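To see the problem concretely, here is a rough sketch using Python's standard-library difflib to score those three addresses against each other as whole strings (the exact numbers are specific to this measure; any edit-distance variant will show the same clustering effect):

```python
from difflib import SequenceMatcher

a = "11, New Street, Town, POST CODE"
b = "11, New St, Town, POST CODE"    # true duplicate of a
c = "11, New Road, Town, POST CODE"  # a different street entirely

def similarity(x, y):
    """Whole-string similarity ratio in [0, 1]."""
    return SequenceMatcher(None, x, y).ratio()

for x, y in ((a, b), (a, c), (b, c)):
    print(f"{x!r} vs {y!r}: {similarity(x, y):.3f}")
```

All three pairs score high (above 0.8) and close together, so the usable window for a breakpoint is narrow and shifts with every formatting quirk in the data -- which is the trap described above.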
And therein lies your problem. Make your breakpoint too sensitive and you'll be rejecting 'good' addresses as duplicates; set it too far the other way and you'll never reject anything.
Three possible approaches come to mind:
- You might run a few trials on a sample of your data, adjusting the breakpoint manually until, by manual inspection, you are achieving a level of deduplication commensurate with your needs.
Then run it automatically on the rest of your data and hope that your sample was representative.
- Set the sensitivity quite high, and then present the "duplicates" for manual verification before removal.
- Train a Neural Net to make the decisions.
The up-front costs of setting up and then training a NN would probably be quite high but, done well, they can become very effective and reusable.
One tip from experience: after training, have the NN throw (say) 1 in every thousand 'close calls' it encounters out to a human being for final judgement. That way the NN gains additional training material and gets better over time.
It's also worth keeping 'rejects' from the first pass and feeding them back into the mix on subsequent passes to see how things evolve.
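That close-call routing can be as simple as a band around the breakpoint plus random sampling. A minimal sketch -- the 0.90 breakpoint, band width, and 1-in-1000 rate are all assumptions to tune against your own data:

```python
import random

BREAKPOINT = 0.90    # accept as duplicate at or above this (assumed)
CLOSE_BAND = 0.05    # scores within +/- this of the breakpoint are 'close calls'
SAMPLE_RATE = 0.001  # roughly 1 in every thousand close calls goes to a human

def route(score, rng=random):
    """Decide what to do with one scored pair of addresses."""
    if abs(score - BREAKPOINT) <= CLOSE_BAND and rng.random() < SAMPLE_RATE:
        return "human"      # final judgement, and new training material
    if score >= BREAKPOINT:
        return "duplicate"  # auto-accept as a dupe
    return "keep"           # a 'reject' -- retain it for the next pass
```

The human-judged pairs double as labelled examples for retraining, and the "keep" pile is exactly the set worth re-feeding on later passes.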
In the end, there is no substitute for the human brain in making this kind of decision. If the data set is valuable enough, and likely to be reused many times, the manual route may be the cheapest option.
If you are going to have to repeat the process many times over, then the Neural Net approach would likely be the most cost-effective in the long run.
Hmmm ... well, probably not faster, but at least more accurate ...
If they're US addresses (and the zip code makes me think so), you could use the USPS
web service for this. There are limits (5 requests per transaction) and it's going to be slow -- but at least
the results will be correct (especially if your goal is to *use* the address data to send mail!).
Once the addresses are standardized, I would then create a new table where contact_name
is not part of the unique constraint and see what happens when you load the data. If it appears
the names are misspelled, truncated, or typo'd, well, then your biggest problem is which one
to choose. If there are multiple distinct names per address and you wish to keep those, then I would
add them back in *after* the initial load (and after altering the table to put contact_name back
into the unique constraint).
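A minimal sketch of that load-and-see step, using SQLite and an assumed two-column schema (the real contacts table will have more fields, and the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unique constraint on the address alone -- contact_name is deliberately
# left out, so rows sharing an address collide at load time.
cur.execute("""
    CREATE TABLE contacts (
        contact_name TEXT,
        address      TEXT,
        UNIQUE (address)
    )
""")

rows = [
    ("J. Smith",  "11 New St, Town, 12345"),
    ("Jon Smith", "11 New St, Town, 12345"),  # same address, name variant
    ("A. Jones",  "12 New St, Town, 12345"),
]

loaded, rejected = 0, []
for name, addr in rows:
    try:
        cur.execute("INSERT INTO contacts VALUES (?, ?)", (name, addr))
        loaded += 1
    except sqlite3.IntegrityError:
        rejected.append((name, addr))  # candidate duplicate to inspect

print(loaded, "loaded;", len(rejected), "rejected as duplicates")
```

The rejected list is then the pile you inspect by hand to decide which name to keep per address.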