in reply to Question: practical way to find dupes in a large dataset

are there any other algorithms that are practical and I can use to find potential dupes from a large data set.

With any of the purely algorithmic string-similarity measures, you will have to decide, by some means, upon a 'breakpoint' above (or below) which your code will accept or reject a pair of items as duplicates.

And wherever you set that breakpoint, it will produce some level of false positives and some level of false negatives. This is inevitable.

For example, depending upon how you perform the similarity tests--e.g. element by element, or entire address string against entire address string--these three addresses from my home town: "11, New Street, Town, POST CODE", "11, New St, Town, POST CODE" and "11, New Road, Town, POST CODE", might all be algorithmically judged to be very likely duplicates. Two of them are; one is not. And the non-duplicate pair may well be judged more similar than the true duplicate pair.

And therein lies your problem. Pick your breakpoint to be too sensitive and you'll be rejecting 'good' addresses as duplicates; pick it too far the other way and you'll never reject anything.
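
To make that concrete, here is a minimal pure-Perl sketch that scores those three addresses against one another using a plain Levenshtein edit distance normalised by string length (a CPAN module such as Text::Levenshtein would serve just as well), then applies an arbitrary breakpoint of 0.85. Both the metric and the breakpoint are illustrative assumptions, not recommendations. Note that the genuine duplicate pair and the "New St"/"New Road" pair are both exactly 4 edits apart, so any breakpoint applied to the raw distance treats them identically.

  #!/usr/bin/perl
  use strict;
  use warnings;
  use List::Util qw( min max );

  # Plain Levenshtein edit distance in pure Perl (two-row dynamic programming).
  sub edit_distance {
      my ( $s, $t ) = @_;
      my @prev = ( 0 .. length $t );
      for my $i ( 1 .. length $s ) {
          my @cur = ( $i );
          for my $j ( 1 .. length $t ) {
              my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
              push @cur, min( $cur[ $j - 1 ] + 1, $prev[ $j ] + 1, $prev[ $j - 1 ] + $cost );
          }
          @prev = @cur;
      }
      return $prev[ -1 ];
  }

  # Normalise to a 0..1 score so a single breakpoint can be applied to pairs of any length.
  sub similarity {
      my ( $s, $t ) = @_;
      my $longest = max( length $s, length $t ) || 1;
      return 1 - edit_distance( $s, $t ) / $longest;
  }

  my @addresses = (
      '11, New Street, Town, POST CODE',
      '11, New St, Town, POST CODE',
      '11, New Road, Town, POST CODE',
  );

  my $breakpoint = 0.85;    # entirely arbitrary; this is the value you have to tune

  for my $i ( 0 .. $#addresses - 1 ) {
      for my $j ( $i + 1 .. $#addresses ) {
          my ( $this, $that ) = @addresses[ $i, $j ];
          my $dist  = edit_distance( $this, $that );
          my $score = similarity( $this, $that );
          printf "%-32s | %-32s  dist %2d  score %.3f  %s\n",
              $this, $that, $dist, $score,
              $score >= $breakpoint ? 'flagged as duplicate' : 'kept as distinct';
      }
  }

Run against real data, the interesting part is watching which pairs land on which side of the breakpoint as you nudge it up and down.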

Three possible approaches come to mind:

  1. You might run a few trials on a sample of your data, adjusting the breakpoint until, by manual inspection, you are achieving a level of deduplication commensurate with your needs.

    And then run it automatically on the rest of your data and hope that your sample was a good one.

  2. Set the sensitivity quite high, and then present the "duplicates" for manual verification before removal; there is a minimal sketch of this after the list.
  3. Train a Neural Net to make the decisions.

    The up-front costs of setting up and then training an NN would probably be quite high, but done well, it can become very effective and reusable.

    One tip from experience: after training, have the NN throw (say) 1 in every thousand 'close calls' it encounters out to a human being for final judgement. In this way, the NN gets additional training material and so gets better over time.

    It's also worth keeping 'rejects' from the first pass and feeding them back into the mix on subsequent passes to see how things evolve.
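
By way of illustration of option 2 (and of the 'close calls to a human' tip above), here is a continuation of the earlier sketch, reusing its similarity() routine and @addresses list: two cut-offs instead of one, so that only the very confident matches are flagged automatically and the grey zone is queued for manual verification. The 0.95 and 0.70 values are made up and would need tuning against your own data.

  # Continues from the previous snippet: assumes its similarity() and @addresses.
  # Both cut-off values below are illustrative assumptions.
  my $auto_dupe = 0.95;    # at or above this: flag as duplicate without asking
  my $review_at = 0.70;    # between this and $auto_dupe: queue for a human

  my ( @flagged, @review );

  # For a genuinely large set you would not score every pair; you would only
  # compare records that share some blocking key (the postcode, say).
  for my $i ( 0 .. $#addresses - 1 ) {
      for my $j ( $i + 1 .. $#addresses ) {
          my $score = similarity( @addresses[ $i, $j ] );
          if    ( $score >= $auto_dupe ) { push @flagged, [ $i, $j, $score ] }
          elsif ( $score >= $review_at ) { push @review,  [ $i, $j, $score ] }
          # anything below $review_at is left alone
      }
  }

  printf "%d pair(s) auto-flagged, %d pair(s) queued for manual verification\n",
      scalar @flagged, scalar @review;

  for my $pair ( @review ) {
      printf "  review: %s  <->  %s  (score %.3f)\n",
          $addresses[ $pair->[0] ], $addresses[ $pair->[1] ], $pair->[2];
  }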

In the end, there is no substitute for the human brain in making this kind of decision. If the data set is valuable enough, and likely to be reused many times, the manual route may be the cheapest option.

If you are going to have to repeat the process many times over, then the Neural Net approach would likely be the most cost-effective in the long run.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.