are there any other algorithms that are practical and I can use to find potential dupes from a large data set.

With any of the purely algorithmic string-similarity measures, you will have to decide, by some means, upon a 'breakpoint' above (or below) which your code will accept or reject a pair of items as duplicates.

And wherever you set that breakpoint, it will produce some level of false positives and some level of false negatives. This is inevitable.

For example, depending upon how you perform the similarity tests (e.g. element by element, or entire address string against entire address string), in my home town these three addresses: "11, New Street, Town, POST CODE"; "11, New St, Town, POST CODE"; and "11, New Road, Town, POST CODE"; might all be algorithmically judged to be very likely duplicates. Two of them are; one is not. Despite that, the non-duplicate would likely be judged more similar than the true duplicate.

And therein lies your problem. If you set your breakpoint too sensitively, you'll be rejecting 'good' addresses as duplicates. Set it too far the other way and you'll never reject anything.
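To make that concrete, here is a minimal, self-contained sketch (pure Perl plus the core List::Util module) that scores those three example addresses with a normalised Levenshtein similarity and applies a breakpoint of 0.85. Both the measure and the 0.85 figure are illustrations only, not recommendations; the point is simply that the scores for the true-duplicate pair and the non-duplicate pairs land close together, so wherever you put the breakpoint you will make mistakes.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util qw(min);

    # Plain dynamic-programming Levenshtein edit distance.
    sub levenshtein {
        my ( $s, $t ) = @_;
        my @prev = 0 .. length $t;
        for my $i ( 1 .. length $s ) {
            my @cur = ($i);
            for my $j ( 1 .. length $t ) {
                my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
                push @cur, min( $prev[$j] + 1, $cur[-1] + 1, $prev[ $j - 1 ] + $cost );
            }
            @prev = @cur;
        }
        return $prev[-1];
    }

    # Normalised similarity: 1.0 means identical, 0.0 means nothing in common.
    sub similarity {
        my ( $s, $t ) = @_;
        my $len = length($s) > length($t) ? length($s) : length($t);
        return $len ? 1 - levenshtein( $s, $t ) / $len : 1;
    }

    my @addresses = (
        '11, New Street, Town, POST CODE',
        '11, New St, Town, POST CODE',
        '11, New Road, Town, POST CODE',
    );

    my $BREAKPOINT = 0.85;    # arbitrary illustration; this is the value you have to tune

    for my $i ( 0 .. $#addresses ) {
        for my $j ( $i + 1 .. $#addresses ) {
            my $score = similarity( $addresses[$i], $addresses[$j] );
            printf "%-32s | %-32s | %.3f | %s\n",
                $addresses[$i], $addresses[$j], $score,
                $score >= $BREAKPOINT ? 'flagged as duplicate' : 'kept as distinct';
        }
    }

Swap in whatever similarity measure you actually use (per-element comparisons, soundex on the street name, whatever); the shape of the problem stays the same.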

Three possible approaches come to mind:

  1. You might run a few trials on a sample of your data, adjusting the breakpoint manually until, by inspection, you are achieving a level of deduplication commensurate with your needs (there is a sketch of such a sweep after this list).

    And then run it automatically on the rest of your data and hope that your sample was a good one.

  2. Set the sensitivity quite high, and then present the "duplicates" for manual verification before removal.
  3. Train a Neural Net to make the decisions.

    The up-front cost of setting up and then training a NN would probably be quite high, but done well, it can become very effective and reusable.

    One tip from experience: after training, have the NN pass (say) 1 in every thousand of the 'close calls' it encounters to a human being for final judgement. In this way, the NN gets additional training material and so gets better over time.

    It's also worth keeping the 'rejects' from the first pass and feeding them back into the mix on subsequent passes to see how things evolve.
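For option 1, the trial runs can be as simple as sweeping a range of candidate breakpoints over a hand-labelled sample and counting the mistakes each one makes. The sketch below assumes the similarity() sub from the earlier example is in scope, and the sample rows shown are made up; substitute pairs drawn and labelled from your own data.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Assumes the similarity() sub from the earlier sketch is available here.

    # Hand-labelled sample pairs: [ string A, string B, is-duplicate? ].
    # The labels are the product of your manual inspection; these rows are invented.
    my @sample = (
        [ '11, New Street, Town, POST CODE', '11, New St, Town, POST CODE',   1 ],
        [ '11, New Street, Town, POST CODE', '11, New Road, Town, POST CODE', 0 ],
        [ '7 High Street, Town, POST CODE',  '7 High St., Town, POST CODE',   1 ],
        # ... more labelled pairs drawn from your own data ...
    );

    # Try breakpoints from 0.50 to 0.95 in steps of 0.05 and report the errors.
    for my $step ( 0 .. 9 ) {
        my $bp = 0.50 + 0.05 * $step;
        my ( $false_pos, $false_neg ) = ( 0, 0 );
        for my $pair (@sample) {
            my ( $x, $y, $is_dup ) = @$pair;
            my $flagged = similarity( $x, $y ) >= $bp ? 1 : 0;
            $false_pos++ if $flagged  && !$is_dup;
            $false_neg++ if !$flagged && $is_dup;
        }
        printf "breakpoint %.2f : %2d false positives, %2d false negatives\n",
            $bp, $false_pos, $false_neg;
    }

The sample pairs need to be generated the same way your production pairs will be (element by element, or whole string against whole string), and there need to be enough of them that the counts mean something.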

In the end, there is no substitute for the human brain in making this kind of decision. If the data set is valuable enough, and likely to be reused many times, the manual route may be the cheapest option.

If you are going to have to repeat the process many times over, then the Neural Net approach would likely be the most cost-effective in the long run.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
