Question: practical way to find dupes in a large dataset

by lihao (Monk)
on Dec 21, 2010 at 22:58 UTC ( [id://878369] )

lihao has asked for the wisdom of the Perl Monks concerning the following question:

Hello, monks:

I have a MySQL table containing 20M+ mailing addresses (2GB in size; the original data came from an outside source and contain a lot of junk). I need to filter out as many duplicate records as I can from this table. The duplicates may come from misspellings, typos, truncation, etc. Currently there is one unique key constraint on this table (contact_name, street, street2, city, state), which excludes records that match exactly on the combination of these fields, so dupes caught by exact matching are not a problem.

I tried String::Approx, using zip_code to pull an array of candidate records for amatch() to check against. This approach is very slow and not practical for my needs.
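
Roughly, what I tried looks like the following (a simplified sketch; the DSN, table, and column names are placeholders for my real schema):

    use strict;
    use warnings;
    use DBI;
    use String::Approx 'amatch';

    my $dbh = DBI->connect( 'dbi:mysql:contacts_db', 'user', 'pass',
                            { RaiseError => 1 } );

    # block on zip_code, then fuzzy-match every pair within a block
    my $zips = $dbh->selectcol_arrayref('SELECT DISTINCT zip_code FROM contacts');

    for my $zip (@$zips) {
        my $rows = $dbh->selectall_arrayref(
            'SELECT id, CONCAT_WS(" ", contact_name, street, street2, city, state)
               FROM contacts WHERE zip_code = ?',
            undef, $zip,
        );

        for my $i ( 0 .. $#$rows ) {
            for my $j ( $i + 1 .. $#$rows ) {
                # case-insensitive, allow edits up to ~10% of the pattern length
                my @hit = amatch( $rows->[$i][1], [ 'i', '10%' ], $rows->[$j][1] );
                print "possible dupe: $rows->[$i][0] <-> $rows->[$j][0]\n" if @hit;
            }
        }
    }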

Besides Soundex, are there any other algorithms that are practical and that I can use to find potential dupes in a large data set?

Many thanks

lihao


Replies are listed 'Best First'.
Re: Question: practical way to find dupes in a large dataset
by BrowserUk (Patriarch) on Dec 22, 2010 at 10:54 UTC
    are there any other algorithms that are practical and that I can use to find potential dupes in a large data set?

    With any of the purely algorithmic string-similarity measures, you will have to decide, by some means, on a 'breakpoint' above (or below) which your code will accept or reject a pair of items as duplicates.

    And wherever you set that breakpoint, it will produce some level of false positives and some level of false negatives. This is inevitable.

    For example, depending upon how you perform the similarity tests--eg. element by element, or entire address string against entire address string--in my home town these 3 addresses: "11, New Street, Town, POST CODE"; "11, New St, Town, POST CODE"; & "11, New Road, Town, POST CODE"; might all be algorithmically judged very likely to be duplicates. Two of them are, one is not. And despite that, the latter would likely be judged more similar than the true duplicate.

    And therein lies your problem. If you set your breakpoint too sensitive, then you'll be rejecting 'good' addresses as duplicates. Set it too far the other way and you'll never reject anything.
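
    To make the breakpoint concrete, here is a minimal sketch using String::Similarity from CPAN (just one of many possible measures; the 0.85 threshold is an arbitrary illustration, not a recommendation):

        use strict;
        use warnings;
        use String::Similarity;    # similarity() returns a score between 0 and 1

        my @addresses = (
            '11, New Street, Town, POST CODE',
            '11, New St, Town, POST CODE',
            '11, New Road, Town, POST CODE',
        );

        my $breakpoint = 0.85;     # the knob you have to tune

        for my $i ( 0 .. $#addresses ) {
            for my $j ( $i + 1 .. $#addresses ) {
                my $score = similarity( $addresses[$i], $addresses[$j] );
                printf "%.3f  '%s' vs '%s'\n", $score, $addresses[$i], $addresses[$j];
                print "       -> flagged as duplicates\n" if $score >= $breakpoint;
            }
        }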

    Three possible approaches come to mind:

    1. You might run a few trials on a sample of your data, adjusting the breakpoint manually until, by manual inspection, you are achieving a level of deduplication commensurate with your needs.

      And then run it automatically on the rest of your data and hope that your sample was a good one.

    2. Set the sensitivity quite high, and then present the "duplicates" for manual verification before removal.
    3. Train a Neural Net to make the decisions.

      The up-front costs of setting up and then training a NN would probably be quite high, but done well, they can become very effective and reusable.

      One tip from experience. After training, have the NN throw (say) 1 in every thousand 'close calls' it encounters out to a human being for final judgement (see the rough sketch after this list). In this way, the NN gets additional training material and so gets better over time.

      It's also worth keeping 'rejects' from the first pass and feeding them back in to the mix at subsequent passes to see how things evolve.
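
    As a rough sketch of that 'close call' gating (classify(), ask_human(), record_training_example() and the score bands here are hypothetical placeholders, not a real implementation):

        # route roughly 1 in every 1000 close calls to a human reviewer
        sub judge_pair {
            my ( $rec_a, $rec_b ) = @_;
            my $score = classify( $rec_a, $rec_b );      # e.g. NN output between 0 and 1

            return 'dupe'   if $score > 0.9;             # confident either way:
            return 'unique' if $score < 0.1;             # no human needed

            if ( rand() < 0.001 ) {                      # 1-in-1000 of the close calls
                my $verdict = ask_human( $rec_a, $rec_b );
                record_training_example( $rec_a, $rec_b, $verdict );
                return $verdict;
            }
            return $score >= 0.5 ? 'dupe' : 'unique';
        }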

    In the end, there is no substitute for the human brain in making this kind of decision. If the data set is valuable enough, and likely to be reused many times, the manual route may be the cheapest option.

    If you are going to have to repeat the process many times, then the Neural Net approach would likely be the most cost-effective in the long run.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Question: practical way to find dupes in a large dataset
by derby (Abbot) on Dec 22, 2010 at 12:10 UTC

    Hmmm ... well probably not faster but at least more accurate ...

    If they're US addresses (and the zip code makes me think so), you could use the USPS web service for this. There are limits (5 requests per transaction) and it's going to be slow -- but at least the addresses will be correct (especially if your goal is to *use* the address data to send mail!).

    Once the addresses are standardized, I would then create a new table where contact_name is not part of the unique constraint and see what happens when you load the data. If it appears the names are misspelled, truncated, or typoed, well, then your biggest problem is which one to choose. If there are multiple distinct names per address and you wish to keep those, then I would add them back in *after* the initial load (and after altering the table to put contact_name back into the unique constraint).
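
    A rough sketch of that load step, assuming MySQL (the table names, column names, and the uniq_contact index name are placeholders; INSERT IGNORE silently drops rows that collide on the unique key):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:mysql:contacts_db', 'user', 'pass',
                                { RaiseError => 1 } );

        # copy of the table, with contact_name left OUT of the unique key
        $dbh->do('CREATE TABLE contacts_dedup LIKE contacts');
        $dbh->do('ALTER TABLE contacts_dedup
                      DROP INDEX uniq_contact,
                      ADD UNIQUE KEY uniq_addr (street, street2, city, state)');

        # only the first record per (street, street2, city, state) survives
        $dbh->do('INSERT IGNORE INTO contacts_dedup SELECT * FROM contacts');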

    -derby
Re: Question: practical way to find dupes in a large dataset
by Anonymous Monk on Dec 22, 2010 at 05:26 UTC
