Hello, monks:
I have a MySQL table containing 20M+ mailing addresses (about 2GB in size; the original data came from an outsourcer and contain a lot of junk). I need to filter out as many duplicate records as I can from this table; the duplicates come from misspellings, typos, truncation, and so on. There is currently a unique key constraint on (contact_name, street, street2, city, state), which already rejects records that match exactly on that combination of fields, so exact-match dupes are not a problem.
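For reference, the existing constraint looks roughly like this (the table name, DSN, and credentials below are made up for illustration):

    use strict;
    use warnings;
    use DBI;

    # DSN, credentials, and table name are invented for this sketch.
    my $dbh = DBI->connect('dbi:mysql:database=contacts', 'user', 'pass',
                           { RaiseError => 1 });

    # The unique key that already blocks exact duplicates:
    $dbh->do(q{
        ALTER TABLE addresses
            ADD UNIQUE KEY uk_address
                (contact_name, street, street2, city, state)
    });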
I tried String::Approx: I used zip_code to group records into blocks, then had amatch() check each record against the others in its block. This approach is far too slow for my needs; a stripped-down sketch of what I'm doing follows.
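This is only an outline of my current code; the connection details and table/column names are invented, but the shape is the same:

    use strict;
    use warnings;
    use DBI;
    use String::Approx 'amatch';

    my $dbh = DBI->connect('dbi:mysql:database=contacts', 'user', 'pass',
                           { RaiseError => 1 });

    my $zips = $dbh->selectcol_arrayref(
        'SELECT DISTINCT zip_code FROM addresses');

    for my $zip (@$zips) {
        # Pull every record sharing this zip into one in-memory block.
        my $rows = $dbh->selectall_arrayref(
            'SELECT id, CONCAT_WS(" ", contact_name, street, city) AS addr
               FROM addresses WHERE zip_code = ?',
            { Slice => {} }, $zip);

        for my $row (@$rows) {
            # amatch() returns the inputs that approximately match the
            # pattern (10% edit-distance allowance here).  Every record
            # matches itself, so more than one hit means a possible dupe.
            my @hits = amatch($row->{addr}, ['10%'],
                              map { $_->{addr} } @$rows);
            print "possible dupes for id $row->{id}\n" if @hits > 1;
        }
    }

The pairwise amatch() pass is quadratic within each zip block, which I suspect is where all the time goes.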
Besides Soundex, are there any other algorithms that are practical for finding potential dupes in a data set this large?
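(For reference, here is roughly the Soundex-style blocking I would be comparing against: a minimal sketch using Text::Soundex, with made-up sample data.)

    use strict;
    use warnings;
    use Text::Soundex;

    # Made-up sample data: id => contact_name.
    my %records = (
        1 => 'Smith',
        2 => 'Smyth',
        3 => 'Jones',
    );

    # Bucket records by their Soundex code; only records in the same
    # bucket need a detailed pairwise comparison later.
    my %block;
    while (my ($id, $name) = each %records) {
        my $key = soundex($name) // 'NOCODE';  # soundex() can return undef
        push @{ $block{$key} }, $id;
    }

    for my $key (sort keys %block) {
        my @ids = sort @{ $block{$key} };
        print "candidate group $key: @ids\n" if @ids > 1;
    }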
Many thanks
lihao