comment on

Hello, monks:

I have a MySQL table containing about 20M+ mailing addresses (2GB in size, the original data were from outsource and contain lots of craps). I need to filter out as many duplicate records as I can from this table, which might be generated from misspelling, typo and truncating etc. Currently there is one unique key constraint in this table (contact_name, street, street2, city, state) which can exclude records exactly matched in the combination of these fields, so dupes from the exact matching are not a problem.

I tried String::Approx, and used zip_code to find an array of records that amatch() can check against. This approach is very slow and not enough for my needs.

Besides Soundex, are there any other algorithms that are practical and I can use to find potential dupes from a large data set.

Many thanks

lihao

In reply to Question: practical way to find dupes in a large dataset by lihao

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.