
Hmmm ... well probably not faster but at least more accurate ...

If they're US addresses (and the zip code makes me think so), you could use the USPS web service for this. There are limits (5 requests per transaction) and it's going to be slow -- but at least they'll be correct (especially if your goal is to *use* the address data to send mail!).
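To work within a per-transaction limit like that, the addresses can be fed through in small groups. This is only a rough sketch of the batching: `standardize_batch` is a hypothetical placeholder for whatever call you end up making to the address service (it is not a real USPS function), and the batch size of 5 is just the limit mentioned above.

```python
def batches(items, size=5):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # leftover partial batch
        yield batch


def standardize_all(addresses, standardize_batch, size=5):
    """Run every address through `standardize_batch`, `size` at a time.

    `standardize_batch` is a hypothetical callable: it takes a list of
    raw addresses and returns the cleaned-up list in the same order.
    """
    cleaned = []
    for batch in batches(addresses, size):
        cleaned.extend(standardize_batch(batch))
    return cleaned
```

Keeping the service call behind a plain callable means you can test the batching with a stub before pointing it at the real thing.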

Once the addresses are standardized, I would then create a new table where contact_name is not part of the unique constraint and see what happens when you load the data. If it appears the names are misspelled, truncated or typo-ed, well, then your biggest problem is which one to choose. If there are multiple distinct names per address and you wish to keep those, then I would add them back in *after* the initial load (and after altering the table to put contact_name back into the unique constraint).
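Here is a minimal sketch of that first load, using an in-memory SQLite database purely for illustration. The table name `contacts`, the `address` and `zip` columns and the sample rows are all my assumptions; only `contact_name` comes from the discussion. Rows that collide on the address-only constraint are set aside so you can eyeball the names.

```python
import sqlite3

# Made-up sample data: already-standardized addresses, two spellings of one name.
rows = [
    ("Jon Smith", "123 MAIN ST", "12345"),
    ("John Smith", "123 MAIN ST", "12345"),
    ("Mary Jones", "9 OAK AVE", "54321"),
]


def load_without_name(rows):
    """Load rows into a table whose unique constraint ignores contact_name.

    Returns (kept, rejected): the rows that made it in, and the rows
    that hit the unique constraint and need a human decision.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts ("
        " contact_name TEXT, address TEXT, zip TEXT,"
        " UNIQUE (address, zip))"  # contact_name deliberately left out
    )
    rejected = []
    for row in rows:
        try:
            conn.execute("INSERT INTO contacts VALUES (?, ?, ?)", row)
        except sqlite3.IntegrityError:
            rejected.append(row)  # same address as a row already loaded
    kept = conn.execute(
        "SELECT contact_name, address, zip FROM contacts ORDER BY rowid"
    ).fetchall()
    conn.close()
    return kept, rejected


kept, rejected = load_without_name(rows)
```

Whatever lands in `rejected` is your review pile: either a variant spelling of a name you already have, or a genuinely different contact at the same address that you would add back in the second pass.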

-derby

In reply to Re: Question: practical way to find dupes in a large dataset by derby
in thread Question: practical way to find dupes in a large dataset by lihao
