in reply to Merge/Purge address data
I used to do this kind of thing back in the day on mainframes. You need an intermediate step where you break every line into its individual elements. Assuming the majority are US addresses, the name line becomes (prefix, first, middle, last, suffix, postfix), the street line becomes (box, street_name, direction, type, unit_number, unit_type), and the city line becomes (city, state, zip, zip+4). So your examples above (for the street line) become:
123D, main, north, street,,,
123, main, north, street, D, apt,
123, main, north, street, D,,
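Here is a rough sketch of that breakdown step in Python, just to make the idea concrete. The input strings, the `parse_street_line` name and the small lookup tables are my own assumptions for illustration; a real list needs far bigger tables, and this naive tokenizer will misread streets whose name is itself a direction or type word (e.g. "North Street").

```python
from typing import NamedTuple

class StreetLine(NamedTuple):
    box: str
    street_name: str
    direction: str
    type: str
    unit_number: str
    unit_type: str

# Hypothetical, deliberately tiny lookup tables: each known spelling
# maps to the one standard form we want to store.
DIRECTIONS = {"n": "north", "north": "north", "s": "south", "south": "south",
              "e": "east", "east": "east", "w": "west", "west": "west"}
STREET_TYPES = {"st": "street", "street": "street", "ave": "avenue",
                "avenue": "avenue", "ct": "court", "court": "court",
                "cir": "circle", "circle": "circle", "way": "way"}
UNIT_TYPES = {"apt": "apt", "apartment": "apt", "ste": "suite",
              "suite": "suite", "unit": "unit"}

def parse_street_line(line: str) -> StreetLine:
    """Split one street line into its elements (naive, whitespace based)."""
    # Lowercase and drop trailing punctuation so "N." and "n" compare equal.
    tokens = [t.strip(".,").lower() for t in line.split()]
    tokens = [t for t in tokens if t]

    # A leading token that starts with a digit is taken as the box number.
    box = tokens.pop(0).upper() if tokens and tokens[0][0].isdigit() else ""

    direction = street_type = unit_number = unit_type = ""
    name = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in UNIT_TYPES and i + 1 < len(tokens):
            # "apt D": unit type followed by the unit number.
            unit_type = UNIT_TYPES[tok]
            unit_number = tokens[i + 1].lstrip("#").upper()
            i += 2
            continue
        if tok.startswith("#") and len(tok) > 1:
            # "#D": unit number with no stated unit type.
            unit_number = tok[1:].upper()
        elif tok in DIRECTIONS and not direction:
            direction = DIRECTIONS[tok]
        elif tok in STREET_TYPES and not street_type:
            street_type = STREET_TYPES[tok]
        else:
            name.append(tok)
        i += 1

    return StreetLine(box, " ".join(name), direction, street_type,
                      unit_number, unit_type)

if __name__ == "__main__":
    for raw in ("123D N. Main St", "123 N Main St Apt D",
                "123 North Main Street #D"):
        print(", ".join(parse_street_line(raw)))
```

Note that the lookup tables do the standardizing at the same time as the split, so "N.", "N" and "North" all end up stored as "north".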
I added "street" as the type; the types will be street, avenue, court, circle, way, etc. N., N and the other direction abbreviations also need to be standardized, and the same goes for the prefix, suffix and postfix portions of the name. Once all that is done, sort by zip, box, street_name, direction, type, unit_number, unit_type and last (name). After that it is pretty easy to score matches as loose or tight as you want. The USPS publishes addressing standards and guidelines that will help you clean up the list. How good your results are depends on how much effort you put into rooting out the abbreviations.
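The sort-then-score step could look something like the sketch below. It assumes the records are already split and standardized into dicts keyed by the element names above. The weights, the `match_score` and `find_duplicates` names, the sample records and the thresholds are all made up for illustration; tune them to your own data.

```python
# Sort order from the post: zip first, last name as the final tie-breaker.
SORT_FIELDS = ("zip", "box", "street_name", "direction", "type",
               "unit_number", "unit_type", "last")

# Hypothetical weights: how much an exact match on each field is worth.
WEIGHTS = {"zip": 3, "box": 3, "street_name": 3, "direction": 1,
           "type": 1, "unit_number": 2, "unit_type": 1, "last": 2}

def sort_key(rec):
    """Tuple of the sort fields; missing fields sort as empty strings."""
    return tuple(rec.get(f, "") for f in SORT_FIELDS)

def match_score(a, b):
    """Fraction of the total weight carried by fields that match exactly."""
    total = sum(WEIGHTS.values())
    hit = sum(w for f, w in WEIGHTS.items() if a.get(f, "") == b.get(f, ""))
    return hit / total

def find_duplicates(records, threshold):
    """Sort, then return neighbouring pairs scoring at or above threshold."""
    ordered = sorted(records, key=sort_key)
    return [(prev, cur) for prev, cur in zip(ordered, ordered[1:])
            if match_score(prev, cur) >= threshold]

if __name__ == "__main__":
    base = {"zip": "97201", "street_name": "main", "direction": "north",
            "type": "street", "last": "smith"}
    records = [
        dict(base, box="123D", unit_number="", unit_type=""),
        dict(base, box="123", unit_number="D", unit_type="apt"),
        dict(base, box="123", unit_number="D", unit_type=""),
    ]
    print(len(find_duplicates(records, 0.9)), "tight match(es)")
    print(len(find_duplicates(records, 0.6)), "loose match(es)")
```

This only compares each record with its immediate neighbour after the sort, which is the cheap version of the idea. Comparing within a small window of neighbours, or within each zip, catches more at the cost of more comparisons.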