I used to do this kind of stuff back in the day on mainframes. You need an intermediate step where you break every line into its individual elements. Assuming the majority are US addresses, the name line becomes (prefix, first, middle, last, suffix, postfix), the street line becomes (box, street_name, direction, type, unit_number, unit_type), and the city line becomes (city, state, zip, zip+4). So your examples above (for the street line) become:
123D, main, north, street,,,
123, main, north, street, D, apt,
123, main, north, street, D,,
I added "street" as the type; the types will be street, avenue, court, circle, way, etc. N., N and the other direction abbreviations also need to be standardized, and the same goes for the prefix, suffix and postfix portions of the name. Once all that is done, sort by zip, box, street_name, direction, type, unit_number, unit_type, and last (name). Then it is pretty easy to score matches as loose or tight as you want. The USPS publishes addressing standards and guidelines that will help you clean up the list. The quality of the results depends on how much effort you put into rooting out the abbreviations.
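To make the break-down, standardize and score steps concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the lookup tables are tiny hypothetical stand-ins for a real abbreviation list, the function names (parse_street_line, sort_key, match_score) are made up, and the parsing rules only cover street lines shaped like the examples above.

```python
# Hypothetical lookup tables; a real list would be much longer.
DIRECTIONS = {"n": "north", "north": "north", "s": "south", "south": "south",
              "e": "east", "east": "east", "w": "west", "west": "west"}
STREET_TYPES = {"st": "street", "street": "street", "ave": "avenue",
                "avenue": "avenue", "ct": "court", "court": "court",
                "cir": "circle", "circle": "circle", "way": "way"}
UNIT_TYPES = {"apt": "apt", "apartment": "apt", "unit": "unit",
              "ste": "suite", "suite": "suite"}

FIELDS = ("box", "street_name", "direction", "type", "unit_number", "unit_type")


def parse_street_line(line):
    """Break one street line into standardized elements (simplified rules)."""
    tokens = [t.strip(".,").lower() for t in line.split()]
    tokens = [t for t in tokens if t]
    rec = dict.fromkeys(FIELDS, "")
    # A leading token that starts with a digit is taken as the box number.
    if tokens and tokens[0][0].isdigit():
        rec["box"] = tokens.pop(0).upper()
    name = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in UNIT_TYPES and i + 1 < len(tokens):
            # "apt D": unit type followed by the unit number.
            rec["unit_type"] = UNIT_TYPES[tok]
            rec["unit_number"] = tokens[i + 1].upper()
            i += 2
            continue
        if tok in DIRECTIONS and not rec["direction"]:
            rec["direction"] = DIRECTIONS[tok]
        elif tok in STREET_TYPES and not rec["type"]:
            rec["type"] = STREET_TYPES[tok]
        elif rec["type"] and not rec["unit_number"]:
            # A bare token after the street type is treated as a unit number.
            rec["unit_number"] = tok.upper()
        else:
            name.append(tok)
        i += 1
    rec["street_name"] = " ".join(name)
    return rec


def sort_key(rec):
    """Sort order from the post: zip, the street elements, then last name."""
    return (rec.get("zip", ""), *(rec[f] for f in FIELDS), rec.get("last", ""))


def match_score(a, b):
    """Count how many street elements agree; pick your own threshold."""
    return sum(1 for f in FIELDS if a[f] == b[f])
```

Once the records are sorted with sort_key, likely duplicates end up next to each other, so comparing each record only with its neighbors is usually enough. How high a match_score counts as a match is the "loose or tight" knob.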
In reply to Re: Merge/Purge address data
by gwhite
in thread Merge/Purge address data
by cleverett