I used to do this kind of stuff back in the day on mainframes. You need an intermediate step where you break every line into its individual elements. Assuming the majority are US addresses, the name line would be (prefix, first, middle, last, suffix, postfix), the street line would be broken down into (box, street_name, direction, type, unit_number, unit_type), and the city line into (city, state, zip, zip+4). Your examples above (for the street line) then become:

123D, main, north, street,,,
123, main, north, street, D, apt,
123, main, north, street, D,,

I added street as the type; the types will be street, avenue, court, circle, way, etc. N., N and the other direction abbreviations need to be standardized too, and the same goes for the prefix, suffix and postfix portions of the name. Once all that is done, sort by zip, box, street_name, direction, type, unit_number, unit_type, and last (name). Then it is pretty easy to score matches as loose or tight as you want. The USPS has addressing standards and guidelines which will help you clean up the list. The quality of the results tracks the effort you put into rooting out the abbreviations.
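Here is a rough sketch of the standardize, sort, and score steps for the street line only, in Python. The lookup tables, the function names (clean, standardize, score), and the sample rows are all made up for illustration; a real run would use much longer abbreviation lists (the USPS guidelines are a good source) and would put zip at the front of the sort key and last name at the end.

```python
from collections import namedtuple

# Field order follows the street-line breakdown described above.
Street = namedtuple("Street", "box street_name direction type unit_number unit_type")

# Hypothetical, deliberately tiny lookup tables; real lists are much longer.
DIRECTIONS = {"n": "north", "s": "south", "e": "east", "w": "west"}
TYPES = {"st": "street", "ave": "avenue", "ct": "court", "cir": "circle"}


def clean(token):
    """Lowercase a token and drop surrounding whitespace and periods."""
    return token.strip().strip(".").lower()


def standardize(rec):
    """Return a copy of rec with direction and type spelled out in full."""
    rec = Street(*(clean(f) for f in rec))
    return rec._replace(
        direction=DIRECTIONS.get(rec.direction, rec.direction),
        type=TYPES.get(rec.type, rec.type),
    )


def score(a, b):
    """Count the fields on which two standardized records agree (0 to 6)."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Made-up sample rows, already split into elements.
rows = [
    Street("123", "Main", "N.", "St", "D", "apt"),
    Street("123D", "main", "north", "street", "", ""),
    Street("123", "main", "N", "Street", "D", ""),
]

# Tuples sort field by field, so this orders by box, street_name, direction, ...
cleaned = sorted(standardize(r) for r in rows)

# After sorting, likely duplicates sit next to each other.
for a, b in zip(cleaned, cleaned[1:]):
    print(score(a, b), a, b)
```

A tight match would demand agreement on all six fields; a loose one might accept five, which here pairs the two "123 ... D" rows even though only one of them carries the apt unit type. Catching the "123D" row as well needs an extra rule that splits a trailing letter off the box, which this sketch does not attempt.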

g_White

In reply to Re: Merge/Purge address data by gwhite
in thread Merge/Purge address data by cleverett
