in reply to Merge/Purge address data

I used to do this kind of stuff back in the day on mainframes. You need an intermediate step, where you break all the lines into individual elements. Assuming the majority are US addresses, the name line would be (prefix, first, middle, last, suffix, postfix), the street line would get broken down(box, street_name, direction, type, unit_number, unit_type), city line is broken to (city, state, zip, zip+4), so your examples (for the street line) above become

123D, main, north, street,,,
123, main, north, street, D, apt,
123, main, north, street, D,,

I added street to the type, those will be street, avenue, court, circle, way, etc. Also, N., N and other direction abbreviations would need to be standardized. This applies to the prefix, suffix and postfix portions of the name as well. Once all that is done, sort by zip, box, street_name, direction, type, unit_number, unit_type, and last(name). Then it is pretty easy to score matches as loose or tight as you want. The USPS has addressing standards and guidelines which will help you clean up the list. The results match your effort in rooting out the abbreviations.

g_White

Replies are listed 'Best First'.
•Re: Re: Merge/Purge address data
by merlyn (Sage) on Nov 11, 2003 at 15:33 UTC
    the street line would get broken down(box, street_name, direction, type, unit_number, unit_type)
    Be careful here. In Seattle (and I think DC and a few other areas), the street of "123rd Street NE" is perpendicular to "NE 123rd Street". That is, the "NE" modifier moves after or before the street name to show whether it's a north-south street or an east-west street. Yes, you can live at the intersection of those two streets!

    And, my former business address was 0333 SW Flower St, to distinguish it from 333 SW Flower St. The "0" in front essentially means "negative", so the "03" block is east of the "02" block, then "01" block then "1" block, "2" block, and the "3" block where the other address was, about six blocks west. So you can't just rip leading 0's either.

    So, be careful not to oversimplify.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Realistically, how many addresses does that happen? 500-1000, in each city? Keep in mind that we are looking for duplicates, so all those exceptions would have to happen plus the match of the same last name (at least that would be a strong match point in my calculation), with the worst thing that could happen is that you don't send your pretty newsletter or mail order catalog to one of the two individuals.

      In this instance program for the obvious and let the exceptions fall wherever. It is not worth the programming time or effort for the 100,000 exception addresses in the millions of US addresses.

      g_White
        I'm pretty sure there's a lot more than 1000 houses in the Seattle area. You can't just move the NE designator around. It's not the same address any more.

        So, the position of every piece of it is important. I would be mad if you had "normalized" my 0333 SW Flower address to "333 SW Flower". And yes, it happens, and it's still wrong.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.