in reply to Re: •Re: Re: Merge/Purge address data
in thread Merge/Purge address data

I'm pretty sure there are a lot more than 1000 houses in the Seattle area. You can't just move the NE designator around. It's not the same address any more.

So, the position of every piece of it is important. I would be mad if you had "normalized" my 0333 SW Flower address to "333 SW Flower". And yes, it happens, and it's still wrong.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Re: •Re: Re: •Re: Re: Merge/Purge address data
by gwhite (Friar) on Nov 11, 2003 at 21:29 UTC

    Yes, Randal, there are more than 1000 houses in Seattle, but probably not many more than 1000 whose number and street are exact duplicates except for a leading zero. And I never recommended normalizing the address that is sent to the user. The normalization and extraction are for the merge/purge process only; you _should_ always keep the original input as the actual address you stick on the mailing label.

    g_White
      Oops, you're confusing Portland and Seattle. There are tons of houses in Seattle where moving the NE will break things. In Portland, the number of leading-0 addresses is more like 500 or so, so that's a smaller problem, but still a problem.

      But the real problem is that you cannot coalesce like this. You cannot know that "123D Main" is really the same as "123 Main, Apt D". They might not be. And if you join them, you might break things.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        I am not advocating moving anything. To determine duplicates in a mailing list, break each address into its individual parts and score the matching parts. If the score is above your threshold, assume a duplicate and save one of your original inputs (not the remix of parts).
        You cannot know that "123D Main" is really the same as "123 Main, Apt D".
        You cannot know based on that info alone, but if I also have a matching zip, matching last name, and matching first name, I may _choose_ to say that is a match, especially if I am sending an expensive four-color catalog at 80 cents postage per catalog. If I am sending a presorted one-color postcard at the lowest rate (20-something cents, I think), maybe I choose to say it is not a duplicate.

        g_White
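
The score-and-threshold approach described above could look roughly like the sketch below. It is only an illustration, not anyone's production merge/purge code: the field names, the weights in WEIGHTS, the two threshold values, and the sample records are all assumptions made up for the example. Note that the house number is compared as text, so "0333" and "333" (or "123D" and "123") never count as a match on that field, and the function returns the original records rather than the extracted parts.

```python
import re

# Hypothetical weights per field; tune these for your own list.
WEIGHTS = {"number": 3, "street": 3, "zip": 2, "last": 2, "first": 1}

# Hypothetical thresholds: dedupe aggressively when each piece is expensive
# to mail, conservatively when it is cheap.
CATALOG_THRESHOLD = 0.70
POSTCARD_THRESHOLD = 0.90


def parts(record):
    """Extract comparable parts from a record; the record itself is not changed."""
    number, _, rest = record["address"].strip().partition(" ")
    # Drop a trailing ", Apt D" / "Unit 4" so only the street text is compared.
    street = re.sub(r",?\s*(?:apt|unit)\b\.?\s*\w+\s*$", "", rest, flags=re.I)
    return {
        "number": number.lower(),  # kept as text: "0333" is not "333"
        "street": " ".join(street.lower().split()),
        "zip": record["zip"].strip(),
        "last": record["last"].strip().lower(),
        "first": record["first"].strip().lower(),
    }


def score(a, b):
    """Return the weighted fraction of parts that match exactly (0.0 to 1.0)."""
    pa, pb = parts(a), parts(b)
    matched = sum(
        weight
        for field, weight in WEIGHTS.items()
        if pa[field] and pa[field] == pb[field]
    )
    return matched / sum(WEIGHTS.values())


def dedupe(records, threshold):
    """Keep the first original record of each group scoring at or above threshold."""
    kept = []
    for rec in records:
        if not any(score(rec, k) >= threshold for k in kept):
            kept.append(rec)  # the original input, never the remix of parts
    return kept


a = {"first": "Pat", "last": "Smith", "address": "123D Main", "zip": "98101"}
b = {"first": "Pat", "last": "Smith", "address": "123 Main, Apt D", "zip": "98101"}

catalog_list = dedupe([a, b], CATALOG_THRESHOLD)    # treated as one household
postcard_list = dedupe([a, b], POSTCARD_THRESHOLD)  # both records kept
```

With these made-up weights the pair matches on everything except the house number token, so whether it counts as a duplicate depends entirely on which threshold the mailer chooses. That is the judgment call being argued about in this thread; the code cannot settle whether "123D Main" and "123 Main, Apt D" really are the same place.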