in reply to •Re: Re: •Re: Re: Merge/Purge address data
in thread Merge/Purge address data

Yes Randal, there are more than 1000 houses in Seattle, but probably not many more than 1000 whose address and street are exact duplicates apart from a leading zero. And I never recommended normalizing the address that is sent to the user. The normalization and extraction are for the merge/purge process only; you _should_ always keep the original input as the actual address you stick on the mailing label.
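To make the split between "key used for matching" and "text printed on the label" concrete, here is a minimal sketch in Python. The function name `match_key` and its normalization rules (uppercase, drop punctuation, strip leading zeros) are assumptions made up for illustration, not a recommended rule set. A shared key only flags records as candidates for comparison, and the original strings are never altered.

```python
import re

def match_key(address):
    """Build a comparison-only key. The rules are illustrative assumptions."""
    key = address.upper()
    key = re.sub(r"[^A-Z0-9 ]", " ", key)   # replace punctuation with spaces
    key = re.sub(r"\b0+(?=\d)", "", key)    # strip leading zeros from numbers
    return " ".join(key.split())            # collapse runs of whitespace

# Made-up sample input.
records = ["0123 SW Main St.", "123 SW Main St"]

# Group the untouched originals under their comparison key.
groups = {}
for original in records:
    groups.setdefault(match_key(original), []).append(original)
```

Whatever ends up on the label comes from the lists of originals in `groups`, never from the key.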

g_White

•Re: Re: •Re: Re: •Re: Re: Merge/Purge address data
by merlyn (Sage) on Nov 11, 2003 at 21:36 UTC
    Oops, you're confusing Portland and Seattle. There are tons of houses in Seattle where moving the NE will break things. In Portland, the number of leading-0 addresses is more like 500 or so, so that's a smaller problem, but still a problem.

    But the real problem is that you cannot coalesce like this. You cannot know that "123D Main" is really the same as "123 Main, Apt D". They might not be. And if you join them, you might break things.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      I am not advocating moving anything. To determine duplicates in a mail list, break each address into its individual parts and score the matching parts. If the score is above the threshold, assume a duplicate, and save one of your original inputs (not the remix of parts).
      You cannot know that "123D Main" is really the same as "123 Main, Apt D".
      You cannot know based on that info only, but if I also have a matching zip, matching last name, and matching first name, I may _choose_ to say that is a match, especially if I am sending an expensive four-color catalog at 80 cents postage per catalog. If I am sending a presorted one-color postcard at the lowest rate (20-something cents, I think), maybe I choose to say it is not a duplicate.
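The scoring idea above can be sketched roughly as follows, in Python. The field names, the weights, both thresholds, and the sample records are all hypothetical, chosen only to show how the same pair can count as a duplicate for an expensive mailing and as two separate recipients for a cheap one.

```python
# Hypothetical field weights; real values would be tuned against known duplicates.
WEIGHTS = {"zip": 3, "last": 3, "first": 2, "street": 2, "number": 2, "unit": 1}

# Hypothetical thresholds: a costly catalog tolerates more aggressive merging.
CATALOG_THRESHOLD = 9
POSTCARD_THRESHOLD = 11

def score(a, b):
    """Sum the weights of fields that are non-empty and equal in both records."""
    return sum(w for f, w in WEIGHTS.items() if a.get(f) and a.get(f) == b.get(f))

def is_duplicate(a, b, threshold):
    """Treat the pair as a duplicate only when the score is above the threshold."""
    return score(a, b) > threshold

# Made-up records, already split into parts; "original" is kept untouched.
a = {"first": "PAT", "last": "DOE", "zip": "97201", "street": "MAIN",
     "number": "123D", "unit": "", "original": "123D Main"}
b = {"first": "PAT", "last": "DOE", "zip": "97201", "street": "MAIN",
     "number": "123", "unit": "D", "original": "123 Main, Apt D"}

# Keep one of the original inputs when merging, never a reassembled address.
label = a["original"] if is_duplicate(a, b, CATALOG_THRESHOLD) else None
```

With these invented numbers the pair scores 10: it merges for the catalog run and stays separate for the postcard run. Which way to lean is a business decision about postage cost, not something the data can settle.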

      g_White