in reply to Merge/Purge address data

Before you start doing this, I think you also need to consider the country from which the address info comes. Your examples are typically American. Addresses are written differently within the EU, and differently again in India, Japan and China.

One example I think might trip you up is the title Drs., which in the Netherlands means "Person working on his first doctoral thesis" and in Germany means "Person having completed multiple doctoral theses".

Ok, maybe the example is a bit artificial, but it should give you an indication of what you're getting yourself into.

Good luck! And make it a CPAN module so that others may benefit!

Liz

Update:
I just thought of a typical address in the Netherlands:

Admiralengracht t.o. 281

The "t.o." is Dutch for "tegenover", which means "opposite". It indicates the address of a houseboat moored opposite the house numbered "281" on the "Admiralengracht".
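
To make the point concrete, here is a rough sketch of what handling just this one Dutch quirk might look like. It is in Python only for brevity. The function name, the regular expression and the returned field names are all made up for illustration, and it assumes the street line is simply "street, optional t.o., house number" with nothing after the number, which real data will often violate.

```python
import re

# Hypothetical pattern: street name, optional "t.o." qualifier, house number.
# Assumes nothing follows the house number (no "hs", "II", "bis", etc.).
_STREET_LINE = re.compile(
    r"^(?P<street>.+?)\s+(?P<opposite>t\.o\.\s*)?(?P<number>\d+)$",
    re.IGNORECASE,
)

def parse_dutch_street_line(line):
    """Split a Dutch street line into street, house number and a flag
    saying whether the address is *opposite* that number ("t.o.")."""
    m = _STREET_LINE.match(line.strip())
    if m is None:
        return None  # not in the simple shape this sketch understands
    return {
        "street": m.group("street"),
        "number": int(m.group("number")),
        "opposite": m.group("opposite") is not None,
    }

print(parse_dutch_street_line("Admiralengracht t.o. 281"))
print(parse_dutch_street_line("Admiralengracht 281"))
```

The important bit is the "opposite" flag: for merge/purge purposes "Admiralengracht t.o. 281" and "Admiralengracht 281" are two different addresses, so a normaliser that silently drops the "t.o." would merge a houseboat with the house across from it.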

Re: Re: Merge/Purge address data
by cleverett (Friar) on Nov 11, 2003 at 09:27 UTC

    And you should see what they do in Thailand!

    Not to mention how a naive person from a developing country might write any type of address ...

    I believe I need to limit the problem space to US addresses, rather than solve the problem for the global address space ...

    But as far as I can tell, a large enough training set would enable a Bayesian or fuzzy solution to distinguish 'twixt Dutch and German 'Drs.'
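
A toy sketch of the Bayesian idea, in Python for brevity: a naive Bayes classifier with add-one smoothing that guesses the country from the other tokens on the address, so that the meaning of 'Drs.' can be picked accordingly. The function names and the four training addresses are invented purely for illustration; a real training set would need to be far larger, and nothing here is tuned.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Crude whitespace tokenizer; a real one would handle punctuation.
    return text.lower().split()

def train(examples):
    """examples: iterable of (label, text). Returns (doc_counts, word_counts)."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, text in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return doc_counts, word_counts

def classify(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    doc_counts, word_counts = model
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            # Add-one (Laplace) smoothing so unseen tokens do not zero out.
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data, for illustration only.
model = train([
    ("NL", "Drs. J. de Vries Admiralengracht 281 Amsterdam"),
    ("NL", "Drs. P. Jansen Keizersgracht 12 Amsterdam"),
    ("DE", "Drs. H. Müller Hauptstraße 5 Berlin"),
    ("DE", "Drs. K. Schmidt Bahnhofstraße 9 München"),
])
print(classify(model, "Drs. A. Bakker Prinsengracht 7 Amsterdam"))
```

Note that 'Drs.' itself carries no signal here, since it appears equally often in both classes; it is the surrounding tokens (city, street words) that tip the balance, which is why the size and quality of the training set matter so much.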

    PS. How uncanny! My younger sister has a linguistics doctorate, her first language is German, her second Dutch!