in reply to Parse mailing addresses with a regex

I strongly recommend getting a database. It will make life a lot easier in the end. Otherwise, any hack you come up with for this will end up broken.

However, if the data is very uniform, you can just wildcard the name and rely on everything else to lock down the position. Something like this untested RE: /^(\d+)\s+(.*?)\s+(\d+.*?)\s+(\w\w)\s+(\d{3}-\d{3}-\S+)\s*(.*)/ The first capture will be the customer code, then the name, then the street address, then the state, then the telephone number (with allowance for extensions, as in 223-456-1234x56), then the comment.
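
A minimal sketch of how that RE might be applied, assuming one record per line. The pattern is the one above, only spread out with /x so each capture can be labelled. The sample line, the parse_record name, and the hash keys are made up for illustration.

```perl
use strict;
use warnings;

# Same pattern as in the post, written with /x so each capture is labelled.
my $record_re = qr/
    ^(\d+)                    # customer code
    \s+(.*?)                  # name, wildcarded
    \s+(\d+.*?)               # street address, starting with a house number
    \s+(\w\w)                 # two-letter state
    \s+(\d{3}-\d{3}-\S+)      # telephone number, extension allowed
    \s*(.*)                   # comment
/x;

# Returns a hashref of fields, or undef if the line does not fit the layout.
# (parse_record and the key names are hypothetical, not from the original post.)
sub parse_record {
    my ($line) = @_;
    my @fields = $line =~ $record_re or return;
    my %rec;
    @rec{qw(code name street state phone comment)} = @fields;
    return \%rec;
}

# Hypothetical sample line in the assumed layout.
my $rec = parse_record('1001 John Smith 12 Main St PA 223-456-1234x56 call after 5pm');
print "$rec->{name} | $rec->{street} | $rec->{state}\n" if $rec;
# prints: John Smith | 12 Main St | PA
```

Note that the name is found only because the house number and the two-letter state pin down where it ends. A name containing digits, or a street address without a leading number, will not split the way you want.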

Looking at that again, a database would be far preferable. (If you don't do that, then add some validation checks, because the data WILL be entered badly, and that will be a constant battle.)
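
The validation checks could look something like the sketch below. It assumes each record has already been split into a hash with code, name, state and phone keys. Those key names, the record_problems name, and the exact phone format are assumptions, not anything from the original data. The state list is the 50 US states plus DC.

```perl
use strict;
use warnings;

# Two-letter postal abbreviations for the 50 US states plus DC.
my %is_state = map { $_ => 1 } qw(
    AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD
    MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC
    SD TN TX UT VT VA WA WV WI WY DC
);

# Returns a list of human-readable problems; an empty list means the record
# passed every check. (Function name and hash keys are hypothetical.)
sub record_problems {
    my ($rec) = @_;
    my $code  = $rec->{code}  // '';
    my $name  = $rec->{name}  // '';
    my $state = $rec->{state} // '';
    my $phone = $rec->{phone} // '';

    my @problems;
    push @problems, "customer code '$code' is not numeric"
        unless $code =~ /^\d+\z/;
    push @problems, 'name is empty'
        unless $name =~ /\S/;
    push @problems, "unknown state '$state'"
        unless $is_state{ uc($state) };
    push @problems, "phone '$phone' is not NNN-NNN-NNNN with optional xNN"
        unless $phone =~ /^\d{3}-\d{3}-\d{4}(?:x\d+)?\z/;
    return @problems;
}

# Hypothetical record with a bad state and a short phone number.
my @problems = record_problems(
    { code => '1001', name => 'John Smith', state => 'ZZ', phone => '555-1234' }
);
print "$_\n" for @problems;
```

Checks like these will not catch everything, but they flag the records that need a human to look at them before they get stored.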

Re: Re: Parse mailing addresses with a regex
by ferrency (Deacon) on Jun 23, 2003 at 14:39 UTC
    While "getting a database" is a good idea, it may not solve this person's problem. The problem is, given a large volume of legacy, unparsed, free-form address data, how do you parse it to put it into the database in the first place?

    Unfortunately, that's difficult. Lingua::EN::AddressParse is good if you know what country the address information is for, but it isn't sufficient by itself if you also need to extract country codes from international address data.

    I'm actually about to solve a similar problem myself. If I can't find consistently exploitable patterns in the data, my next tactic will be using Lingua::EN::AddressParse in combination with state/zipcode verification to try to catch all the US addresses, and then to try to exploit patterns in the remaining (international) addresses that AddressParse can't parse effectively.

    Alan

      True. In that case, as you indicate, you try to avoid working with the legacy data. Instead, you do multiple passes. In each pass you look for things that you can parse, and divide the data into the records you just figured out and the leftovers. After a few rounds, the number of leftovers hopefully becomes manageable by hand. You load your database, and then go from there.
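
      A rough sketch of that multi-pass idea, assuming one record per line. The first pattern is the RE from the parent post. The second pattern, the pass names, the run_passes name, and the sample lines are all made up for illustration.

      ```perl
      use strict;
      use warnings;

      # Each pass is a name plus a parser that returns a hashref on success
      # and nothing on failure. The strictest pattern goes first.
      my @passes = (
          [ with_phone => sub {
              my ($line) = @_;
              return unless $line =~ /^(\d+)\s+(.*?)\s+(\d+.*?)\s+(\w\w)\s+(\d{3}-\d{3}-\S+)\s*(.*)/;
              return { code => $1, name => $2, street => $3, state => $4, phone => $5, comment => $6 };
          } ],
          # Hypothetical second pattern: same layout, but no phone or comment.
          [ no_phone => sub {
              my ($line) = @_;
              return unless $line =~ /^(\d+)\s+(.*?)\s+(\d+.*?)\s+(\w\w)\s*$/;
              return { code => $1, name => $2, street => $3, state => $4 };
          } ],
      );

      # Runs every pass over whatever the earlier passes could not parse.
      # Returns (parsed records, leftover lines) as two array refs.
      sub run_passes {
          my ($lines, $passes) = @_;
          my @leftovers = @$lines;
          my @parsed;
          for my $pass (@$passes) {
              my ($name, $parser) = @$pass;
              my @still_unparsed;
              for my $line (@leftovers) {
                  if (my $rec = $parser->($line)) {
                      push @parsed, { %$rec, pass => $name };   # remember which pass matched
                  }
                  else {
                      push @still_unparsed, $line;
                  }
              }
              @leftovers = @still_unparsed;
          }
          return (\@parsed, \@leftovers);
      }

      # Hypothetical sample lines.
      my ($parsed, $leftovers) = run_passes(
          [
              '1001 John Smith 12 Main St PA 223-456-1234x56 call after 5pm',
              '1002 Jane Doe 99 Oak Ave NY',
              'Smith, John, somewhere in Ohio',
          ],
          \@passes,
      );
      printf "%d parsed, %d left for the manual pass\n", scalar @$parsed, scalar @$leftovers;
      ```

      Recording which pass matched each record makes it easier to spot-check a loose pattern later. Whatever is still in the leftovers after the last pass is what you go through by hand.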

      If acquiring legacy data is an ongoing process, you can semi-automate this. But it would be unwise to try to avoid having the final manual pass. A 95% solution is easy. 99.5% is doable. 100% is pretty much impossible.

Re: Re: Parse mailing addresses with a regex
by BrowserUk (Patriarch) on Jun 23, 2003 at 14:42 UTC

    Even if he gets a database, won't he still need to parse the data in order to get it into the DB?


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller