Gangabass has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys

I need to parse addresses. Unfortunately, each address is a plain text string without any separators. Here are some examples:

From each string I must extract the address, city and postal code. So for the given strings it will be:

It's not so hard to get Postal code:

my ($postal) = $dirty_address =~ m/(\w+)\s*$/;

My main problem is separating the city name from the street name (because the city name may be several words). Can you suggest something? Maybe there is a module that does such things?

At the moment my only idea is to look for the first occurrence of a street prefix (St, Rd, Drv) going backwards from the postal code. Once I find it, the part from the prefix up to the postal code is the city name. But I don't know all the prefixes :-(

P.S. This is Yellowpages search results.

Replies are listed 'Best First'.
Re: Parsing addresses
by bruceb3 (Pilgrim) on Sep 19, 2007 at 09:57 UTC
    Here is some code that handles the two cases supplied. It could easily be broken: you would have to make sure that all the types of roads and streets were covered.
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Data::Dumper;

    my @addresses = (
        "305 Ingham Rd Garbutt QLD 4814",
        "Castletown Shoppingworld, Cnr Kings Rd & Woolcock St Hyde Park QLD 4812",
    );

    for my $address (@addresses) {
        $address =~ /(.+) (St|Rd|Crs) (.+) (\w+) (\d+)$/;
        my ($street, $suburb, $state, $postcode) = ($1." ".$2, $3, $4, $5);
        print "'$street' '$suburb' '$state' '$postcode'\n";
    }
    Produces this output-
    '305 Ingham Rd' 'Garbutt' 'QLD' '4814'
    'Castletown Shoppingworld, Cnr Kings Rd & Woolcock St' 'Hyde Park' 'QLD' '4812'

    Use at your own risk.

Re: Parsing addresses
by apl (Monsignor) on Sep 19, 2007 at 09:55 UTC
    It sounds like you're required to keep updating that street prefix list every time you find a new one during testing.

    You're also going to have to be careful in your Cnr / & processing. That's another pair of terminals that will help you break up the text. (By this I mean: whatever comes before the & is another prefix, or part of a prefix.)
Re: Parsing addresses
by Burak (Chaplain) on Sep 19, 2007 at 10:49 UTC
    What you want is not that simple. Maybe one of the Geo::Coder modules can help?
Re: Parsing addresses
by moritz (Cardinal) on Sep 19, 2007 at 09:28 UTC
    This is ambiguous, but you can use the HTML.

    A search for Google yields

    1600 Amphitheatre Pkwy <br />Mountain View, CA 94043

    That <br /> will help you a lot...
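
    A minimal sketch of that idea, assuming the page source really contains the address in the form quoted above. The variable names and the "City, ST 12345" layout of the second part are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical input: the address markup quoted above, as it might appear in the page source.
my $html = '1600 Amphitheatre Pkwy <br />Mountain View, CA 94043';

# Split the street from the locality line at the <br> tag (tolerating <br>, <br/> and <br />).
my ($street, $locality) = split m{\s*<br\s*/?>\s*}i, $html, 2;

# Assumed layout of the second part: "City, ST 12345".
my ($city, $state, $postal) = $locality =~ /^(.+?),\s*(\w+)\s+(\S+)\s*$/
    or die "Unrecognised locality line: $locality\n";

print "'$street' '$city' '$state' '$postal'\n";
```

    For anything beyond a one-off, a real HTML parser would probably be more robust than a regex split.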

Re: Parsing addresses
by tweetiepooh (Hermit) on Sep 19, 2007 at 12:18 UTC
    One idea is to remove what you do know first, then parse the remainder.

    As said, the postal code can be shuffled off first if the format is constant.

    Next, create a list of towns; this will probably be shorter than a list of streets. Now use this list to remove the town/city from the data.

    Parsing out the street bits may be trickier if they are multipart, with business names, street numbers, street names and combinations thereof. What would you do with an address like

    Some business Unit n Sometown retail park Some street Some district Sometown Postal code
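
    The "remove what you know first" steps could be sketched roughly like this. The town list, the split_address name and the "STATE 9999" tail format are assumptions for illustration; a real town list would have to come from a gazetteer or postcode file, and an address like the one above would still need extra rules.

```perl
use strict;
use warnings;

# Hypothetical town list; a real one would come from a gazetteer or postcode file.
my @towns = ('Hyde Park', 'Garbutt');

# Longest names first, so a longer town name is tried before any shorter name it contains.
my $town_re = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @towns;

sub split_address {
    my ($address) = @_;

    # Step 1: peel off the trailing state and postcode (assumed format: "QLD 4814").
    $address =~ s/\s+([A-Z]{2,3})\s+(\d{4})\s*\z//
        or return;
    my ($state, $postcode) = ($1, $2);

    # Step 2: peel off a known town from the end of what is left.
    $address =~ s/\s+($town_re)\s*\z//
        or return;
    my $town = $1;

    # Step 3: whatever remains is treated as the street part.
    return ($address, $town, $state, $postcode);
}

for my $raw ('305 Ingham Rd Garbutt QLD 4814',
             'Castletown Shoppingworld, Cnr Kings Rd & Woolcock St Hyde Park QLD 4812') {
    my @parts = split_address($raw) or next;   # skip addresses that do not fit
    print join(' ', map { "'$_'" } @parts), "\n";
}
```

    Addresses whose town is not in the list are skipped rather than guessed at, so they can be collected and reviewed by hand.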
Re: Parsing addresses
by Gangabass (Vicar) on Sep 20, 2007 at 00:47 UTC