Re: Extracting a (UK) Address

For this kind of task you are probably better off with a state machine: something that lets you identify an interesting section of your document, which can then be analysed for the details you want. The problem then becomes identifying keywords that mark the start of an address field, and having some idea of how an address field ends. If you're lucky, the address field has a fixed number of lines.

Something along these lines:

use strict;
use warnings;

sub flush_address {
  # may need more code to narrow down the address
  print "FOUND:\n>> ", join(">> ", @{$_[0]}), "======\n";
}

my @address;

my $line_span = -1; # -1 disabled; otherwise extract $line_span lines

while (<>) {
   # identify start of an address section (upd.: regexp incomplete)
   push(@address,$_), next if /^\s*(Miss|Mister|Mr\.?|Ms\.?|Her|His)\s+/;
   if (@address) {
        push @address, $_;
        # identify end of an address section
        # regexp matches empty line here but should match something 
        # like "BN2 ..."
        if (@address == $line_span or /^\s*$/) {
          flush_address(\@address);
          @address = ();
        }
   }
}

flush_address(\@address) if @address;

__END__
pb> perl 733738.pl invoice.txt
FOUND:
>> Miss ***** ******
>> 1** Elm ****,
>> Bri*****
>> E*** ******
>> BN2 ***
>>
======

Some free associations of potentially useful links: Parsing addresses, Efficient Fuzzy Matching Of An Address, Extracting Bibliography Citations, validate a postal code, Extracting information from a MS WORD Document, Pull all text from msword document, ...

Update: ... I assumed that the sample invoice was anonymised already, but, just in case, I made the sample output unreadable.
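
As the comment in the loop says, the end-of-section test would ideally match a postcode line such as "BN2 ..." rather than an empty line. A rough sketch of such a test follows. The helper name looks_like_postcode_line is made up for illustration, and the pattern is a deliberate simplification of the UK postcode layout: it accepts some strings that are not real postcodes and misses special cases such as "GIR 0AA".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns 1 if the
# line ends in something shaped like a UK postcode, 0 otherwise.
# Simplified shape: 1-2 letters, a digit, an optional digit or letter,
# optional whitespace, then a digit and two letters.
sub looks_like_postcode_line {
    my ($line) = @_;
    return $line =~ /\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\s*$/i ? 1 : 0;
}

# Try it on a few made-up lines.
for my $line ("1 Elm Road,\n", "Brighton\n", "BN2 1AB\n", "East Sussex BN2 1AB\n") {
    my $text = $line;
    chomp $text;
    printf "%-22s %s\n", $text,
        looks_like_postcode_line($line) ? "end of address" : "-";
}
```

In the loop above, the /^\s*$/ part of the end condition could then be swapped for looks_like_postcode_line($_), assuming the postcode really is the last line of every address in your invoices.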

Re^2: Extracting a (UK) Address by shrdlu (Novice) on Jan 02, 2009 at 21:05 UTC
Studying the invoice closely, it appears that a unique comma terminates the first line of the address. If this is constant, you could possibly 'vector' yourself in from there? (Just a stray thought - does Miss Hocker mind her name and home address being published on the web?!)
Re^3: Extracting a (UK) Address by telemachus (Friar) on Jan 03, 2009 at 01:32 UTC
shrdlu said: "(Just a stray thought - does Miss Hocker mind her name and home address being published on the web?!)" I'm glad to hear that I'm not the only one with this worry. @ropey: you should really dummy up the name and address here.
Re^3: Extracting a (UK) Address by ropey (Hermit) on Jan 23, 2009 at 16:57 UTC
Haha, if someone lives at that address I would be very surprised... it was made up.