in reply to Extracting a (UK) Address

It seems that for this kind of task you are better off with a state machine: something that lets you identify an interesting section of your document, which can then be analysed for the details you want. The problem then becomes identifying keywords that mark the start of an address field and having an idea of how an address field ends. If you're lucky, the address field has a fixed number of lines.

Something along these lines:

use strict;

sub flush_address {
    # may need more code to narrow down the address
    print "FOUND:\n>> ", join(">> ", @{$_[0]}), "======\n";
}

my @address;
my $line_span = -1;  # -1 disabled; otherwise extract $line_span lines

while (<>) {
    # identify start of an address section (upd.: regexp incomplete)
    push(@address, $_), next
        if /^\s*(Miss|Mister|Mr\.?|Ms\.?|Her|His)\s+/;

    if (@address) {
        push @address, $_;
        # identify end of an address section
        # regexp matches empty line here but should match something
        # like "BN2 ..."
        if (@address == $line_span or /^\s*$/) {
            flush_address(\@address);
            @address = ();
        }
    }
}
flush_address(\@address) if @address;

__END__
pb> perl 733738.pl invoice.txt
FOUND:
>> Miss ***** ******
>> 1** Elm ****,
>> Bri*****
>> E*** ******
>> BN2 ***
>> ======
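
The comment in the loop says the end-of-address test should match something like "BN2 ..." rather than an empty line. Here is a minimal sketch of that variant, assuming the address always ends on a line whose last token is shaped like a UK postcode. The names is_postcode_line and extract_addresses are made up for illustration, the pattern checks shape only (it does not validate against real postcodes), and the sample data below is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line ends in something shaped like a
# UK postcode (outward code, optional space, inward code), e.g. "SW1A 2AA".
# Simplified pattern: it checks the shape only, not whether the code exists.
sub is_postcode_line {
    my ($line) = @_;
    return $line =~ /\b[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}\s*$/i ? 1 : 0;
}

# Hypothetical driver: start collecting on a title line, stop on the first
# postcode-shaped line. Returns a list of array refs, one per address.
sub extract_addresses {
    my @lines = @_;
    my (@found, @address);
    for my $line (@lines) {
        if (!@address) {
            # not inside an address yet: only look for a start marker
            push @address, $line
                if $line =~ /^\s*(?:Miss|Mister|Mrs?\.?|Ms\.?)\s+/;
            next;
        }
        push @address, $line;
        if (is_postcode_line($line)) {
            push @found, [@address];   # copy, then reset the state
            @address = ();
        }
    }
    return @found;
}

# Invented sample input, not taken from the invoice
my @sample = (
    "Invoice 42\n",
    "Miss Jane Example\n",
    "12 Elm Road,\n",
    "Brighton\n",
    "BN2 9ZZ\n",
    "Total: 10.00\n",
);
print "FOUND:\n>> ", join(">> ", @$_), "======\n" for extract_addresses(@sample);
```

Unlike the loop above, this version ignores further title lines while it is already inside an address, and it silently drops an address that never reaches a postcode line; whether that is acceptable depends on how regular the invoices are.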

Some free associations of potentially useful links: Parsing addresses, Efficient Fuzzy Matching Of An Address, Extracting Bibliography Citations, validate a postal code, Extracting information from a MS WORD Document, Pull all text from msword document, ...

Update: ... I assumed that the sample invoice was already anonymised but, just in case, made the sample output unreadable.

Re^2: Extracting a (UK) Address
by shrdlu (Novice) on Jan 02, 2009 at 21:05 UTC
    Studying the invoice closely, it appears that a unique comma terminates the first line of the address. If this is constant, you could possibly 'vector' yourself in from there?

    (Just a stray thought - does Miss Hocker mind her name and home address being published on the web?!)

      shrdlu said:
      (Just a stray thought - does Miss Hocker mind her name and home address being published on the web?!)

      I'm glad to hear that I'm not the only one with this worry.

      @ ropey: you should really dummy up the name and address here.

      Haha, if someone lives at that address I would be very surprised... it was made up.