in reply to Extracting a (UK) Address
For this kind of task you are probably better off with a state machine: something that lets you identify an interesting section of your document, which can then be analysed for the details you want. The problem then comes down to identifying keywords that mark the start of an address field and having some idea of how an address field ends. If you're lucky, the address field has a fixed number of lines.
Something along these lines:
    use strict;
    use warnings;

    sub flush_address {
        # may need more code to narrow down the address
        print "FOUND:\n>> ", join(">> ", @{$_[0]}), "======\n";
    }

    my @address;
    my $line_span = -1;   # -1 disabled; otherwise extract $line_span lines

    while (<>) {
        # identify start of an address section (upd.: regexp incomplete)
        push(@address, $_), next
            if /^\s*(Miss|Mister|Mr\.?|Ms\.?|Her|His)\s+/;

        if (@address) {
            push @address, $_;
            # identify end of an address section
            # regexp matches empty line here but should match something
            # like "BN2 ..."
            if (@address == $line_span or /^\s*$/) {
                flush_address(\@address);
                @address = ();
            }
        }
    }
    flush_address(\@address) if @address;

    __END__
    pb> perl 733738.pl invoice.txt
    FOUND:
    >> Miss ***** ******
    >> 1** Elm ****,
    >> Bri*****
    >> E*** ******
    >> BN2 ***
    >> 
    ======
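The comment in the loop above says the end-of-address test should really match something like "BN2 ..." rather than an empty line. Here is a minimal sketch of that idea, assuming the address always ends with a line that finishes in a UK-style postcode. The names `looks_like_postcode` and `extract_addresses` are hypothetical helpers, not part of the code above, and the postcode pattern only checks the general shape (it does not validate real postcodes).

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line ends with something shaped like a
# UK postcode (e.g. outward code "BN2" followed by an inward code).
# Simplified pattern, an assumption for illustration only.
sub looks_like_postcode {
    my ($line) = @_;
    return $line =~ /\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\s*$/i ? 1 : 0;
}

# Hypothetical helper: walk a list of lines and collect address blocks.
# A block starts at a title line (same incomplete regexp as above) and
# ends at the first following line that looks like a postcode.
sub extract_addresses {
    my @found;
    my @address;
    for my $line (@_) {
        if (!@address) {
            # not inside an address yet: only a title line starts one
            push @address, $line
                if $line =~ /^\s*(?:Miss|Mister|Mr\.?|Ms\.?|Her|His)\s+/;
            next;
        }
        push @address, $line;
        if (looks_like_postcode($line)) {
            push @found, [@address];   # store a copy of the finished block
            @address = ();
        }
    }
    return @found;   # a block with no postcode line is silently dropped
}
```

To use it on a file, something like `my @blocks = extract_addresses(<>);` should do; each element is an array reference holding the lines of one address. If the postcode can be missing, you would probably want to keep the `$line_span` limit from the original loop as a fallback so a block cannot run on forever.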
Some free associations of potentially useful links: Parsing addresses, Efficient Fuzzy Matching Of An Address, Extracting Bibliography Citations, validate a postal code, Extracting information from a MS WORD Document, Pull all text from msword document, ...
Update: ... I assumed that the sample invoice was already anonymised, but, just in case, I made the sample output unreadable.
Replies are listed 'Best First'.
Re^2: Extracting a (UK) Address
by shrdlu (Novice) on Jan 02, 2009 at 21:05 UTC
by telemachus (Friar) on Jan 03, 2009 at 01:32 UTC
by ropey (Hermit) on Jan 23, 2009 at 16:57 UTC