Re: Extracting a (UK) Address
by Perlbotics (Archbishop) on Jan 02, 2009 at 12:39 UTC
It seems that for this kind of task you are better off with a state machine: something that lets you identify an interesting section of your document, which can then be analysed for the interesting stuff.
So the problem is to identify keywords that mark the start of an address field and to have some idea of how an address field ends. If you're lucky, the address field has a fixed number of lines.
Something along these lines:
use strict;
sub flush_address {
    # may need more code to narrow down the address
    print "FOUND:\n>> ", join(">> ", @{$_[0]}), "======\n";
}
my @address;
my $line_span = -1; # -1 disabled; otherwise extract $line_span lines
while (<>) {
    # identify start of an address section (upd.: regexp incomplete)
    push(@address, $_), next if /^\s*(Miss|Mister|Mr\.?|Ms\.?|Her|His)\s+/;
    if (@address) {
        push @address, $_;
        # identify end of an address section
        # regexp matches empty line here but should match something
        # like "BN2 ..."
        if (@address == $line_span or /^\s*$/) {
            flush_address(\@address);
            @address = ();
        }
    }
}
flush_address(\@address) if @address;
__END__
pb> perl 733738.pl invoice.txt
FOUND:
>> Miss ***** ******
>> 1** Elm ****,
>> Bri*****
>> E*** ******
>> BN2 ***
>>
======
Some free associations of potentially useful links:
Parsing addresses, Efficient Fuzzy Matching Of An Address, Extracting Bibliography Citations,
validate a postal code, Extracting information from a MS WORD Document, Pull all text from msword document, ...
Update: ... I assumed that the sample invoice was already anonymised but, just in case, made the sample output unreadable.
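The comment in the code above says the end-of-section test should really match something like "BN2 ..." rather than an empty line. A minimal sketch of that variant follows. The helper name extract_addresses, the title list and the postcode pattern are my assumptions, not something taken from the original invoices; the pattern only checks the rough shape of a postcode, and lines are assumed to be chomped already.

```perl
use strict;
use warnings;

# Hypothetical helper: collect address blocks from a list of chomped lines.
# A block starts at a line beginning with a title and ends at the first
# line shaped like a UK postcode (e.g. "BN2 7HB"), or at a blank line.
sub extract_addresses {
    my @lines = @_;
    my $title = qr/^\s*(?:Miss|Mister|Mrs?\.?|Ms\.?)\s+/;
    # Loose shape check only; it does not validate real postcodes.
    my $postcode = qr/^\s*[A-Z]{1,2}[0-9][0-9A-Z]?\s*[0-9][A-Z]{2}\s*$/i;

    my (@found, @address);
    for my $line (@lines) {
        if (!@address) {
            # idle state: wait for a line that starts with a title
            push @address, $line if $line =~ $title;
            next;
        }
        if ($line =~ /^\s*$/) {
            # blank line: close the block without a postcode
            push @found, [@address];
            @address = ();
            next;
        }
        push @address, $line;
        if ($line =~ $postcode) {
            # postcode line closes the block
            push @found, [@address];
            @address = ();
        }
    }
    push @found, [@address] if @address;    # unterminated block at end of input
    return @found;
}

# Made-up sample data, not a real address.
my @sample = (
    "VAT No: 000 0000 00",
    "Mr A. N. Other",
    "1 Example Road,",
    "Sometown",
    "AB1 2CD",
    "DESCRIPTION",
);
my @blocks = extract_addresses(@sample);
print "FOUND:\n";
print ">> $_\n" for @{ $blocks[0] };
print "======\n";
```

Blocks that end on a blank line instead of a postcode are still returned, so they can be inspected or rejected later.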
Haha, if someone lives at that address I would be very surprised... it was made up.
Re: Extracting a (UK) Address
by Bloodnok (Vicar) on Jan 02, 2009 at 12:21 UTC
I may be teaching grannies to suck eggs, but here goes anyway ... IMO, you need to identify and use one or more invariant properties of the address block, e.g. ...
- Always in the same relative/absolute location in the invoice &/or ...
- Has an identifying header &/or ...
- .
- .
Even better is if, of the invariants thus identified, at least one can be demonstrated to be unique for the address block.
In your supplementary data example, it would appear that the address block is the 3rd block where each block is separated from the next by \n{2,}.
Having identified and isolated the address block, it then becomes a simpler matter of parsing the address details...
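As a rough sketch of the "3rd block" idea, assuming the text has already been extracted from the Word doc into a string and that blocks really are separated by \n{2,}. The helper name nth_block and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into blocks separated by two or more
# newlines and return the block at a 1-based position (undef if absent).
sub nth_block {
    my ($text, $n) = @_;
    my @blocks = grep { /\S/ } split /\n{2,}/, $text;
    return $blocks[$n - 1];
}

# Made-up sample data, not a real invoice.
my $invoice = join "\n",
    "Invoice",
    "",
    "Invoice No: X0000-0000",
    "Invoice Date: 01/01/2008",
    "",
    "Mr A. N. Other",
    "1 Example Road,",
    "Sometown",
    "AB1 2CD",
    "",
    "DESCRIPTION";

my $address = nth_block($invoice, 3);
print "$address\n" if defined $address;
```

If the separating lines can contain stray spaces or tabs, the split pattern would need loosening, since \n{2,} only matches strictly empty lines.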
Thinx: OTOH, there may be a chance that the address is stored as a formatted block in the Word doc - so using Win32::OLE may be a first step to read the object direct from the doc...
Thinx again: It's highly probable that a combination of the 2 would be required to handle the inconsistencies introduced by the evolution of the doc...
A user level that continues to overstate my experience :-))
Re: Extracting a (UK) Address
by u671296 (Sexton) on Jan 02, 2009 at 10:38 UTC
Hi,
The solution will depend very much upon the data you need to extract the address from. Are there any delimiters? Fixed-length fields? It would help if you could include some example data, though I appreciate you may want to change it for security purposes.
The format may vary from time to time (as the invoices have evolved); one example would be something like:
Invoice
Invoice No: C0331-2008
Invoice Date:27/02/2008
VAT No: 679 7113 03
Miss Carol Hocker
177 Elm Road,
Brighton
East Sussex
BN2 7HB
DESCRIPTION
AMOUNT
TOTAL
Corian worktops supply and fit
£3083.15
Neff double oven
£599.00
Neff gas hob
£298.00
Baumatic extractor hood
£419.00
Neff dishwasher
£420.00
Ducting kit
£30.00
Franke swing spray tap
£175.00
Baumatic Microwave
£219.00
Double sockets x 4 £160.00
Single sockets x 3 £108.00
Fused spurs x 2 £ 72.00
Cooker control panel £ 55.00
5 triangle lights £120.00
Supply and fit new fuse board
The electrics will be invoiced by electrician and are plus vat
£5243.15
PAYMENTS RECEIVED
AMOUNT
TOTAL
Payment now due
£ 2184.02
So you are looking for three or more lines together, the last ending in something that looks like a post code...
$letter =~ m/((?:[^\n]+\n){2,}[^\n]*?[a-zA-Z]+[0-9]+\s+[0-9]+[a-zA-Z]+\s*?\n)\s*?\n/
...seemed to do the trick, where the entire letter was read into $letter. Obviously this will miss addresses with no post code or really rubbish post codes. You could just extract all groups of 3 or more lines, and then apply some more cunning address recogniser to the result -- perhaps from one of the modules recommended elsewhere.
(I haven't tried to figure out how much work this is asking the regex engine to do on difficult input. I'd worry about that only if it becomes a problem.)
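For anyone who wants to try it, here is a small self-contained sketch of how that regex might be applied. The wrapper name find_address and the sample letter are made up; in real use $letter would be slurped from the file (for example with local $/; my $letter = <>;). Note that the pattern expects an empty line after the post code line.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex quoted above: returns the first
# run of three or more lines ending in something shaped like a post code.
sub find_address {
    my ($letter) = @_;
    return $1
        if $letter =~ m/((?:[^\n]+\n){2,}[^\n]*?[a-zA-Z]+[0-9]+\s+[0-9]+[a-zA-Z]+\s*?\n)\s*?\n/;
    return;    # undef in scalar context when nothing matches
}

# Made-up sample data, not a real invoice.
my $letter = <<'END';
Invoice

Invoice No: X0000-0000
Invoice Date: 01/01/2008

Mr A. N. Other
1 Example Road,
Sometown
AB1 2CD

DESCRIPTION
END

my $address = find_address($letter);
print defined $address ? $address : "no address found\n";
```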
I have little to add to the other suggestions already made. I think it is unlikely you can home in on the address without it being delimited in some way.
The example above clearly delimits with "VAT No:" and "DESCRIPTION". I think you'll need to know what variations the invoices have had over the years and code for all of them.
Other tricks might help, e.g.:
- Is there always a titled name (Mr, Mrs, Miss, Ms etc.) at the start of the address? If so, analyze all the names in your dataset to identify all unique titles.
- "177 Elm Road," is the only line that starts with a number, so the address is in that block.
- Address lines are the only ones that end with a comma, so use that block.
- Is the address always in the first n lines of the invoice?
- Does the address always have a county in it?
- etc.
Also, if you have access to postcode validation (a database?), that could help.
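The tricks above could be combined into a rough score per candidate block. This is only a sketch: the helper name address_score, the title list and the postcode pattern are my assumptions, and the postcode test checks shape only, which is no substitute for a real postcode database.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many address hints a block of lines shows.
# Higher scores mean "more address-like"; each hint is weak on its own.
sub address_score {
    my @lines = @_;
    my $score = 0;
    # first line starts with a title
    $score++ if @lines && $lines[0] =~ /^\s*(?:Mr|Mrs|Miss|Ms)\b/;
    # some line starts with a number, like a house number
    $score++ if grep { /^\s*\d+\s+\S/ } @lines;
    # some line ends with a comma
    $score++ if grep { /,\s*$/ } @lines;
    # some line is shaped like a UK postcode (shape only)
    $score++ if grep { /^\s*[A-Z]{1,2}[0-9][0-9A-Z]?\s*[0-9][A-Z]{2}\s*$/i } @lines;
    return $score;
}

# Made-up sample blocks, not real data.
my @address_block = ("Mr A. N. Other", "1 Example Road,", "Sometown", "AB1 2CD");
my @header_block  = ("DESCRIPTION", "AMOUNT", "TOTAL");

printf "address block: %d\n", address_score(@address_block);
printf "header block:  %d\n", address_score(@header_block);
```

Blocks scoring below some agreed threshold could then go to the pile of exceptions to be handled by hand.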
Whatever, I assume you will end up with many invoices that can't be correctly handled, so you'll need to agree how to handle those exceptions.
Re: Extracting a (UK) Address
by Sagacity (Monk) on Jan 03, 2009 at 08:44 UTC
Hi,
It has been a while, but I once handled this by saving the Word docs as RTF.
The added RTF codes gave me the regexes to use. It just seemed easier at the time, and it blocked the data into recognizable patterns that could then be used to suck out the needed information.
Give one a try,
and you'll see what I'm talking about.
Good Luck!