in reply to Re^3: Extracting a (UK) Address
in thread Extracting a (UK) Address

e.g. is there always a titled name (Mr, Mrs, Miss, Ms etc.) at the start of the address? If so, analyze all the names in your dataset to identify all unique titles.

177 Elm Road is the only line that starts with a number, so the address is in that block

Address lines are the only ones that end with a comma, so use that block

The address is always in the first n lines of the invoice?

The address always has a county in it?

I would assume that there isn't always a title - invoices are often addressed to a company, or to "The Mumble officer".

Addresses don't always include a number. Some houses only have names, e.g. Mumble Farm, Mumblevillage, Kent

Address lines don't always end in a comma - although that at least should be consistent in any particular version of an application. Should. Unless addresses were entered manually in a multi-line text box.

Addresses don't always have a county in them. The county is no longer required.

And you won't have a post code for all addresses. The solution *in this case* is to look for whatever is between "Vat No: blahblah" and "description". In the general case, you're buggered, because you can't tell the difference between this:

My new address is as follows. Please send all correspondence to: Tottenham Hotspur, Edinburgh Rd, Bexhill. Blah blah blah blah
(Yes, that's a real address - someone who lives near my parents named his house after the football club.)

and this:

In my address to the society I looked at the apparent links between: Tottenham Hotspur, Edinburgh Road, Bexhill. I used them to demonstrate the difference between correlation and causation.
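For the "in this case" approach, here is a rough sketch of grabbing whatever sits between the VAT number line and the "description" heading. It's in Python just for illustration. The marker wording, the regex and the function name extract_address are my assumptions based on the layout discussed in this thread, so adjust them to whatever the real invoices say.

```python
import re

# Assumed markers: a line containing "Vat No: ..." followed, some lines later,
# by a line that starts with "description". Real invoices may word these differently.
BLOCK_RE = re.compile(
    r"vat\s+no\s*:[^\n]*\n(.*?)^\s*description\b",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)

def extract_address(invoice_text):
    """Return the non-blank lines between the two markers, or None if not found."""
    match = BLOCK_RE.search(invoice_text)
    if match is None:
        return None
    # Strip whitespace and any trailing comma, since commas are not reliable either.
    lines = [line.strip().rstrip(",") for line in match.group(1).splitlines()]
    return [line for line in lines if line]
```

Note that this leans entirely on the two markers and makes no attempt to recognise an address as such, which is exactly why it only works for this one layout.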

The solution we adopt where I work to a somewhat similar problem is to have the machine look for the data and attempt to parse it, but to show a person the data *in context* before committing it to the database. This is much quicker than having a person find/highlight/cut/paste the data, while still having reasonable quality control. Where the computer gets it right, it's a lot faster than a person, and where the computer gets it wrong, a person can fix it.
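To make the "show it in context before committing" idea concrete, here is a minimal sketch. It is not the system we actually use; the function names and the ask/save callbacks are hypothetical stand-ins for whatever UI and database layer you have.

```python
def propose_with_context(lines, start, end, context=2):
    """Render lines[start:end] plus a few surrounding lines, marking the candidate with '>>'."""
    lo = max(0, start - context)
    hi = min(len(lines), end + context)
    rendered = []
    for i in range(lo, hi):
        marker = ">>" if start <= i < end else "  "
        rendered.append(f"{marker} {lines[i]}")
    return "\n".join(rendered)

def review_and_commit(lines, start, end, ask, save):
    """Show the machine's guess in context, then save it, save a correction, or skip.

    ask(display_text) is a hypothetical callback: it returns "" to accept the guess,
    "skip" to reject it, or replacement text typed by the reviewer.
    save(address_lines) is a hypothetical callback that writes to the database.
    """
    candidate = lines[start:end]
    answer = ask(propose_with_context(lines, start, end))
    if answer == "skip":
        return None
    if answer:
        candidate = answer.splitlines()
    save(candidate)
    return candidate
```

The point of the design is that the person only confirms or corrects; the finding and cutting is left to the machine.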