e.g. is there always a title (Mr, Mrs, Miss, Ms etc.) at the start of the address? If so, analyze all the names in your dataset to identify all unique titles.

177 Elm Road is the only line that starts with a number, so the address is in that block.

Address lines are the only ones that end with a comma, so use that block.

Is the address always in the first n lines of the invoice?

Does the address always have a county in it?

I would assume that there isn't always a title - invoices are often addressed to a company, or to "The Mumble officer".

Addresses don't always include a number. Some houses only have names, e.g. Mumble Farm, Mumblevillage, Kent.

Address lines don't always end in a comma - although that at least should be consistent in any particular version of an application. Should. Unless addresses were entered manually in a multi-line text box.

Addresses don't always have a county in them. The county is no longer required.

And you won't have a post code for all addresses.

The solution *in this case* is to look for whatever is between "Vat No: blahblah" and "description". In the general case, you're buggered, because you can't tell the difference between this:

My new address is as follows. Please send all correspondence to: Tottenham Hotspur, Edinburgh Rd, Bexhill. Blah blah blah blah
(Yes, that's a real address: someone who lives near my parents named his house after the football club.)

and this:

In my address to the society I looked at the apparent links between: Tottenham Hotspur, Edinburgh Road, Bexhill. I used them to demonstrate the difference between correlation and causation.
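For the specific case, taking whatever sits between the "Vat No:" line and "description" might look roughly like the sketch below. The sample invoice text, the function name `extract_between` and the exact marker patterns are assumptions made up for illustration; the real invoices may well need different patterns.

```python
import re

# Made-up sample; the layout of the real invoices is an assumption.
SAMPLE_INVOICE = """Mumble Supplies Ltd
Vat No: 123 4567 89
Mumble Farm,
Mumblevillage,
Kent
Description   Qty   Price
Widgets       2     10.00
"""

def extract_between(text,
                    start_pattern=r"Vat No:[^\n]*\n",
                    end_pattern=r"description"):
    """Return the non-empty lines between the two markers, or None if
    either marker is missing. Trailing commas are stripped."""
    pattern = re.compile(start_pattern + r"(.*?)" + end_pattern,
                         re.DOTALL | re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    lines = [line.strip().rstrip(",") for line in match.group(1).splitlines()]
    return [line for line in lines if line]

address = extract_between(SAMPLE_INVOICE)
```

This only works because the two markers bracket the address in this one layout. It says nothing about whether the text in between really is an address, which is exactly the problem in the general case.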

The solution we adopt where I work to a somewhat similar problem is to have the machine look for the data and attempt to parse it, but to show a person the data *in context* before committing it to the database. This is much quicker than having a person find/highlight/cut/paste the data, while still having reasonable quality control. Where the computer gets it right, it's a lot faster than a person, and where the computer gets it wrong, a person can fix it.
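A minimal sketch of that workflow, in Python, could look like the following. It is not our actual system: the names `context_window` and `review`, the `decide` callback standing in for the person, and the plain list standing in for the database are all hypothetical.

```python
def context_window(lines, start, end, margin=2):
    """Return the candidate lines plus up to `margin` lines either side,
    with the candidate lines marked by '>> '."""
    lo = max(0, start - margin)
    hi = min(len(lines), end + margin)
    return [(">> " if start <= i < end else "   ") + lines[i]
            for i in range(lo, hi)]

def review(lines, start, end, decide, store):
    """Show the machine's guess (lines[start:end]) in context and commit
    only what the reviewer hands back.

    `decide` stands in for the person: it receives the context and the
    candidate, and returns the final lines to store, or None to reject.
    """
    candidate = lines[start:end]
    shown = context_window(lines, start, end)
    final = decide(shown, candidate)
    if final is not None:
        store.append(final)
    return final

# Example: the reviewer accepts the machine's guess unchanged.
invoice_lines = ["Invoice 42", "Vat No: 1", "Mumble Farm,", "Kent",
                 "Description", "Widgets"]
database = []
review(invoice_lines, 2, 4, lambda shown, candidate: candidate, database)
```

In a real tool `decide` would be a screen showing the marked-up context with accept, edit and reject options; the point is only that nothing reaches the store without passing through it.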


In reply to Re^4: Extracting a (UK) Address by DrHyde
in thread Extracting a (UK) Address by ropey
