It seems that you are better off with a state machine for this kind of task: something that lets you identify an interesting section of your document, which can then be analysed for the details you need. So the problem becomes identifying keywords that mark the start of an address field and having an idea of how an address field ends. If you're lucky, the address field has a fixed number of lines.

Something along these lines:

    use strict;

    sub flush_address {
      # may need more code to narrow down the address
      print "FOUND:\n>> ", join(">> ", @{$_[0]}), "======\n";
    }

    my @address;
    my $line_span = -1; # -1 disabled; otherwise extract $line_span lines

    while (<>) {
      # identify start of an address section (upd.: regexp incomplete)
      push(@address,$_), next if /^\s*(Miss|Mister|Mr\.?|Ms\.?|Her|His)\s+/;

      if (@address) {
        push @address, $_;
        # identify end of an address section
        # regexp matches empty line here but should match something
        # like "BN2 ..."
        if (@address == $line_span or /^\s*$/) {
          flush_address(\@address);
          @address = ();
        }
      }
    }
    flush_address(\@address) if @address;

    __END__
    pb> perl 733738.pl invoice.txt
    FOUND:
    >> Miss ***** ******
    >> 1** Elm ****,
    >> Bri*****
    >> E*** ******
    >> BN2 ***
    >> 
    ======
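The comment in the loop says the end-of-address test should really match something like "BN2 ..." instead of an empty line. Here is a rough sketch of that idea, not a tested solution for your invoices. The helper name extract_addresses, the pattern $POSTCODE_RE and the sample lines are all made up for illustration, and the postcode pattern is a deliberately simplified assumption, not a full validator.

```perl
use strict;
use warnings;

# Assumed, simplified pattern for a UK postcode at the end of a line,
# e.g. "BN2 1AB" or "SW1A 2AA". It does not validate real postcodes.
my $POSTCODE_RE = qr/\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\s*$/i;

# Hypothetical helper: takes a reference to an array of lines and returns
# a list of address blocks, each an array reference of trimmed lines.
sub extract_addresses {
    my ($lines) = @_;
    my (@found, @address);

    for my $line (@$lines) {
        (my $text = $line) =~ s/\s+$//;    # strip newline and trailing blanks

        if (!@address) {
            # state "outside": wait for a salutation that opens an address
            push @address, $text
                if $text =~ /^\s*(?:Miss|Mister|Mrs|Mr|Ms)\b\.?\s+/;
            next;
        }

        # state "inside": collect lines until a postcode or a blank line
        if ($text =~ /^\s*$/) {
            push @found, [@address];
            @address = ();
        }
        elsif ($text =~ $POSTCODE_RE) {
            push @found, [@address, $text];
            @address = ();
        }
        else {
            push @address, $text;
        }
    }

    push @found, [@address] if @address;    # input ended inside an address
    return @found;
}

# Made-up sample input, only to show how the helper is called.
my @sample = (
    "Invoice 42\n",
    "Miss Jane Doe\n",
    "12 Elm Road,\n",
    "Brighton\n",
    "East Sussex\n",
    "BN2 1AB\n",
    "Total: 10.00\n",
);

for my $block (extract_addresses(\@sample)) {
    print "FOUND:\n", map({ ">> $_\n" } @$block), "======\n";
}
```

The blank-line test is kept as a fallback so that an address without a recognisable postcode still gets flushed. The fixed-length cutoff from the original loop ($line_span) is left out here to keep the sketch short.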

Some free associations of potentially useful links: Parsing addresses, Efficient Fuzzy Matching Of An Address, Extracting Bibliography Citations, validate a postal code, Extracting information from a MS WORD Document, Pull all text from msword document, ...

Update: ... I assumed that the sample invoice was already anonymised but, just in case, I masked the sample output.


In reply to Re: Extracting a (UK) Address by Perlbotics
in thread Extracting a (UK) Address by ropey
