We have millions of entries like the following:
SMITH. - John. 'Beloved husband of M a r y, dearly loved Dad of Jack, Jill and Jane, much loved Grandad of May, Alf-red and Elijah. Passed away on February 11, 2006, aged 82 years. Funeral service to be held at Wickley Crematorium Bondis Hill Chapel on Thursday, February 28, at 2.15pm. Flowers may be sent care of J.B. Smith and Sons Ltd, 93 South Park Road, Iliffe. SX1 2PY.

We would like to parse this text, and derive some semantic meaning from the names, dates and places, for instance:

As a side benefit, it would be good to tidy up the text, e.g.:

SMITH. - John. 'Beloved => John Smith. Beloved
M a r y => Mary
Alf-red => Alfred
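To make those three tidy-ups concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the function name `tidy`, the tiny `KNOWN_NAMES` set and the exact regular expressions are made up for this example, and a real system would need a much larger name list and more rules.

```python
import re

# Hypothetical lookup list; a real system would load a far larger dictionary.
KNOWN_NAMES = {"Alfred", "Mary", "John"}

def tidy(notice, known_names=KNOWN_NAMES):
    """Apply three illustrative clean-up filters to one notice."""
    # "SMITH. - John. 'Beloved" -> "John Smith. Beloved"
    text = re.sub(
        r"^([A-Z]+)\.\s*-\s*([A-Z][a-z]+)\.\s*'?",
        lambda m: f"{m.group(2)} {m.group(1).capitalize()}. ",
        notice,
    )
    # "M a r y" -> "Mary": three or more single letters separated by single spaces.
    text = re.sub(
        r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b",
        lambda m: m.group(0).replace(" ", ""),
        text,
    )

    # "Alf-red" -> "Alfred", but only when the joined form is a known name,
    # so genuine hyphenated words are left alone.
    def join_if_known(m):
        joined = m.group(1) + m.group(2)
        return joined if joined in known_names else m.group(0)

    return re.sub(r"\b([A-Z][a-z]+)-([a-z]+)\b", join_if_known, text)

sample = ("SMITH. - John. 'Beloved husband of M a r y, "
          "much loved Grandad of May, Alf-red and Elijah.")
print(tidy(sample))
```

Checking the hyphen rule against a dictionary is a deliberate choice here: a blanket "remove hyphens" filter would also mangle names that are meant to be hyphenated.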

Not all of these records follow this same format. For instance, some of them may begin with "Look who's 60!"

I can imagine a series of filters that deal with known formatting errors, followed by some keyword (and key-phrase) extraction and matching against dictionaries and lists of people's names and place names.
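The extraction step could be sketched roughly as follows, in Python, assuming the text has already been cleaned up. Everything named here is hypothetical: `extract`, the small `KNOWN_FIRST_NAMES` set, the list of relation words and the simplified UK-style postcode pattern are illustrations only, not a complete or reliable parser.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_ALT = "|".join(MONTHS)

# Hypothetical dictionaries; real ones would be much larger.
KNOWN_FIRST_NAMES = {"John", "Mary", "Jack", "Jill", "Jane",
                     "May", "Alfred", "Elijah"}
RELATION_WORDS = "husband|wife|Dad|Mum|Grandad|Nan"

DEATH_RE = re.compile(rf"Passed away on ({MONTH_ALT}) (\d{{1,2}}), (\d{{4}})")
AGE_RE = re.compile(r"\baged (\d{1,3}) years\b")
RELATION_RE = re.compile(
    rf"\b({RELATION_WORDS}) of ([A-Z][a-z]+(?:(?:, | and )[A-Z][a-z]+)*)")
# Simplified postcode pattern (an assumption, not the full UK format).
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}\b")

def extract(notice):
    """Pull a few structured fields out of one cleaned-up notice."""
    record = {"relations": {}, "unknown_names": []}
    m = DEATH_RE.search(notice)
    if m:
        month, day, year = m.groups()
        record["died"] = date(int(year), MONTHS.index(month) + 1, int(day))
    m = AGE_RE.search(notice)
    if m:
        record["age"] = int(m.group(1))
    # Key phrases such as "Dad of Jack, Jill and Jane" give relation + names.
    for relation, names in RELATION_RE.findall(notice):
        for name in re.split(r", | and ", names):
            if name in KNOWN_FIRST_NAMES:
                record["relations"].setdefault(relation, []).append(name)
            else:
                record["unknown_names"].append(name)  # flag for review
    m = POSTCODE_RE.search(notice)
    if m:
        record["postcode"] = m.group(0)
    return record

sample = ("John Smith. Beloved husband of Mary, dearly loved Dad of Jack, "
          "Jill and Jane, much loved Grandad of May, Alfred and Elijah. "
          "Passed away on February 11, 2006, aged 82 years. Flowers may be "
          "sent care of J.B. Smith and Sons Ltd, 93 South Park Road, "
          "Iliffe. SX1 2PY.")
print(extract(sample))
```

Note that "May" is both a month and a given name in this sample; anchoring the date on the phrase "Passed away on" and the names on "... of" is one way to keep the two apart. Records that do not follow this layout (such as birthday notices) would simply come back with fewer fields filled in.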

For a human this is easy, but how big (and how doable) a task would it be for a programmer to automate? How long would a good developer need to produce a semi-reliable system for this, if it can be done at all? Would it be feasible to run the process online (that is, when the notice is created), or would that be too time-consuming?

thanks

Clint


In reply to Extracting structured data from unstructured text - just how difficult would this be? by clinton
