SMITH. - John. 'Beloved husband of M a r y, dearly loved Dad of Jack, Jill and Jane, much loved Grandad of May, Alf-red and Elijah. Passed away on February 11, 2006, aged 82 years. Funeral service to be held at Wickley Crematorium Bondis Hill Chapel on Thursday, February 28, at 2.15pm. Flowers may be sent care of J.B. Smith and Sons Ltd, 93 South Park Road, Iliffe. SX1 2PY.
We would like to parse this text and derive some semantic meaning from the names, dates and places it contains.
As a side benefit, it would be good to tidy up the text, e.g.:
SMITH. - John. 'Beloved => John Smith. Beloved
M a r y => Mary
Alf-red => Alfred
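To make the tidy-up concrete, here is a rough Python sketch of those three fixes. The function names and the tiny KNOWN_NAMES set are made up purely for illustration; a real system would need a proper list of names and certainly more rules than these.

```python
import re

# Hypothetical stand-in for a real dictionary of first names.
KNOWN_NAMES = {"Alfred", "Mary", "John", "Elijah"}

# "SMITH. - John. '" at the very start of a notice (simple surnames only).
HEADING_RE = re.compile(r"^([A-Z]{2,})\.\s*-\s*([A-Z][a-z]+)\.\s*'?")
# Three or more single letters separated by single spaces, e.g. "M a r y".
SPACED_RE = re.compile(r"\b[A-Za-z](?: [A-Za-z]\b){2,}")
# A word broken by a stray hyphen, e.g. "Alf-red".
HYPHEN_RE = re.compile(r"\b([A-Za-z]+)-([a-z]+)\b")


def fix_heading(text):
    """Turn "SMITH. - John. 'Beloved" into "John Smith. Beloved"."""
    m = HEADING_RE.match(text)
    if not m:
        return text  # other formats, e.g. "Look who's 60!", pass through
    surname, forename = m.group(1).capitalize(), m.group(2)
    return f"{forename} {surname}. {text[m.end():]}"


def join_spaced_letters(text):
    """Collapse runs of spaced-out single letters into one word."""
    return SPACED_RE.sub(lambda m: m.group(0).replace(" ", ""), text)


def join_broken_names(text):
    """Remove a hyphen only when the joined word is a known name."""
    def repl(m):
        joined = m.group(1) + m.group(2)
        return joined if joined in KNOWN_NAMES else m.group(0)
    return HYPHEN_RE.sub(repl, text)


def tidy(text):
    """Apply the filters in order; each one leaves unknown input alone."""
    for step in (fix_heading, join_spaced_letters, join_broken_names):
        text = step(text)
    return text


print(tidy("SMITH. - John. 'Beloved husband of M a r y"))
```

Checking the joined word against a names list is what stops the hyphen filter from mangling legitimate hyphenated words; the other two filters are plain pattern matches and will have false positives on odd input.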
Not all of these records follow this same format. For instance, some of them may begin with "Look who's 60!"
I can imagine a series of filters that deal with known formatting errors, followed by some keyword (and key phrase) extraction and matching against dictionaries and lists of people's names and place names.
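As a rough idea of the key phrase extraction step, here is a minimal Python sketch that runs on an already tidied copy of the notice above. The phrases it looks for ("passed away on", "aged ... years", "husband of" and so on), the postcode pattern and the record layout are all my own assumptions, not a worked-out design.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Key phrases assumed to introduce each field.
DIED_RE = re.compile(
    r"passed away on (" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})",
    re.IGNORECASE)
AGE_RE = re.compile(r"aged (\d{1,3}) years")
# Loose UK-style postcode pattern; it does not validate real postcodes.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b")
# A relationship word, "of", then capitalised names joined by ", " or " and ".
RELATION_RE = re.compile(
    r"\b(husband|wife|Dad|Mum|Grandad|Grandma) of "
    r"([A-Z][a-z]+(?:(?:, | and )[A-Z][a-z]+)*)")

NOTICE = ("John Smith. Beloved husband of Mary, dearly loved Dad of Jack, "
          "Jill and Jane, much loved Grandad of May, Alfred and Elijah. "
          "Passed away on February 11, 2006, aged 82 years. Flowers may be "
          "sent care of J.B. Smith and Sons Ltd, 93 South Park Road, "
          "Iliffe. SX1 2PY.")


def extract(text):
    """Return whatever fields can be found; missing ones stay None/empty."""
    record = {"died": None, "age": None, "postcode": None, "relations": {}}
    m = DIED_RE.search(text)
    if m:
        month = MONTHS.index(m.group(1).capitalize()) + 1
        record["died"] = date(int(m.group(3)), month, int(m.group(2)))
    m = AGE_RE.search(text)
    if m:
        record["age"] = int(m.group(1))
    m = POSTCODE_RE.search(text)
    if m:
        record["postcode"] = m.group(0)
    for relation, names in RELATION_RE.findall(text):
        record["relations"][relation] = re.split(r", | and ", names)
    return record


print(extract(NOTICE))
```

Notices in another format simply come back with empty fields, which could be used to flag them for a human to look at.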
For a human this is easy, but how big a task would it be for a programmer to automate, and is it doable at all? How long would a good developer need to produce a semi-reliable system for this? Would it be feasible to run the process online (that is, at the moment the notice is created), or would it be too time-consuming?
thanks
Clint
In reply to Extracting structured data from unstructured text - just how difficult would this be? by clinton