I see basically two ways to approach the extraction of data from the text:

Text understanding - the program would, by some magic, understand the text and analyze it for subjects, objects and verbs and how they relate to each other. This is a problem that AI projects have tackled and, as far as I know, failed to solve. On the other hand, your problem is restricted to a fairly limited domain (notices about births, anniversaries and deaths), so the set of words, relations and synonyms might be limited too.
Phrase recognition - this is what Perl can easily do, but without a sufficient corpus, it's hard to come up with the phrases needed to extract the information from each notice. The extraction would be an iterative process in which the rules/phrases need to be weighted to exclude conflicts while still maximizing the amount of information extracted. I could imagine that each notice gets split up into sentences and the program then prompts you to craft a regular expression to handle the "most common" form of sentence in the remaining notices (maybe commonality is measured by the count of common words?).
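
To make the phrase recognition idea a bit more concrete, here is a minimal sketch of weighted regex rules applied sentence by sentence. Everything in it is an assumption for illustration: the rule table, the field names (name, died, age), the weights and the sample notice are all made up, and the sentence splitting is deliberately naive. It needs Perl 5.10 or later for named captures.

```perl
use strict;
use warnings;

# Hypothetical rule table: each rule is a weight plus a regex with
# named captures. When two rules capture the same field, the rule
# with the higher weight wins.
my @rules = (
    { weight => 10, re => qr/\b(?<name>[A-Z][a-z]+ [A-Z][a-z]+) died on (?<died>\d{4}-\d{2}-\d{2})/ },
    { weight =>  5, re => qr/\baged (?<age>\d{1,3})\b/ },
    { weight =>  1, re => qr/\b(?<age>\d{1,3}) years\b/ },
);

sub extract_fields {
    my ($notice) = @_;
    my (%value, %best);

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for my $sentence (split /(?<=[.!?])\s+/, $notice) {
        for my $rule (@rules) {
            next unless $sentence =~ $rule->{re};
            my %captured = %+;    # copy named captures before the next match
            for my $field (keys %captured) {
                # Keep the value from the heaviest rule seen so far.
                next if exists $best{$field} && $best{$field} >= $rule->{weight};
                $best{$field}  = $rule->{weight};
                $value{$field} = $captured{$field};
            }
        }
    }
    return \%value;
}

my $data = extract_fields(
    'Jon Smith died on 2006-02-11. He was aged 82 after 60 years of marriage.'
);
print "$_ = $data->{$_}\n" for sort keys %$data;
```

In the sample notice both age rules fire, and the weighting is what keeps "60 years of marriage" from overriding "aged 82". In a real run the weights would have to be tuned iteratively against the corpus.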

After you've extracted the data, you need a model for it and some algorithms to derive further data from what you have. At least in the prototype stage, I would consider formulating the rules for missing data as Prolog code. This will be slow, but you can link to a real Prolog interpreter if the Prolog approach proves fruitful, or recode your rules in Perl if you don't want Prolog (untested):

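% Facts extracted from a notice (date of death as YYYYMMDD).
died(smith_jon, 20060211).
age(smith_jon, 82).

% Proper date and timespan conversion is still missing here!
% This only works on whole years, so the result can be off by one.
year_born(X, Year) :-
    died(X, Date),
    age(X, Age),
    Year is Date // 10000 - Age.

% ?- year_born(smith_jon, Year).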

From that rule (with the appropriate sprinkling of date and timespan conversion applied), you can in principle infer year_born, died or age in Prolog, provided the other two are available and the appropriate data type conversions are possible. Note that plain Prolog arithmetic (is/2) evaluates in one direction only, so you either write one rule per direction or use a constraint library such as CLP(FD). If you've come up with the rules but Prolog doesn't have the appropriate type conversion, you'll need to hardcode the rules in Perl.
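
If you do end up hardcoding the rules in Perl, a minimal sketch could look like the following. The field names (year_born, year_died, age) and the rule table layout are my own assumptions, and it only deals in whole years, so the date and timespan conversion is still missing and results can be off by one year.

```perl
use strict;
use warnings;

# Each rule derives one field from two others. Years only: real date
# and timespan handling is deliberately left out of this sketch.
my @rules = (
    { derive => 'year_born', needs => [qw(year_died age)],       calc => sub { $_[0] - $_[1] } },
    { derive => 'year_died', needs => [qw(year_born age)],       calc => sub { $_[0] + $_[1] } },
    { derive => 'age',       needs => [qw(year_died year_born)], calc => sub { $_[0] - $_[1] } },
);

sub fill_missing {
    my (%person) = @_;
    my $changed = 1;

    # Repeat until no rule adds anything new.
    while ($changed) {
        $changed = 0;
        for my $rule (@rules) {
            next if defined $person{ $rule->{derive} };
            my @args = @person{ @{ $rule->{needs} } };
            next if grep { !defined } @args;    # an input is still missing
            $person{ $rule->{derive} } = $rule->{calc}->(@args);
            $changed = 1;
        }
    }
    return %person;
}

my %smith = fill_missing(year_died => 2006, age => 82);
print "year_born = $smith{year_born}\n";
```

The loop is a crude forward-chaining pass: with only three rules it is overkill, but it keeps working unchanged when you add rules whose inputs are themselves derived by other rules.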

In reply to Re: Extracting structured data from unstructured text - just how difficult would this be? by Corion
in thread Extracting structured data from unstructured text - just how difficult would this be? by clinton
