
I sent a private message to clinton about a paper on data extraction using templates.
He kindly assumed that I had written it (if only!) and suggested that I post it in the thread.
It was written by a friend of mine, Dr Michael Oakes, together with Dr Chris Paice, both acknowledged experts in automatic summarisation.
The paper outlines the use of templates for extracting specific text. It takes advantage of the technical turns of phrase used in domain-specific papers, such as “the effect of x on y”. This has some similarity to the extraction of name, date of birth, date of death and so on that clinton’s question calls for.
This semantic regularity is captured by contextual patterns, or templates. The templates are compared with the sentences of the text, and whenever a match is found the matched strings become candidates for filling “slots”. Where more than one candidate is found for a slot, the problem is carried over to a second stage, which uses weighting to decide which candidate has the stronger evidence for filling it.
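To make the two stages a bit more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the patterns, the slot names (`agent`, `target`), the weights and the sample sentences are all assumptions of mine, chosen only to illustrate the idea of matching templates and then weighting competing candidates.

```python
import re
from collections import defaultdict

# Hypothetical templates: each pairs a regex (its named groups are the slots)
# with an assumed weight saying how much we trust that pattern.
TEMPLATES = [
    (re.compile(r"\bthe effect of (?P<agent>\w+) on (?P<target>\w+)", re.I), 2.0),
    (re.compile(r"\bthe influence of (?P<agent>\w+) on (?P<target>\w+)", re.I), 1.0),
]


def collect_candidates(sentences):
    """Stage 1: match every template against every sentence.

    Returns {slot: {candidate: accumulated weight}}.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        for pattern, weight in TEMPLATES:
            for match in pattern.finditer(sentence):
                for slot, text in match.groupdict().items():
                    scores[slot][text.lower()] += weight
    return scores


def fill_slots(scores):
    """Stage 2: for each slot, keep the candidate with the most weight."""
    return {slot: max(cands, key=cands.get) for slot, cands in scores.items()}


# Made-up sample text, one sentence per entry.
sentences = [
    "We studied the effect of nitrogen on wheat.",
    "The effect of nitrogen on wheat was large.",
    "The influence of irrigation on barley was also measured.",
]

scores = collect_candidates(sentences)
filled = fill_slots(scores)
print(filled)
```

Here "nitrogen" and "irrigation" both compete for the `agent` slot, and the accumulated weight settles it in favour of "nitrogen". A real system would need far richer patterns and multi-word slot fillers; the single-word `\w+` groups are only there to keep the sketch short.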
This method of extraction gave some favourable results in testing, in part because of the limited vocabulary in the test data. That would, however, restrict its effectiveness in other fields that use a much larger vocabulary.

The Holy Grail
Concept-based abstraction (the concepts in a text may be related to one another based on salience), using automatic template construction across domains, seems to hold great promise for the future.
There has been considerable progress with well-structured technical documents through template extraction, and some progress in extraction from non-technical documents using Artificial Intelligence techniques. However, a lot of the problems come from poor structure within the documents themselves. The limits that at present restrict sentence-extraction techniques need to be overcome using AI techniques, and there now seems to be a movement towards fuzzy clustering for data extraction.

In reply to Re^3: Extracting structured data from unstructured text - just how difficult would this be? by Gavin
in thread Extracting structured data from unstructured text - just how difficult would this be? by clinton
