Re^3: Extracting structured data from unstructured text

I sent a private message to clinton about a paper on data extraction using templating.
He kindly assumed that I had written it (if only!) and suggested that I post it here.
It was written by a friend of mine, Dr Michael Oakes, together with Dr Chris Paice, both acknowledged experts in automatic summarisation.
The paper outlines the use of templates for extracting specific text. It takes advantage of the stock technical turns of phrase used in domain-specific papers, such as “the effect of x on y”. This has some similarity with the extraction of name, date of birth, date of death and so on required in clinton’s question.
This semantic regularity is captured by contextual patterns, or templates. The templates are compared with the sentences of the text, and whenever a match is found the matched text becomes a candidate for filling a “slot”. Where more than one candidate is found for a slot, the problem is carried over to a second stage, which uses weighting to decide which candidates have the stronger evidence for filling it.
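To make the two stages concrete, here is a minimal Python sketch. It is not the method from the paper, only my own illustration under some loud assumptions: the two regular expressions, the slot names `agent` and `target`, the template weights and the sample sentences are all invented, and the "weighting" is nothing more than adding up the weight of every template that proposed a candidate.

```python
import re
from collections import defaultdict

# Hypothetical templates, each paired with an invented weight.
# The named groups are the slots that a match can fill.
TEMPLATES = [
    (re.compile(r"the effect of (?P<agent>[a-z ]+?) on "
                r"(?P<target>[a-z ]+?)(?= was| is|[.,;]|$)", re.I), 2.0),
    (re.compile(r"(?P<agent>[a-z]+) (?:increased|reduced) "
                r"(?P<target>[a-z]+(?: [a-z]+)?)", re.I), 1.0),
]


def extract_candidates(sentences):
    """Stage 1: match every template against every sentence.

    Returns {slot: {candidate_text: accumulated_weight}}.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        for pattern, weight in TEMPLATES:
            for match in pattern.finditer(sentence):
                for slot, text in match.groupdict().items():
                    scores[slot][text.strip().lower()] += weight
    return scores


def fill_slots(scores):
    """Stage 2: for each slot keep the candidate with the most weight."""
    return {slot: max(cands, key=cands.get) for slot, cands in scores.items()}


sentences = [
    "We studied the effect of nitrogen on wheat yield.",
    "Nitrogen increased wheat yield in all plots.",
    "Rainfall reduced wheat yield.",
]
scores = extract_candidates(sentences)
print(fill_slots(scores))
# prints {'agent': 'nitrogen', 'target': 'wheat yield'}
```

Here "nitrogen" and "rainfall" both turn up as candidates for the same slot, and "nitrogen" wins only because more (and heavier) templates proposed it. A real system would need far richer templates and a better-founded weighting scheme than a plain sum.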
This method of extraction gave some favourable results in testing, in part because of the limited vocabulary in use within the test data. That would, however, restrict its effectiveness in other fields that use a much larger vocabulary.

The Holy Grail
Concept-based abstraction (the concepts in a text may be related to one another based on salience), using automatic template construction across domains, seems to hold great promise for the future.
There has been considerable progress in extraction from well-structured technical documents through templates, and some progress in extraction from non-technical documents using Artificial Intelligence techniques. However, a lot of the problems come from the poor structure of the documents themselves. The limits that at present restrict sentence extraction techniques need to be overcome using AI techniques, and there now seems to be a movement towards fuzzy clustering for data extraction.
