in reply to Re: Extracting structured data from unstructured text - just how difficult would this be?
in thread Extracting structured data from unstructured text - just how difficult would this be?

Hmm, and even the examples given for Prolog deal with genealogy. Well, I've always wanted to try my hand at AI - this may be the time.

For interested readers:

Any other good links?

Clint

Re^3: Extracting structured data from unstructured text - just how difficult would this be?
by Gavin (Archbishop) on Feb 21, 2008 at 17:34 UTC
    Along with the AI and Prolog route, you may also want to look at Data Mining and Data Extraction techniques.
    I see that there are some Perl modules on CPAN, but I do not know if any will fit your requirements.

    Update: Another thought would be some sort of NLP templating system, similar to those used in automatic summarisation systems.
Re^3: Extracting structured data from unstructured text - just how difficult would this be?
by Gavin (Archbishop) on Feb 23, 2008 at 12:02 UTC
    I sent a private message to clinton regarding a paper about data extraction using templating.
    He kindly thought that it was written by me (if only!) and suggested that I post it here.
    It was written by a friend of mine, Dr Michael Oakes, together with Dr Chris Paice, both acknowledged experts in Automatic Summarisation.
    This paper outlines the use of templates for extracting specific text, and takes advantage of the technical turns of phrase used in domain-specific papers, such as “the effect of x on y”. This has some similarity with the extraction of name, date of birth, date of death etc. required in clinton’s question.
    This semantic regularity is captured by contextual patterns, or templates. The templates are compared with the sentences of the text, and whenever a match is found the matched strings become candidates to fill the template's “slots”. Where more than one candidate is found for a slot, the problem is carried over to a second stage, which uses weighting to decide which candidates have the stronger evidence for filling that slot.
    This method of extraction gave some favourable results in testing, in part due to the limited vocabulary in use within the test data; this would, however, restrict its effectiveness in other fields that use a much larger vocabulary.
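
    To make the two stages described above a little more concrete, here is a minimal Python sketch. It is not the method from the paper: the regular expressions, the slot names (name, born, died), the weights and the sample text are all assumptions made up for illustration. Stage one matches each template against each sentence and collects slot candidates; stage two sums the weights and keeps the best-supported value for each slot.

```python
import re
from collections import defaultdict

# Hypothetical templates: (pattern with named slots, weight).
# Both the patterns and the weights are illustrative assumptions.
TEMPLATES = [
    (re.compile(r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) was born in (?P<born>\d{4})"), 2.0),
    (re.compile(r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) died in (?P<died>\d{4})"), 2.0),
    # A vaguer phrasing counts as weaker evidence, so it gets a lower weight.
    (re.compile(r"born (?:about|around) (?P<born>\d{4})"), 0.5),
]

# Made-up sample text.
TEXT = (
    "John Smith was born in 1850. "
    "A later census lists him as born about 1851. "
    "John Smith died in 1912."
)


def find_candidates(text):
    """Stage 1: compare every template with every sentence and
    return a list of (slot, value, weight) candidates."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    candidates = []
    for sentence in sentences:
        for pattern, weight in TEMPLATES:
            for match in pattern.finditer(sentence):
                for slot, value in match.groupdict().items():
                    if value is not None:
                        candidates.append((slot, value, weight))
    return candidates


def fill_slots(candidates):
    """Stage 2: sum the weights for each (slot, value) pair and
    keep the value with the highest total for each slot."""
    scores = defaultdict(lambda: defaultdict(float))
    for slot, value, weight in candidates:
        scores[slot][value] += weight
    return {
        slot: max(values.items(), key=lambda item: item[1])[0]
        for slot, values in scores.items()
    }


if __name__ == "__main__":
    print(fill_slots(find_candidates(TEXT)))
```

    With this sample text there are two competing candidates for the born slot, and the weighting prefers the one that came from the more specific template. Real text would need far more templates than this, which is where the limited-vocabulary caveat comes in.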

    The Holy Grail
    Concept-based abstraction (concepts in a text may be related to one another based on salience) using automatic template construction across domains seems to hold great promise for the future.
    There has been considerable progress on well-structured technical documents through template extraction, and some progress on extraction from non-technical documents using Artificial Intelligence techniques; however, a lot of problems come from the poor structure of the documents themselves. The limits that at present restrict sentence extraction techniques need to be overcome using AI techniques, and there now seems to be a movement towards fuzzy clustering for data extraction.