in reply to Extracting structured data from unstructured text - just how difficult would this be?

I see basically two ways to approach the extraction of data from the text:

  1. Text understanding - the program would, by some magic, understand the text and analyze it for subjects, objects, verbs and how they relate to each other. This is a problem that has been tackled by AI projects and, as far as I know, has failed. On the other hand, your problem is restricted to a fairly limited domain - notices about births, anniversaries and deaths - so the words, relations and synonyms might be limited.
  2. Phrase recognition - this is what Perl can do easily, but without a sufficient corpus it's hard to come up with the phrases to extract the information from each notice. The extraction would be an iterative process in which the rules/phrases need to be weighted to resolve conflicts while still maximizing the amount of information extracted. I could imagine that each notice gets split into sentences and the program then prompts you to craft a regular expression to handle the "most common" form of sentence among the remaining notices (maybe commonality is measured by a count of shared words?). A rough sketch of this approach follows after the list.

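To make the phrase-recognition idea a little more concrete, here is a minimal sketch of the kind of hand-crafted rule I have in mind (the sample sentence and field names are invented; a real corpus would need many such patterns, built up iteratively):

    #!/usr/bin/perl
    # Minimal sketch: one hand-crafted pattern per "common" sentence form.
    # Requires perl 5.10+ for named captures; the sample sentence is invented.
    use strict;
    use warnings;

    my @rules = (
        # e.g. "SMITH, Jon, died peacefully on 11 February 2006, aged 82."
        qr/^(?<surname>[A-Z]+),\s+(?<forename>\w+).*?died.*?on\s+
           (?<date>\d{1,2}\s+\w+\s+\d{4}).*?aged\s+(?<age>\d+)/x,
    );

    my $sentence = 'SMITH, Jon, died peacefully on 11 February 2006, aged 82.';

    for my $rule (@rules) {
        if ($sentence =~ $rule) {
            print "$_: $+{$_}\n" for sort keys %+;
            last;
        }
    }

Each pattern that survives testing against the corpus becomes one more extraction rule; the weighting mentioned above decides which rule wins when several match the same sentence.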
After you've extracted the data, you need to come up with a model for all that data and some algorithms to derive other data from what you have. At least in the prototype stage, I would consider formulating the rules for missing data as Prolog code. This will be slow, but you can link to a real Prolog interpreter if the Prolog approach proves fruitful, or recode your rules in Perl if you don't want Prolog. Roughly (untested):

    % facts extracted from a notice
    died(smith_jon, 20060211).   % some magic for date and timespan conversion is missing here!
    age(smith_jon, 82).

    % year of birth = year of death - age (needs the date reduced to a year first)
    year_born(X, Y) :- died(X, D), age(X, A), Y is D - A.

    ?- year_born(smith_jon, Y).

From that rule (with the appropriate sprinkling of date and timespan conversion applied), Prolog can infer any one of year_born, died or age, provided the other two are available and the appropriate data type conversions are possible. If you've come up with the rules but Prolog doesn't have the appropriate type conversions, you'll need to hardcode the rules in Perl.
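If you go the Perl route instead, a hardcoded version of that rule might look something like this (a minimal sketch that only works at year granularity and invents the record layout; the real thing needs proper date handling):

    #!/usr/bin/perl
    # A record may carry any two of year_died, age (at death) and year_born;
    # fill in whichever one is missing from the other two.
    use strict;
    use warnings;

    sub infer_years {
        my ($rec) = @_;
        if (!defined $rec->{year_born} && defined $rec->{year_died} && defined $rec->{age}) {
            $rec->{year_born} = $rec->{year_died} - $rec->{age};
        }
        if (!defined $rec->{age} && defined $rec->{year_died} && defined $rec->{year_born}) {
            $rec->{age} = $rec->{year_died} - $rec->{year_born};
        }
        if (!defined $rec->{year_died} && defined $rec->{year_born} && defined $rec->{age}) {
            $rec->{year_died} = $rec->{year_born} + $rec->{age};
        }
        return $rec;
    }

    my $smith = infer_years({ year_died => 2006, age => 82 });
    print "born around $smith->{year_born}\n";    # born around 1924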

Re^2: Extracting structured data from unstructured text - just how difficult would this be?
by clinton (Priest) on Feb 21, 2008 at 16:28 UTC
      Along with the AI and Prolog route, you may also want to look at Data Mining and Data Extraction techniques.
      I see that there are some Perl modules on CPAN but do not know if any will fit your requirements.

      Update: Another thought would be some sort of NLP templating system, similar to those used in automatic summarisation systems.
      I sent a private message to clinton regarding a paper about data extraction using templating.
      He kindly thought that it was written by me (if only!) and suggested that I post it here.
      It was written by a friend of mine, Dr Michael Oakes, together with Dr Chris Paice, both acknowledged experts in Automatic Summarisation.
      The paper outlines the use of templates for extracting specific text and takes advantage of the technical turns of phrase used in domain-specific papers, such as "the effect of x on y". This has some similarity with the extraction of name, date of birth, date of death, etc. required in clinton's question.
      This semantic regularity is captured by contextual patterns, or templates. The templates are compared with the text's sentences, and whenever a match is found the matching words become candidates for the "slots" in the template. Where more than one possible match is found, the ambiguity is carried over to a second stage, which uses weighting so that some candidate slot fillers have stronger evidence than others.
      This method of extraction gave some favourable results in testing, in part due to the limited vocabulary used within the test data; that would, however, restrict its effectiveness in other fields with a much larger vocabulary.
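      To illustrate the template/slot idea in Perl (the templates, weights and sample sentence below are invented for illustration, not taken from the paper):

          use strict;
          use warnings;

          # Contextual templates: a pattern with named slots plus a weight used
          # when more than one template matches the same sentence.
          my @templates = (
              { re => qr/(?<name>\w+) passed away on (?<date>[\w ]+), aged (?<age>\d+)/, weight => 0.9 },
              { re => qr/(?<name>\w+) died on (?<date>[\w ]+)/,                          weight => 0.5 },
          );

          sub fill_slots {
              my ($sentence) = @_;
              my @candidates;
              for my $t (@templates) {
                  push @candidates, { weight => $t->{weight}, slots => { %+ } }
                      if $sentence =~ $t->{re};
              }
              # Second stage: keep the candidate with the strongest evidence.
              my ($best) = sort { $b->{weight} <=> $a->{weight} } @candidates;
              return $best ? $best->{slots} : undef;
          }

          my $slots = fill_slots('Smith passed away on 11 February 2006, aged 82');
          print "$_ = $slots->{$_}\n" for sort keys %$slots;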

      The Holy Grail
      Concept-based abstraction (where the concepts in a text may be related to one another based on salience), using automatic template construction across domains, seems to hold great promise for the future.
      There has been considerable progress on well-structured technical documents through template extraction, and some progress on extraction from non-technical documents using Artificial Intelligence techniques; however, a lot of the problems come from the poor structure of the documents themselves. The limits that currently restrict sentence-extraction techniques need to be overcome using AI techniques, and there now seems to be a movement towards fuzzy clustering for data extraction.