in reply to Extracting structured data from unstructured text - just how difficult would this be?
I see basically two ways to approach the extraction of data from the text:
After you've extracted the data, you need a model for all of it and some algorithms to derive further data from what you have. At least in the prototype stage, I would consider formulating the rules for missing data as Prolog code. This will be slow, but you can link to a real Prolog interpreter if the Prolog approach proves fruitful, or recode your rules in Perl if you don't want Prolog (untested):
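died(smith_jon, 20060211).
age(smith_jon, 82).
% Some magic for date and timespan conversion is still missing here!
% This only takes the year part of YYYYMMDD and ignores the birthday.
year_born(X, Year) :- died(X, D), age(X, A), Year is D // 10000 - A.
?- year_born(smith_jon, Year).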
From that rule (with the appropriate sprinkling of date and timespan conversion applied), you can in principle infer any one of year_born, died or age in Prolog, provided the other two are available and the appropriate data type conversions are possible. Note that plain Prolog arithmetic evaluates in one direction only, so in practice this means either one rule per unknown or a constraint library. If you've come up with the rules but Prolog doesn't have the appropriate type conversion, you'll need to hardcode the rules in Perl.
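For the "hardcode the rules in Perl" route, here is a minimal sketch of what one such rule could look like. The helper name infer_missing and the field names year_born, year_died and age are made up for illustration, and the date handling is deliberately crude: it works on whole years only, so the result can be off by one depending on where the birthday falls relative to the date of death.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): given any two of
# year_born, year_died and age, fill in the third.
# Whole years only, so results may be off by one.
sub infer_missing {
    my (%p) = @_;
    my ($born, $died, $age) = @p{qw(year_born year_died age)};

    if    (!defined $born && defined $died && defined $age) {
        $born = $died - $age;
    }
    elsif (!defined $died && defined $born && defined $age) {
        $died = $born + $age;
    }
    elsif (!defined $age && defined $born && defined $died) {
        $age = $died - $born;
    }

    # Fields that could not be inferred stay undef.
    return { year_born => $born, year_died => $died, age => $age };
}

# The date of death is stored as YYYYMMDD; take only the year part.
my $died_on = '20060211';
my $person  = infer_missing(
    year_died => substr($died_on, 0, 4),
    age       => 82,
);

print "year_born: $person->{year_born}\n";
```

Each further rule would need its own branch or sub, which is where the Prolog formulation is more compact. For real date arithmetic you would replace the substr trick with a proper date module.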