in reply to Re: Finding dates in unstructured text
> If not, then I would start by training a Bayesian classifier, e.g. Algorithm::NaiveBayes, to find the bits of text, and then using them as examples to write regular expressions from.

Actually, this would be a bad idea for at least two reasons. First, you have to segment the text before you can decide whether each segment is a date, and segmenting is most of the problem. Second, you need labeled data to train a classifier. A better approach would be to look through your data by hand and generalize to create a set of regular expressions (or, more generally, date-identifying functions). Once you have some of these, run them on more of your data and refine them to include dates that they missed and to exclude non-dates that they picked up. Keep doing this until you get the performance you need.
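To make the refine loop concrete, here is a minimal sketch in Python (the same idea carries over directly to Perl regexes). Everything in it is an assumption for illustration: the three starter patterns, the names `DATE_PATTERNS`, `find_dates` and `score`, and the idea of checking against a small hand-labeled sample are mine, not something from the original question.

```python
import re

# Hypothetical starter patterns, generalized from reading a sample by hand.
# They are deliberately incomplete; the point is to refine them iteratively.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
DATE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),          # e.g. 2007-04-01
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),    # e.g. 3/15/2007
    re.compile(r"\b" + MONTH + r" \d{1,2}, \d{4}\b"),  # e.g. March 22, 2007
]

def find_dates(text):
    """Return sorted (start, end, matched_text) tuples for every pattern hit."""
    hits = set()
    for pattern in DATE_PATTERNS:
        for m in pattern.finditer(text):
            hits.add((m.start(), m.end(), m.group()))
    return sorted(hits)

def score(text, expected):
    """Compare hits against hand-identified dates.

    Returns (missed, spurious): dates the patterns failed to find, and
    strings they matched that are not dates. Both lists tell you which
    pattern to add or tighten on the next pass.
    """
    found = {hit[2] for hit in find_dates(text)}
    expected = set(expected)
    return sorted(expected - found), sorted(found - expected)

if __name__ == "__main__":
    sample = ("Moved from 3/15/2007 to March 22, 2007; shipped 2007-04-01. "
              "Part 99/99/99 arrived 5 April 2007.")
    by_hand = ["3/15/2007", "March 22, 2007", "2007-04-01", "5 April 2007"]
    missed, spurious = score(sample, by_hand)
    print("missed:", missed)
    print("spurious:", spurious)
```

Each pass, the missed list suggests a new pattern (here, a day-first form) and the spurious list suggests a tighter one (here, range checks on month and day). Note that the hand-checked sample only needs to be large enough to expose mistakes, which is far less work than labeling a training set.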