in reply to Finding dates in unstructured text

Care to be more specific on what looks like a date?

20110112 ?

jan twelve ?

12 Janvier ?

XII 1 MMXI ?

OK, the last couple were a bit silly, but they illustrate the point. If you have a clear idea of what a date looks like, then a series of regular expressions is probably the way to go.
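
To make "a series of regular expressions" concrete, here is a minimal sketch in Python (the patterns carry over to Perl regexes almost unchanged). The names `MONTHS`, `DATE_PATTERNS` and `find_dates` are made up for illustration, and the three shapes covered (compact `YYYYMMDD`, day plus month name, month name plus day) are only an assumption about what your data contains.

```python
import re

# Month names and abbreviations; "janvier" is included only to cover the
# "12 Janvier" example. Extend this for whatever languages your text uses.
MONTHS = (
    r"(?:jan(?:uary|vier)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
)
DAY = r"(?:0?[1-9]|[12]\d|3[01])"

# One pattern per date shape; each is a guess at what a date "looks like".
DATE_PATTERNS = [
    # Compact form: 20110112
    re.compile(r"\b(?:19|20)\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])\b"),
    # Day then month name: 12 Janvier
    re.compile(r"\b" + DAY + r"\s+" + MONTHS + r"\b", re.IGNORECASE),
    # Month name then numeric day: Jan 12
    re.compile(r"\b" + MONTHS + r"\s+" + DAY + r"\b", re.IGNORECASE),
]

def find_dates(text):
    """Return (offset, matched_text) for every pattern hit, sorted by offset."""
    hits = []
    for pattern in DATE_PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(0)))
    return sorted(hits)
```

Note that "jan twelve" slips through, and something like "you may 1 day" would be picked up by mistake, which is exactly why you need to decide up front what counts as a date.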

If not, then I would start by training a Bayesian classifier, e.g. Algorithm::NaiveBayes, to find the bits of text, and then use those as examples from which to write regular expressions.

Re^2: Finding dates in unstructured text
by educated_foo (Vicar) on Jan 12, 2011 at 23:58 UTC
    If not, then I would start by training a Bayesian classifier, e.g. Algorithm::NaiveBayes, to find the bits of text, and then use those as examples from which to write regular expressions.
    Actually, this would be a bad idea for at least two reasons: First, you have to segment the text before determining whether or not the segments are dates. Second, you have to have labeled data to train a classifier. A better approach would be to look through your data by hand and generalize to create a set of regular expressions (or, more generally, date-identifying functions). Once you have some of these, run them on more of your data, and refine them to include dates that they missed, and to exclude non-dates that they picked up. Keep doing this until you get the performance you need.
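
    The run-and-refine loop described above could be supported by a small helper like the following Python sketch. The function name `evaluate`, the sample texts, and the idea of keeping a small hand-checked sample as (text, expected dates) pairs are all assumptions for illustration, not something from the original posts.

```python
import re

def evaluate(patterns, checked_sample):
    """Compare pattern hits against a hand-checked sample.

    checked_sample is a list of (text, expected) pairs, where expected is the
    set of date strings a human found in that text. Returns two lists:
    dates the patterns missed, and non-dates they picked up.
    """
    missed, spurious = [], []
    for text, expected in checked_sample:
        found = {m.group(0) for p in patterns for m in p.finditer(text)}
        missed.extend(sorted(expected - found))
        spurious.extend(sorted(found - expected))
    return missed, spurious

# A deliberately naive first attempt: anything shaped like NNNN-NN-NN.
patterns = [re.compile(r"\b\d{4}-\d{2}-\d{2}\b")]
sample = [
    ("Due 2011-01-12, ref 1234-56-78.", {"2011-01-12"}),
    ("Posted on 12/01/2011.", {"12/01/2011"}),
]
missed, spurious = evaluate(patterns, sample)
```

    Here `missed` shows which formats still need a pattern (the slash-separated date) and `spurious` shows which patterns need tightening (the reference number). Adjust the patterns, rerun, and stop when both lists are short enough for your purposes.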