
The reason a human can read this is because a human has context that a computer does not. For example, that a name was expected where "M a r y" appeared and that "Mary" is an acceptable name.

This is where I was thinking about the filters. These notices come from newspapers, so 's e p a r a t e d' letters are a common form. A filter could look for this and recombine the letters, with a high probability of getting it right.
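
To make the filter idea concrete, here is a minimal sketch in Python. The function name and the three-letter minimum are my own assumptions, not anything from the notices themselves. It only handles letters separated by single spaces, and it will wrongly merge two spaced-out words that sit next to each other with only one space between them.

```python
import re

# Three or more single letters, each separated by exactly one space,
# e.g. "M a r y". The three-letter minimum is an assumed threshold that
# avoids joining ordinary short sequences such as "a I".
SPACED = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")

def join_spaced_letters(text):
    """Collapse runs like 's e p a r a t e d' into 'separated'."""
    return SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)
```

For example, `join_spaced_letters("M a r y SMITH")` gives back `"Mary SMITH"`, while ordinary text such as `"a dog"` is left alone.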

Extracting structure from unstructured text is (for the nonce) considered to be the "gateway to AI" (similar to how chess was in the 1970s). Good luck?

Again, we'd have the shortcut of knowing that the common form is (e.g.) "SURNAME firstname" or "SURNAME FIRSTNAME" or something like that. Where this appears (possibly combined with a list of stop words), we'd have a good chance of identifying the subject of the announcement.
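
A rough sketch of that shortcut, again in Python. The stop-word list, the patterns and the function name are all hypothetical placeholders; real notices would need a much longer stop list and probably looser patterns.

```python
import re

# Hypothetical stop words: all-caps tokens that open a notice
# but are not surnames.
STOP_WORDS = {"DEATHS", "BIRTHS", "MARRIAGES", "IN", "MEMORIAM"}

SURNAME = re.compile(r"[A-Z][A-Z'-]+")          # e.g. SMITH, O'BRIEN
FORENAME = re.compile(r"[A-Z][a-z]+|[A-Z]{2,}")  # e.g. Mary or MARY

def find_subject(notice):
    """Return (SURNAME, Forename) for the first matching pair, else None."""
    tokens = re.findall(r"[A-Za-z'-]+", notice)
    # Look at each pair of neighbouring words in turn.
    for first, second in zip(tokens, tokens[1:]):
        if first in STOP_WORDS or second in STOP_WORDS:
            continue
        if SURNAME.fullmatch(first) and FORENAME.fullmatch(second):
            return first, second.capitalize()
    return None
```

With these assumptions, `find_subject("DEATHS SMITH Mary, beloved wife of John")` returns `("SMITH", "Mary")`, because the heading word is skipped by the stop list.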

And I was thinking that by matching (e.g.) words beginning with a capital letter against lists of place or person names, we could get at least some of the way to extracting the other data.
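
The list-matching step might look something like the following Python sketch. The two small sets are made-up stand-ins for real place and name lists, and the labels are my own. A word that appears in both lists (York can be a surname as well as a city) is simply tagged with whichever list is checked first, which is exactly the sort of case where more context would be needed.

```python
import re

# Tiny hypothetical lookup lists; real ones would come from a
# gazetteer and from lists of personal names.
PLACES = {"Leeds", "York", "Bradford"}
FORENAMES = {"Mary", "John", "William"}

def tag_capitalised(text):
    """Label each capitalised word as 'place', 'person' or 'unknown'."""
    tags = []
    for word in re.findall(r"\b[A-Z][a-z]+\b", text):
        if word in PLACES:
            tags.append((word, "place"))
        elif word in FORENAMES:
            tags.append((word, "person"))
        else:
            tags.append((word, "unknown"))
    return tags
```

Words tagged `unknown` would be the ones to review by hand or to feed back into the lists.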

This appears relatively simple, but that may just be my lack of experience in this field:)

Re^3: Extracting structured data from unstructured text - just how difficult would this be?
by dragonchild (Archbishop) on Feb 21, 2008 at 16:21 UTC
    This appears relatively simple, but that may just be my lack of experience in this field:)

    Try it. A lot of amazing advances have been made by people who didn't know it was impossible to do what they just did. My suspicion is that you're going to find that providing sufficient context is going to be NP-hard. But don't listen to me. Seriously.


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?