clinton has asked for the wisdom of the Perl Monks concerning the following question:

We have millions of entries like the following:
SMITH. - John. 'Beloved husband of M a r y, dearly loved Dad of Jack, Jill and Jane, much loved Grandad of May, Alf-red and Elijah. Passed away on February 11, 2006, aged 82 years. Funeral service to be held at Wickley Crematorium Bondis Hill Chapel on Thursday, February 28, at 2.15pm. Flowers may be sent care of J.B. Smith and Sons Ltd, 93 South Park Road, Iliffe. SX1 2PY.

We would like to parse this text and derive some semantic meaning from the names, dates and places: for instance, the name of the deceased, the names and relationships of the relatives, the date of death, the age, and the place and time of the funeral.

As a side benefit, it would be good to tidy up the text, eg:

SMITH. - John. 'Beloved  =>  John Smith. Beloved
M a r y                  =>  Mary
Alf-red                  =>  Alfred

Not all of these records follow this same format. For instance, some of them may begin with "Look who's 60!"

I can imagine a series of filters that deal with known formatting errors, followed by some keyword (and key phrase) extraction, and matching against dictionaries, lists of people's names and place names.
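Something like the following minimal sketch is what I have in mind - a chain of cleanup filters applied in order. The patterns here are illustrative guesses, not a tested rule set:

    use strict;
    use warnings;

    # Each filter is one code ref that repairs a known formatting error.
    # The patterns below are illustrative guesses, not a tested rule set.
    my @filters = (
        # "Alf-red" -> "Alfred": drop stray hyphens inside lowercase runs
        # (crude - this would also join genuine hyphens like "mother-in-law")
        sub { $_[0] =~ s/([a-z])-([a-z])/$1$2/g },
        # "SMITH. - John." -> "John Smith.": normalise the headline form
        sub { $_[0] =~ s/^([A-Z]+)\.\s*-\s*(\w+)\./\u\L$2 \u\L$1./ },
    );

    my $text = q{SMITH. - John. 'Beloved husband of Mary, Grandad of Alf-red.};
    $_->($text) for @filters;
    print $text, "\n";   # John Smith. 'Beloved husband of Mary, Grandad of Alfred.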

For a human this is easily achievable, but how big/doable a task would it be for a programmer to automate? How long would a good developer need to produce a semi-reliable system to perform this task, if it can be done at all? Would it be feasible to run this process online (as in, when the notice is created), or would it be too time-consuming?

thanks

Clint


Re: Extracting structured data from unstructured text - just how difficult would this be?
by moklevat (Priest) on Feb 21, 2008 at 16:18 UTC
    In response to your question, I'm going to say "quite difficult", or at least very time-consuming. On the other hand, if the point is to get work done, then I think Amazon has already created the system you are looking for with the Mechanical Turk.
      That may just be a brilliant solution - good thinking, Batman!

      The only downside is that we would have to verify their work, which might be almost as time-consuming.

        Perhaps you could set your system up to have duplicate data entry, and then diff the duplicate entries to flag potential problems. Alternatively, you could set up a second Turk task to compare and verify entries.
Re: Extracting structured data from unstructured text - just how difficult would this be?
by Corion (Patriarch) on Feb 21, 2008 at 16:08 UTC

    I see basically two ways to approach the extraction of data from the text:

    1. Text understanding - the program would by some magic understand the text and analyze it for subject, object and verbs, and for how they relate to each other. This is a problem that AI projects have tackled and, as far as I know, not solved. On the other hand, your problem is restricted to a fairly limited domain - notices about births, anniversaries and deaths - so the words, relations and synonyms might be limited.
    2. Phrase recognition - this is what Perl can easily do, but without a sufficient corpus it's hard to come up with the phrases from which to extract the information in each notice. The extraction would be an iterative process, where the rules/phrases will need to be weighted to exclude conflicts while still maximizing the amount of information extracted. I could imagine that each notice gets split into sentences, and the program then prompts you to craft a regular expression to handle the most common form of sentence among the remaining notices (perhaps measuring commonality by the count of shared words?). A minimal sketch of such a rule table follows this list.
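
    Here is that sketch - hand-crafted phrase rules you would grow iteratively as unmatched sentences turn up (the patterns are illustrations only):

        use strict;
        use warnings;

        # Hand-crafted phrase rules, grown iteratively as unmatched
        # sentences are reviewed. The patterns are illustrations only.
        my @rules = (
            [ died_on => qr/Passed away on (\w+ \d{1,2}, \d{4})/ ],
            [ age     => qr/aged (\d+) years/                    ],
            [ funeral => qr/Funeral service to be held at ([^.]+?) on/ ],
        );

        sub extract {
            my ($notice) = @_;
            my %record;
            for my $rule (@rules) {
                my ($field, $re) = @$rule;
                $record{$field} = $1 if $notice =~ $re;
            }
            return \%record;   # whatever the rules managed to pick out
        }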

    After you've extracted the data, you need to come up with a model for all that data, plus some algorithms to derive other data from the data you have. At least in the prototype stage, I would consider formulating the rules for missing data as Prolog code. This will be slow, but you can link to a real Prolog interpreter if the Prolog approach proves fruitful, or recode your rules in Perl if you don't want Prolog (untested):

    died(smith_jon, 20060211).
    age(smith_jon, 82).
    % Some magic for date and timespan conversion is missing here!
    % (crude stand-in: the year is the YYYYMMDD integer divided by 10000)
    year_born(X, Y) :- died(X, D), age(X, A), Y is D // 10000 - A.

    ?- year_born(smith_jon, Y).
    % Y = 1924

    From those rules (with the appropriate sprinkling of date and timespan conversion applied), you can infer (in Prolog) year_born, died or age, provided the other two are available and the appropriate data type conversions are possible in Prolog. If you've come up with the rules but Prolog doesn't have the appropriate type conversion, you'll need to hardcode the rules in Perl.
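
    If it does come to hardcoding, the same inference might look like this in Perl - a sketch only, with invented field names:

        use strict;
        use warnings;

        # Fill in whichever of year_born / year_died / age is missing,
        # given the other two (give or take a year for the birthday).
        sub fill_in {
            my ($r) = @_;
            if (!defined $r->{year_born} && defined $r->{year_died} && defined $r->{age}) {
                $r->{year_born} = $r->{year_died} - $r->{age};
            }
            elsif (!defined $r->{year_died} && defined $r->{year_born} && defined $r->{age}) {
                $r->{year_died} = $r->{year_born} + $r->{age};
            }
            elsif (!defined $r->{age} && defined $r->{year_born} && defined $r->{year_died}) {
                $r->{age} = $r->{year_died} - $r->{year_born};
            }
            return $r;
        }

        my $rec = fill_in({ year_died => 2006, age => 82 });
        print "born about $rec->{year_born}\n";   # born about 1924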

        Along with the AI and Prolog routes, you may also want to look at data mining and data extraction techniques.
        I see that there are some Perl modules on CPAN, but I do not know if any will fit your requirements.

        Update: Another thought would be some sort of NLP templating system, similar to those used in automatic summarisation systems.
        I sent a private message to clinton regarding a paper about data extraction using templating.
        He kindly assumed that it was written by me (if only!) and suggested that I post it here.
        It was written by a friend of mine, Dr Michael Oakes, together with Dr Chris Paice, both acknowledged experts in automatic summarisation.
        The paper outlines the use of templates for extracting specific text, taking advantage of the stock turns of phrase used in domain-specific papers, such as "the effect of x on y"; this has some similarity to the extraction of name, date of birth, date of death, etc. required by clinton's question.
        This semantic regularity is captured by contextual patterns, or templates. The templates are compared with the text's sentences, and wherever one matches it becomes a candidate filler for a "slot". Where more than one possible match is found, the problem is carried over to a second stage, which uses weighting to decide which candidates provide the stronger evidence for filling each slot.
        This method of extraction gave favourable results in testing, partly because of the limited vocabulary of the test data; that, however, would restrict its effectiveness in fields with a much larger vocabulary.
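
        As a toy illustration of the slot idea in Perl (the contextual patterns here are mine, not from the paper):

            use strict;
            use warnings;   # named captures need perl 5.10

            # Contextual templates whose named captures are the "slots".
            my @templates = (
                qr/(?<surname>[A-Z]+)\.\s*-\s*(?<forename>\w+)\./,
                qr/beloved (?<relation>husband|wife) of (?<spouse>\w+)/i,
            );

            my $notice = q{SMITH. - John. 'Beloved husband of Mary, dearly loved Dad of Jack.};
            my %slots;
            for my $t (@templates) {
                if ($notice =~ $t) {
                    $slots{$_} //= $+{$_} for keys %+;   # first filler wins
                }
            }
            printf "%-8s => %s\n", $_, $slots{$_} for sort keys %slots;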

        The Holy Grail
        Concept-based abstraction (where parts of a text are related to one another based on salience), using automatic template construction across domains, seems to hold great promise for the future.
        There has been considerable progress on well-structured technical documents through template extraction, and some progress on non-technical documents using Artificial Intelligence techniques; however, many of the problems come from the poor structure of the documents themselves. The limits that currently restrict sentence-extraction techniques need to be overcome with AI techniques, and there now seems to be a movement towards fuzzy clustering for data extraction.
Re: Extracting structured data from unstructured text - just how difficult would this be?
by Limbic~Region (Chancellor) on Feb 21, 2008 at 16:44 UTC
    clinton,
    So here is me talking out my @$$

    Typically, what I would suggest is the prospector method. You pass the text through a series of successively finer filters until you find the nuggets you are looking for. Things that fall out, or records that yield no gold, are examined by hand to determine why; either existing filters are modified or new ones are added. This can greatly reduce the amount of manual work.
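
    In Perl terms, the cascade might look something like this (the filters are toy examples):

        use strict;
        use warnings;

        # Toy filters: each either yields a nugget or nothing at all.
        my @filters = (
            sub { $_[0] =~ /Passed away on (\w+ \d{1,2}, \d{4})/ ? (died => $1) : () },
            sub { $_[0] =~ /aged (\d+) years/                    ? (age  => $1) : () },
        );

        my (@nuggets, @fell_out);
        while (my $record = <DATA>) {
            chomp $record;
            my %gold = map { $_->($record) } @filters;
            %gold ? push @nuggets, \%gold : push @fell_out, $record;
        }
        print scalar @fell_out, " record(s) need a human (or a new filter)\n";

        __DATA__
        SMITH. - John. Passed away on February 11, 2006, aged 82 years.
        Look who's 60!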

    The trouble with using this approach for this task is that even if it finds nuggets, you have to be sure you haven't been handed Fool's Gold. This is where Bayesian and expert systems may help. Suppose you have more than one series of filters and/or multiple dispositions (gold, pyrite, dirt, your wedding band, etc.). By "ranking" the output, you can "teach" the system to recognize which filters and weights work best for which types of text.
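
    For the Bayesian part, Algorithm::NaiveBayes from CPAN might be a starting point - a sketch, with made-up training labels:

        use strict;
        use warnings;
        use Algorithm::NaiveBayes;   # from CPAN

        my $nb = Algorithm::NaiveBayes->new;
        sub words { my %w; $w{lc $_}++ for split /\W+/, shift; \%w }

        # Made-up training data: teach it which disposition text belongs to.
        $nb->add_instance(attributes => words('Passed away on February 11, aged 82 years'),
                          label      => 'death_notice');
        $nb->add_instance(attributes => words('Look who is 60! Happy birthday'),
                          label      => 'birthday');
        $nb->train;

        my $scores = $nb->predict(attributes => words('passed away aged 90 years'));
        # $scores is a hashref of label => weight; rank dispositions by it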

    Of course, you will never achieve 100% accuracy but I don't think you want to. Way over my head though.

    Cheers - L~R

      I was thinking about something along exactly these lines, so we may just be two talking @$$'$

      What'd be interesting is trying to look for "contextual words": does May refer to the month or to the daughter? Is London a place, or part of Jack London? It would be impossible to predict all of these ambiguities up front, so the "training" makes a lot of sense to me.

      Of course, you will never achieve 100% accuracy but I don't think you want to.

      Absolutely correct - we don't depend on this data, it just adds value when we can extract it.

      thanks for the input

      Clint

Re: Extracting structured data from unstructured text - just how difficult would this be?
by dragonchild (Archbishop) on Feb 21, 2008 at 16:00 UTC
    The reason a human can read this is because a human has context that a computer does not. For example, that a name was expected where "M a r y" appeared and that "Mary" is an acceptable name. I suspect that someone from China who learned English in a class would have a lot of problems reading that text.

    Extracting structure from unstructured text is (for the nonce) considered to be the "gateway to AI" (similar to how chess was in the 1970s). Good luck?


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      The reason a human can read this is because a human has context that a computer does not. For example, that a name was expected where "M a r y" appeared and that "Mary" is an acceptable name.

      This is where I was thinking about the filters. These notices come from newspapers, so 's e p a r a t e d' letters are a common artifact - a filter could look for this and recombine the letters, with a high probability of getting it right.
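
      A crude stab at that filter - the heuristic (three or more single letters with single spaces between them) is just a guess:

          # Recombine "s e p a r a t e d" letters: runs of three or more
          # single letters separated by single spaces are probably one word.
          sub unspace {
              my ($text) = @_;
              $text =~ s{\b((?:[A-Za-z] ){2,}[A-Za-z])\b}{ (my $w = $1) =~ tr/ //d; $w }ge;
              return $text;
          }
          print unspace('Beloved husband of M a r y, dearly loved');
          # Beloved husband of Mary, dearly loved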

      Extracting structure from unstructured text is (for the nonce) considered to be the "gateway to AI" (similar to how chess was in the 1970s). Good luck?

      Again, we'd have the shortcut of knowing that the common form is (eg) "SURNAME firstname" or "SURNAME FIRSTNAME" or something like that, so where this appears (possibly combined with a list of stop words), we'd have a good chance of identifying the subject of the announcement.

      And I was thinking that by matching (eg) words beginning with a capital letter against lists of place or person names, we could get at least some of the way towards extracting the other data.
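
      Roughly like this, assuming flat files of known first names and places (the file names are invented):

          use strict;
          use warnings;

          sub load_list {
              open my $fh, '<', $_[0] or die "$_[0]: $!";
              return map { chomp; ($_ => 1) } <$fh>;
          }

          # The list files are hypothetical - one word per line.
          my %is_name  = load_list('firstnames.txt');
          my %is_place = load_list('placenames.txt');

          my $notice = q{much loved Grandad of May. Funeral at Iliffe.};
          for my $word ($notice =~ /\b([A-Z][a-z]+)\b/g) {
              print "$word looks like a first name\n" if $is_name{$word};
              print "$word looks like a place\n"      if $is_place{$word};
              # "May" could be a name or a month - that needs context
          }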

      This appears relatively simple, but that may just be my lack of experience in this field :)

        This appears relatively simple, but that may just be my lack of experience in this field :)

        Try it. A lot of amazing advances have been made by people who didn't know that what they just did was impossible. My suspicion is that you're going to find that providing sufficient context is going to be NP-hard. But don't listen to me. Seriously.


Re: Extracting structured data from unstructured text - just how difficult would this be?
by John M. Dlugosz (Monsignor) on Feb 22, 2008 at 00:29 UTC
    Try it. It is part of a routine, not something that must function perfectly on its own. Any time it does help is a win!

    I would identify templates used by various sources. That is, find messages that look alike. Write code to recognize that template, and extract.

    If it doesn't match that template, try to recognize the next one, and so on.
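
    The try-each-template loop itself can be very plain Perl (these templates are only stand-ins):

        # Try each known template in turn; the first match wins.
        # Named captures need perl 5.10; the templates are stand-ins.
        my @templates = (
            { name => 'death_notice', re => qr/^(?<surname>[A-Z]+)\.\s*-\s*(?<forename>\w+)\./ },
            { name => 'birthday',     re => qr/^Look who's (?<age>\d+)!/ },
        );

        sub parse_notice {
            my ($text) = @_;
            for my $t (@templates) {
                return { template => $t->{name}, %+ } if $text =~ $t->{re};
            }
            return;   # nothing matched - queue it for investigation
        }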

    Present the text and the filled-out record to the operator. He can correct the fields, fill in the ones it didn't get, and generally have less work to do than filling out the form by hand from scratch.

    Keep one list of the notices that did not parse at all, for further investigation, and another of those that needed correction, for tweaking the parser.

    I've worked this way when importing data. It didn't have to work in every case, just save me work. Pulling out the ones that worked, and looking again only at those that didn't, turned out to be a good way to identify new patterns.

    —John

Re: Extracting structured data from unstructured text - just how difficult would this be?
by CountZero (Bishop) on Feb 22, 2008 at 07:18 UTC
    Just some ideas:
    • Spam-fighting programs also have to deal with "mangled" input (think of all the creative ways spammers write "viagra"), so they too must bring the different formats down to a single "basic" format (say, all spaces and strange characters removed, everything lowercased) before the detection routines are let loose on it. A sketch of such a normaliser follows this list.
    • Once that has been done, I think a Bayesian filter can put your different messages into categories.
    • These content-based categories will then need to be split up or filtered for the various elements you wish to extract, perhaps by first trying to recognize the general "form" in which the message was written. A template-based approach may be a good idea here.
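
    Here is that normaliser sketch (what counts as a "strange character" is of course a guess):

        # Reduce a message to its "basic" format before classification:
        # lowercase it, strip odd characters, collapse the whitespace.
        sub basic_form {
            my ($text) = @_;
            $text = lc $text;
            $text =~ s/[^a-z0-9\s]/ /g;   # strange characters become spaces
            $text =~ s/\s+/ /g;           # collapse runs of whitespace
            $text =~ s/^ | $//g;          # trim the ends
            return $text;
        }
        # basic_form(q{SMITH. - John. 'Beloved husband of M a r y})
        #   gives "smith john beloved husband of m a r y"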

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James