punkish has asked for the wisdom of the Perl Monks concerning the following question:

nlp monks,

My project involves carbon-based life forms annotating text files with named entities (people, organizations, places, etc.) to create "ground truth" that can then be fed to someone else's programmatic annotator to make it smarter.

The work done thus far (before I joined the project) was using Callisto, a Java annotator created by Mitre Corp. The results were less than satisfying, and besides, Callisto ain't open source.

I have been looking at WordFreak, which, besides having a cool name, is open source.

One problem -- both of the above are Java programs, something I don't know beans about. Although this is not exigent, I would like to write a web-based interface for human annotation of text files. The human expert goes to my application and uploads her text file; the program rips through it and presents the text in one frame, while a popup widget shows the available entity types (customizable, of course). She then selects words, one by one, in the text frame and chooses the applicable entity type in the entity frame. When she is finished, the program generates an XML-ish annotation file. Of course, I would start with Lingua::EN::NamedEntity as the backend.

Ok. So, before I embark on this, any monks aware of this having been done already? Any other thoughts, gotchas, caveats?

--

when small people start casting long shadows, it is time to go to bed

Re: web-and-perl-based Named Entity annotator
by jbert (Priest) on Oct 13, 2006 at 09:32 UTC
    If I understand correctly, you want an app where people can tag individual words in a text.

    To get any decent kind of interactivity (i.e. avoid a page reload on each tag) out of a web app for the usage you describe, you'll have to get all Web 2.0 and do client-side JavaScript, etc.

    If it doesn't *have* to be a web app, then this would be a fairly straightforward Perl/Tk or Perl/Gtk GUI. The only bit which might be hard is choosing the right widget for your large amount of text and managing the word selection.

    And I don't know of any existing tools. You might be able to abuse an HTML editor for this purpose. You could wrap each word in a tag, e.g. <span>, and put the entity type in the 'class' attribute or similar (an 'id' must be unique within a page, so it can't be shared by several words of the same type). If you used <em> instead of <span> you could even see which words were already tagged.

    Post-processing the HTML to your desired format should be fairly straightforward with CPAN's help.

Re: web-and-perl-based Named Entity annotator
by rlucas (Scribe) on Oct 15, 2006 at 16:15 UTC
    Lingua::EN::NamedEntity is pretty raw -- you won't be getting 80% of the way there with that module (more like 55%).

    If you're willing to step outside the Perl paradigm, and it sounds like you are, you should look at GATE (specifically, the "ANNIE" part) and FreeLing. Both are Open Source linguistics packages; GATE is in Java and FreeLing, IIRC, is in C++. Specifically, GATE has almost all of what you're looking for (but you have to slog through the ridiculously verbose Java).

    There are a couple of cool Python linguistics packages available, but the one I liked best was not open source -- a deal killer for the particular project I was working on.

    My caveat is that the harder part of this for me was keeping good track of all of the documents I got (for attribution to primary sources) through all of the cleanup, processing, and similar steps (think: HTML stripping, removal of telegraphic headlines, etc.). Also, if you do have to use Java, it's *so not* a text processing language that you may give up in disgust.
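
    For the bookkeeping side, a minimal sketch of the idea (not what I actually used): keep the source next to the text and log a checksum after every processing step, so any cleaned-up document can be traced back to its primary source. It is in Python for illustration only; TrackedDocument, apply and strip_tags are hypothetical names, and the regex-based tag stripper is deliberately crude.

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class TrackedDocument:
    """A document plus a log of every processing step applied to it."""
    source: str                 # where the text came from (URL, file name, ...)
    text: str
    history: list = field(default_factory=list)  # (step_name, sha256) pairs

    def __post_init__(self):
        # Record the checksum of the text exactly as it was received.
        self._record("original")

    def _record(self, step_name):
        digest = hashlib.sha256(self.text.encode("utf-8")).hexdigest()
        self.history.append((step_name, digest))

    def apply(self, step_name, transform):
        """Run one cleanup step and log the checksum of its result."""
        self.text = transform(self.text)
        self._record(step_name)
        return self


def strip_tags(text):
    """Crude HTML stripping, for illustration only."""
    return re.sub(r"<[^>]+>", "", text)


if __name__ == "__main__":
    doc = TrackedDocument(source="example.html", text="<p>Hello,  world</p>")
    doc.apply("strip_html", strip_tags)
    doc.apply("collapse_whitespace", lambda t: re.sub(r"\s+", " ", t).strip())
    for step, digest in doc.history:
        print(step, digest[:12])
```

    The checksums make it cheap to confirm later which version of a document a given annotation file was produced from.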

    If you're looking for academic papers and the like on this topic, look for "named entity extraction" as a task in both computational linguistics and information extraction. Finally, if you succeed at extracting and annotating your entities, you'll eventually also hit against the "named entity disambiguation" problem, which is a specialized subset of "merge and purge." In the commercial world, the HNC Software (Fair Isaac) guys are big masters of this. There's also been a lot of post-2001 funding on doing this in various squiggly R-L languages, if you know what I mean.
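
    To give a flavor of the "merge and purge" step, here is a deliberately naive sketch that groups mentions whose normalized surface forms match. It is in Python for illustration only; the suffix list and the names normalize and merge_mentions are my own assumptions. Real disambiguation is much harder, since this neither separates different entities that share a name nor joins aliases that look nothing alike.

```python
import re
from collections import defaultdict

# Hypothetical, far-from-complete list of trailing words to ignore.
CORPORATE_SUFFIXES = {"corp", "corporation", "inc", "incorporated", "co", "ltd"}


def normalize(mention):
    """Lowercase, drop punctuation, and strip trailing corporate suffixes."""
    words = re.findall(r"[a-z0-9]+", mention.lower())
    while words and words[-1] in CORPORATE_SUFFIXES:
        words.pop()
    return " ".join(words)


def merge_mentions(mentions):
    """Group mentions that share the same normalized key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    return dict(groups)


if __name__ == "__main__":
    found = ["Mitre Corp.", "MITRE Corporation", "Mitre", "Fair Isaac"]
    for key, variants in merge_mentions(found).items():
        print(key, "<-", variants)
```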

Re: web-and-perl-based Named Entity annotator
by planetscape (Chancellor) on Oct 16, 2006 at 08:34 UTC