in reply to web-and-perl-based Named Entity annotator

Lingua::EN::NamedEntity is pretty raw -- you won't be getting 80% of the way there with that module (more like 55%).

If you're willing to step outside the Perl paradigm, and it sounds like you are, you should look at GATE (specifically, the ANNIE component) and FreeLing. Both are open source linguistics packages, though neither is Perl: GATE is in Java and FreeLing, IIRC, is C++. GATE in particular has almost all of what you're looking for (but you have to slog through the ridiculously verbose Java).

There are a couple of cool Python linguistics packages available, but the one I liked best was not open source -- a deal killer for the particular project I was working on.

My caveat is that the harder part of this for me was keeping good track of all of the documents I got (for attribution to primary sources) through all of the cleanup, processing, and similar steps (think: HTML stripping, removal of telegraphic headlines, etc.). Also, if you do have to use Java, it's *so not* a text processing language that you may give up in disgust.
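To make the bookkeeping point concrete, here is a minimal sketch (in Python, since we're already outside Perl) of carrying provenance along with each document through the cleanup steps. Everything in it is an assumption for illustration: the `Document` class, the step functions, and the example URL are hypothetical, and the regex-based HTML stripping is only a stand-in for a real parser.

```python
import hashlib
import html
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    source_url: str   # where the raw document came from (the primary source)
    text: str         # current text, overwritten by each cleanup step
    raw_sha1: str = ""                            # fingerprint of the untouched original
    history: list = field(default_factory=list)   # names of the steps applied so far

    def __post_init__(self):
        # Record the fingerprint once, before any cleanup has touched the text.
        if not self.raw_sha1:
            self.raw_sha1 = hashlib.sha1(self.text.encode("utf-8")).hexdigest()


def strip_html(text):
    # Crude tag removal, good enough for a sketch; a real HTML parser is safer.
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


def apply_step(doc, step):
    # Run one cleanup step and log its name so the trail back to the source survives.
    doc.text = step(doc.text)
    doc.history.append(step.__name__)
    return doc


# Hypothetical input document and URL.
doc = Document(
    "http://example.com/story.html",
    "<p>Acme Corp. hired <b>Jane Doe</b> &amp; Co.</p>",
)
for step in (strip_html, collapse_whitespace):
    apply_step(doc, step)
```

The idea is simply that the source URL and the hash of the original never change, while `history` grows with every step, so any extracted entity can still be traced back to the document it came from.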

If you're looking for academic papers and the like on this topic, look for "named entity extraction" as a task in both cognitive linguistics and information extraction. Finally, if you succeed at extracting and annotating your entities, you'll eventually also run up against the "named entity disambiguation" problem, which is a specialized subset of "merge and purge." In the commercial world, the HNC Software (Fair Isaac) guys are big masters of this. There's also been a lot of post-2001 funding for doing this in various squiggly R-L languages, if you know what I mean.
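For a feel of what the "merge and purge" side looks like at its very simplest, here is a rough Python sketch that groups surface variants of a name under one normalized key. The suffix list, function names, and sample company names are all made up for illustration. It only catches spelling and punctuation variants; it does nothing for the genuinely hard case where the same name refers to different entities.

```python
import re
from collections import defaultdict

# Hypothetical suffix list; a real system would need a much larger one.
SUFFIXES = {"inc", "corp", "co", "ltd", "llc"}


def normalize(name):
    # Lowercase, drop punctuation, and drop common corporate suffixes.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)


def merge_and_purge(mentions):
    # Group every mention under its normalized key.
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    return dict(groups)


# Made-up entity mentions.
groups = merge_and_purge(
    ["Acme Widgets Inc.", "Acme Widgets", "ACME WIDGETS, Corp.", "Bolt Software"]
)
```

Here the three "Acme Widgets" variants end up in one group and "Bolt Software" in another. Commercial-grade matching goes far beyond this kind of exact-key grouping.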
