Lingua::EN::NamedEntity is pretty raw -- you won't be getting 80% of the way there with that module (more like 55%).

If you're willing to step outside the Perl paradigm, and it sounds like you are, you should look at GATE (specifically, the ANNIE component) and FreeLing. Both are open source linguistics packages; GATE is in Java and FreeLing, IIRC, is C++. GATE in particular has almost all of what you're looking for (but you have to slog through the ridiculously verbose Java).

There are a couple of cool Python linguistics packages available, but the one I liked best was not open source -- a deal killer for the particular project I was working on.

My caveat is that the harder part of this for me was keeping good track of all of the documents I got (for attribution to primary sources) through all of the cleanup, processing, and similar steps (think: HTML stripping, removal of telegraphic headlines, etc.). Also, if you do have to use Java, it's *so not* a text processing language that you may give up in disgust.
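To make the bookkeeping point concrete, here's a minimal sketch (in Python, since we're already outside Perl) of carrying attribution along through each cleanup step. Everything in it is my own illustration, not anything from GATE or FreeLing: the `Document` class, its field names, and the `strip_html` helper are hypothetical, and the HTML stripping is deliberately naive.

```python
import hashlib
from dataclasses import dataclass, field
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(text):
    """Naive HTML stripping: keep text nodes, collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


@dataclass
class Document:
    source_url: str   # primary source, kept for attribution
    text: str         # text after the steps applied so far
    raw_sha1: str = ""  # fingerprint of the untouched original
    history: list = field(default_factory=list)  # step names, in order

    def __post_init__(self):
        # Only fingerprint on first creation; later copies inherit it.
        if not self.raw_sha1:
            self.raw_sha1 = hashlib.sha1(self.text.encode("utf-8")).hexdigest()

    def apply(self, step_name, func):
        """Return a new Document with func applied and the step recorded."""
        return Document(
            self.source_url,
            func(self.text),
            self.raw_sha1,
            self.history + [step_name],
        )


# Hypothetical usage: the URL and markup are made up.
raw = Document("http://example.com/story1", "<p>Hello <b>World</b></p>")
cleaned = raw.apply("strip_html", strip_html)
```

The idea is just that every processed text still knows its source URL, the hash of the original bytes, and which steps touched it, so any entity you extract later can be traced back to a primary source.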

If you're looking for academic papers and the like on this topic, look for "named entity extraction" as a task in both cognitive linguistics and information extraction. Finally, if you succeed at extracting and annotating your entities, you'll eventually also run up against the "named entity disambiguation" problem, which is a specialized subset of "merge and purge." In the commercial world, the HNC Software (Fair Isaac) guys are big masters of this. There's also been a lot of post-2001 funding for doing this in various squiggly R-L languages, if you know what I mean.
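For a feel of what the "merge and purge" side looks like at its crudest, here's a toy sketch in Python: normalize each extracted mention to a key and group mentions that share a key. The suffix list and function names are my own assumptions for illustration; real disambiguation needs context, not just string cleanup, so treat this as the starting point it is.

```python
import re
from collections import defaultdict

# Hypothetical suffix list, for illustration only.
SUFFIXES = {"inc", "corp", "co", "ltd", "llc"}


def normalize(name):
    """Lowercase, drop punctuation, and strip trailing corporate suffixes."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def merge_mentions(mentions):
    """Group surface forms that normalize to the same key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    return dict(groups)


merged = merge_mentions(
    ["Fair Isaac", "Fair, Isaac", "FAIR ISAAC Corp.", "HNC Software"]
)
```

This only catches spelling and punctuation variants. It will happily merge two different people with the same name and miss an alias that shares no words with the canonical form, which is exactly where the hard part of the problem starts.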


In reply to Re: web-and-perl-based Named Entity annotator by rlucas
in thread web-and-perl-based Named Entity annotator by punkish
