So I'm a week into writing a $#@%! load of Perl that identifies people's names on web sites when I realize that I should ask you guys whether someone has done this already and is willing to share code. I know people have done this for applications like identifying people's names in news articles, but my internet searches always ended at proprietary and expensive software.

I'm building a search site for art images. The backend spiders millions of art images on thousands of gallery web sites and then tries to identify which snippet of text on the web page is the artist's name. I'm structuring the data associated with each image, not indexing it, so I have to pull out a first name, middle initial, and last name as three separate fields from all the other text on the page.

So far my code works on a few sites (without templates), but it needs at least another week of testing and tuning to work on a wide variety of sites, because the matching technique is fuzzy and has to be trained on a wide variety of cases.

Has anyone heard of open source software that can shorten the drudgery? Even if the software was built for another application, like news articles, it might be faster to adapt it to my application.

In reply to A Name is a Name is a Name by artyfarty
