artyfarty has asked for the wisdom of the Perl Monks concerning the following question:

So I'm a week into writing a $#@%! load of Perl that identifies people's names on web sites when I realize that I should ask you guys whether someone has done this already and is willing to share code. I know people have done this for applications like identifying people's names in news articles, but my internet searches always ended at proprietary and expensive software.

I'm building a search site for art images. The backend spiders millions of art images on thousands of gallery web sites and then tries to identify which snippet of text on the web page is the artist's name. I'm structuring the data associated with each image, not just indexing it, so I have to pull out a first name, middle initial, and last name as three separate fields from all the other text on the page.

So far my code works on a few sites (without templates), but it needs at least another week of testing and tuning to work on a wide variety of sites, because the matching technique is fuzzy and has to be trained on a wide variety of cases. Has anyone heard of open source software that can shorten the drudgery? Even if the software was built for another application like news articles, it might be faster to tweak it for my application.

Replies are listed 'Best First'.
Re: A Name is a Name is a Name
by Corion (Patriarch) on Sep 30, 2005 at 07:00 UTC

    If you're handling names on the internet, you should look at how (I think) BibTeX handles names and what data structure it sets up to handle them properly. The US convention of "first name, middle initial, last name" does not exist in many other cultures, and there are many variations: people without a middle initial, people with more than one middle name, people with a title of nobility, people with an academic title, and combinations of these.

    If you're storing people's names and don't want to offend them, you'd better get their names right :-)
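
    As a rough illustration of the kind of data structure meant here: BibTeX splits a name into First, von, Last and Jr parts. The sketch below handles only the two comma forms ("von Last, First" and "von Last, Jr, First"); the function name split_name and the hash keys are made up for this example, and the "First von Last" form is deliberately left out.

```perl
use strict;
use warnings;

# Split "von Last, First" or "von Last, Jr, First" into named parts.
# split_name is a hypothetical helper, not part of BibTeX or any CPAN module.
sub split_name {
    my ($name) = @_;
    return unless defined $name && $name =~ /\S/;

    my @chunks = split /\s*,\s*/, $name;
    s/^\s+|\s+$//g for @chunks;

    my ($family, $jr, $first) =
        @chunks == 3 ? @chunks
      : @chunks == 2 ? ($chunks[0], '', $chunks[1])
      :                ($chunks[0], '', '');

    # Leading lowercase words of the family chunk form the "von" part,
    # but at least one word is always kept as the last name.
    my @words = split ' ', $family;
    my @von;
    push @von, shift @words while @words > 1 && $words[0] =~ /^[a-z]/;

    return {
        first => $first,
        von   => join(' ', @von),
        last  => join(' ', @words),
        jr    => $jr,
    };
}

my $n = split_name('van Gogh, Vincent');
print "$n->{first} | $n->{von} | $n->{last}\n";
```

    Keeping the parts separate means a name without a middle initial, or with a particle like "van", does not have to be forced into three fixed fields.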

      Thanks, I'll look at BibTeX. Luckily, I'm only spidering US websites, and there are about five name format conventions in use, all of which are pretty simple to parse. I have dictionaries of about 120K of the most common first and last names in the US, so most of the time I can correctly identify the name just by doing a dictionary match. Where things get ugly is when there's a place name or a second name on the same page that's not the artist's name but matches the dictionary anyway. Then I look at things like the font and the location of the name (is it by itself on a line, in a description of the work, or perhaps in an href?) and decide which of the multiple dictionary matches is the artist's name. The other tricky scenario is where the dictionary match fails, which seems to occur pretty rarely, only about 5% of the time. In that case I look at capitalization and the other factors mentioned previously, since usually only proper nouns have two or more consecutive words with just the first letter capitalized. This fallback seems to work about 30% of the time; the other 70% of the time it picks up a place or gallery name, which can sometimes be a person's name as well. In other applications, like news readers, the state-of-the-art accuracy rate is around 90%, at least for the proprietary software I mentioned in my original post. If I can get my code working at that rate I'll be happy ;-)
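
      A minimal sketch of the two-stage idea described above, dictionary match first and capitalization as the fallback. The tiny %FIRST and %LAST hashes, the name_candidates function and the regex are all assumptions for illustration, not the actual code; font and page-position checks are not modelled.

```perl
use strict;
use warnings;

# Tiny stand-in dictionaries (hypothetical); real ones would hold ~120K names.
my %FIRST = map { $_ => 1 } qw(john mary georgia);
my %LAST  = map { $_ => 1 } qw(smith jones o'keeffe);

# Return every "First [MI] Last" candidate in a piece of text, tagged with
# how it was matched: 'dictionary' or 'capitalization' (the weaker fallback).
sub name_candidates {
    my ($text) = @_;
    my @found;
    # Two capitalized words, optionally separated by a single-letter initial.
    while ($text =~ /\b([A-Z][a-z']+)\s+(?:([A-Z])\.?\s+)?([A-Z][A-Za-z']+)\b/g) {
        my ($first, $mi, $last) = ($1, $2, $3);
        my $how = ($FIRST{lc $first} && $LAST{lc $last})
                ? 'dictionary'
                : 'capitalization';
        push @found, {
            first => $first,
            mi    => defined $mi ? $mi : '',
            last  => $last,
            how   => $how,
        };
    }
    return @found;
}

for my $c (name_candidates('Untitled by John Q. Smith, Santa Fe')) {
    print join(' ', grep { length } @{$c}{qw(first mi last)}), " ($c->{how})\n";
}
```

      In this example "Santa Fe" comes back as a capitalization-only candidate, which is exactly the place-name confusion mentioned above; ranking candidates by match type is one way to prefer the dictionary hit.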
Re: A Name is a Name is a Name
by EvanCarroll (Chaplain) on Sep 30, 2005 at 05:55 UTC
    You're never going to be able to tell that something is a person's name unless the page explicitly tells you in some way that it is a name, even if that is implied by nothing more than formatting. What if the picture was called the "Mona Lisa"? You're going to have to get down and dirty with an HTML parser. Try HTML::TokeParser::Simple; remember that you get access to HTML::TokeParser through it, and I believe HTML::PullParser as well. Block out the tags you know you won't need, which is probably every tag that is not a table tag or an image tag.
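
    A rough sketch of that filtering with HTML::TokeParser::Simple, keeping only text inside table cells and the alt text of images. The candidate_text function and the sample HTML are made up for illustration, and nested tables are not handled.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Collect text found inside <td> cells plus the alt text of <img> tags,
# ignoring everything else on the page. candidate_text is a hypothetical name.
sub candidate_text {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new(string => $html);
    my (@snippets, $in_cell);
    while (my $token = $p->get_token) {
        if ($token->is_start_tag('td')) {
            $in_cell = 1;
        }
        elsif ($token->is_end_tag('td')) {
            $in_cell = 0;
        }
        elsif ($token->is_start_tag('img')) {
            my $alt = $token->get_attr('alt');
            push @snippets, $alt if defined $alt && length $alt;
        }
        elsif ($in_cell && $token->is_text) {
            my $text = $token->as_is;
            $text =~ s/^\s+|\s+$//g;
            push @snippets, $text if length $text;
        }
    }
    return @snippets;
}

my $html = '<table><tr><td><img src="a.jpg" alt="Mona Lisa"></td>'
         . '<td><b>Leonardo da Vinci</b></td></tr></table><p>Contact us</p>';
print "$_\n" for candidate_text($html);
```

    The snippets that survive still have to be classified as title or artist; the parser only narrows down where to look.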


    Evan Carroll
    www.EvanCarroll.com