in reply to A Name is a Name is a Name

If you're handling names on the internet, you should look at how (I think) BibTeX handles names and what data structure it sets up to handle them properly. The US convention of "first name, middle initial, last name" does not exist in other cultures, and there are many variations: people without a middle initial, people with more than one middle name, people with a title of nobility, people with an academic title, and combinations of these.
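
For illustration, here is a rough sketch of the kind of structure I mean. BibTeX splits a name into four parts (First, von, Last, Jr); the `PersonName` class and the `parse_name` helper are my own hypothetical names, and the parsing is heavily simplified (no brace grouping, and it will not match BibTeX on every edge case).

```python
from dataclasses import dataclass


@dataclass
class PersonName:
    """Four-part name record modelled on BibTeX's First / von / Last / Jr split."""
    first: str = ""  # given names, including middle names or initials
    von: str = ""    # lowercase particles such as "van" or "de la"
    last: str = ""   # family name, possibly several words
    jr: str = ""     # suffix such as "Jr." or "III"


def split_von_last(words):
    """Leading lowercase words form the particle; the rest is the last name."""
    i = 0
    # Always leave at least one word for the last name.
    while i < len(words) - 1 and words[i][0].islower():
        i += 1
    return " ".join(words[:i]), " ".join(words[i:])


def parse_name(text):
    """Parse 'First von Last', 'von Last, First' or 'von Last, Jr, First'."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) == 1:
        words = parts[0].split()
        if not words:
            return PersonName()
        # First ends at the first lowercase word, or just before the final word.
        von_start = next(
            (i for i, w in enumerate(words[:-1]) if w[0].islower()),
            len(words) - 1,
        )
        von, last = split_von_last(words[von_start:])
        return PersonName(first=" ".join(words[:von_start]), von=von, last=last)
    von, last = split_von_last(parts[0].split())
    if len(parts) == 2:
        return PersonName(first=parts[1], von=von, last=last)
    return PersonName(first=parts[2], von=von, last=last, jr=parts[1])
```

With this, `parse_name("Ludwig van Beethoven")` keeps "van" out of both the first and last name, and `parse_name("King, Jr., Martin Luther")` stores the suffix separately. Titles would need extra fields; this sketch does not cover them.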

If you're storing people's names and don't want to offend them, you'd better get their names right :-)

Re^2: A Name is a Name is a Name
by artyfarty (Initiate) on Sep 30, 2005 at 17:17 UTC
    Thanks, I'll look at BibTeX. Luckily, I'm only spidering US websites, and there are about five name format conventions in use, all of which are pretty simple to parse. I have dictionaries of about 120K of the most common first and last names in the US, so most of the time I can correctly identify the name just by doing a dictionary match. Where things get ugly is where there's a place name or a second name on the same page that's not the artist's name but matches the dictionary anyway. Then I look at things like font and the location of the name (is it by itself on a line, in a description of the work, or perhaps in an href), and decide which of the multiple dictionary matches is the artist's name. The other tricky scenario is where the dictionary matches fail, which seems to occur pretty rarely, only about 5% of the time. In that case I look at capitalization and the other factors mentioned previously; usually only proper nouns have two or more consecutive words with just the first letter capitalized. This latter matching technique seems to work about 30% of the time; the other 70% of the time it picks up a place or gallery name, which can sometimes be a person's name as well. In other applications, like news readers, the state-of-the-art accuracy rate is around 90% (for the proprietary software I mentioned in my original post). If I can get my code working at that rate I'll be happy ;-)
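
    Roughly, the two-stage matching looks like the sketch below. It is not my actual spider code: the tiny `FIRST_NAMES` and `LAST_NAMES` sets are stand-ins for the real dictionaries, the function names are made up for this post, and the font and page-location checks are left out entirely.

```python
import re

# Hypothetical, tiny stand-ins for the ~120K-entry name dictionaries.
FIRST_NAMES = {"georgia", "ansel", "mary"}
LAST_NAMES = {"o'keeffe", "adams", "cassatt"}

# A word: letters, optionally with apostrophes or hyphens inside.
WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")

# Two or more consecutive words, each with only the first letter capitalized.
CAP_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def dictionary_candidates(line):
    """Adjacent word pairs where the first is a known first name
    and the second is a known last name."""
    words = WORD.findall(line)
    return [
        f"{a} {b}"
        for a, b in zip(words, words[1:])
        if a.lower() in FIRST_NAMES and b.lower() in LAST_NAMES
    ]


def capitalization_candidates(line):
    """Fallback: runs of capitalized words, which may be people or places."""
    return CAP_RUN.findall(line)


def find_names(line):
    """Try the dictionary match first; fall back to capitalization."""
    return dictionary_candidates(line) or capitalization_candidates(line)
```

    `find_names("Photograph by Ansel Adams, 1942")` hits the dictionary, while `find_names("Visit the Santa Fe Gallery today")` falls through to the capitalization rule and returns the gallery name, which is exactly the kind of false positive I described.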