If you're handling names on the internet, you should look at how (I think) BibTeX handles names and what data structure it sets up to handle them properly. The US-American convention of "first name, middle initial, last name" doesn't exist in other cultures, and there are many variations: people without a middle initial, people with more than one middle name, people with a title of nobility, people with an academic title, and combinations of these.
If you're storing people's names and don't want to offend them, you'd better get their names right :-)
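For illustration, here is a minimal Python sketch of the kind of structure meant here. BibTeX splits a name into four parts (First, von, Last, Jr); the `PersonName` class and the `parse_last_first` helper are hypothetical names made up for this example, and the parsing is a simplified approximation of BibTeX's rules that only covers the comma-separated "von Last, First" and "von Last, Jr, First" forms. Titles of nobility or academic titles would need extra fields that are not shown.

```python
from dataclasses import dataclass


@dataclass
class PersonName:
    """Hypothetical container modelled on BibTeX's four name parts."""
    first: str = ""  # given names, including middle names or initials
    von: str = ""    # lowercase particle such as "van" or "de la"
    last: str = ""   # family name, possibly several words
    jr: str = ""     # suffix such as "Jr." or "III"

    def display(self) -> str:
        """Render in "First von Last, Jr" order, skipping empty parts."""
        name = " ".join(p for p in (self.first, self.von, self.last) if p)
        return f"{name}, {self.jr}" if self.jr else name


def parse_last_first(text: str) -> PersonName:
    """Parse "von Last, First" or "von Last, Jr, First" (simplified)."""
    fields = [f.strip() for f in text.split(",")]
    if len(fields) == 2:
        family, first = fields
        jr = ""
    elif len(fields) == 3:
        family, jr, first = fields
    else:
        raise ValueError(f"expected one or two commas in {text!r}")
    words = family.split()
    # Leading lowercase words are treated as the particle; at least one
    # word is always left over for the family name.
    i = 0
    while i < len(words) - 1 and words[i][0].islower():
        i += 1
    return PersonName(first=first, von=" ".join(words[:i]),
                      last=" ".join(words[i:]), jr=jr)


if __name__ == "__main__":
    print(parse_last_first("van Beethoven, Ludwig").display())
    print(parse_last_first("King, Jr., Martin Luther").display())
```

Keeping the parts separate means a name with no middle initial, several given names, or a particle needs no special casing when it is stored or sorted.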
Thanks, I'll look at BibTeX. Luckily, I'm only spidering US websites, and there are about five name format conventions in use, all of which are pretty simple to parse. I have dictionaries of about 120K of the most common first and last names in the US, so most of the time I can correctly identify the name just by doing a dictionary match. Where things get ugly is when there's a place name or a second name on the same page that's not the artist's name but matches the dictionary anyway. Then I look at things like the font and the location of the name (is it by itself on a line, in a description of the work, or perhaps in an href?) and decide which of the multiple dictionary matches is the artist's name. The other tricky scenario is where the dictionary matches fail, which seems to occur pretty rarely, only about 5% of the time. In that case I look at capitalization and the other factors mentioned previously; usually only proper nouns have 2+ consecutive words with just the first letter capitalized. This latter matching technique seems to work about 30% of the time; the other 70% of the time it picks up a place or gallery name, which can sometimes be a person's name as well. In other applications, like news readers, the state-of-the-art accuracy rate is around 90% for the proprietary software I mentioned in my original post. If I can get my code working at that rate I'll be happy ;-)
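A rough Python sketch of the two-stage matching described above, purely for illustration (it is not the actual spider code). The tiny name sets stand in for the 120K-entry dictionaries, the function names are made up, and the font and page-position checks are left out.

```python
import re

# Tiny stand-in dictionaries (assumption); the real ones hold ~120K names.
FIRST_NAMES = {"ansel", "mary", "georgia"}
LAST_NAMES = {"adams", "cassatt"}

# Two or more consecutive words with only the first letter capitalized.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def dictionary_matches(text):
    """Return adjacent word pairs that look like 'First Last'."""
    words = re.findall(r"[A-Za-z']+", text)
    return [
        f"{a} {b}"
        for a, b in zip(words, words[1:])
        if a.lower() in FIRST_NAMES and b.lower() in LAST_NAMES
    ]


def candidate_names(text):
    """Try the dictionaries first, then fall back to capitalization."""
    matches = dictionary_matches(text)
    if matches:
        return matches, "dictionary"
    return CAPITALIZED_RUN.findall(text), "capitalization"


if __name__ == "__main__":
    print(candidate_names("Oil on canvas by Ansel Adams, 1942"))
    print(candidate_names("Shown at Blue Door Gallery by Zelda Quimby"))
```

The second call shows the weakness of the fallback: with no dictionary hit, the gallery name comes back as a candidate alongside the person's name, so the other page cues are still needed to pick between them.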