in reply to Re: A Name is a Name is a Name
in thread A Name is a Name is a Name
Thanks, I'll look at BibTeX. Luckily, I'm only spidering US websites, and there are about five name format conventions in use, all of which are pretty simple to parse. I have dictionaries of about 120K of the most common first and last names in the US, so most of the time I can correctly identify the name just by doing a dictionary match.

Where things get ugly is when there's a place name or a second name on the same page that's not the artist's name but matches the dictionary anyway. Then I look at things like the font and the location of the name (is it by itself on a line, in a description of the work, or perhaps in an href), and decide which of the multiple dictionary matches is the artist's name.

The other tricky scenario is where the dictionary matches fail, which seems to occur pretty rarely, only about 5% of the time. In that case I look at capitalization and the other factors mentioned previously; usually only proper nouns have 2+ consecutive words with just the first letter capitalized. This latter matching technique seems to work about 30% of the time. The other 70% of the time it picks up a place or gallery name, which can sometimes be a person's name as well.

In other applications, like news readers, the state-of-the-art accuracy rate is around 90%, at least for the proprietary software I mentioned in my original post. If I can get my code working at that rate I'll be happy ;-)
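For anyone following along, here is a rough Python sketch of the two-stage idea described above: try a dictionary match first, and fall back to runs of capitalized words only when the dictionaries find nothing. It is only an illustration under assumptions, not my actual spider code. The tiny `FIRST_NAMES` and `LAST_NAMES` sets stand in for the real 120K-entry dictionaries, the function names are made up for this example, and the font/position heuristics are left out entirely.

```python
import re

# Hypothetical stand-ins for the real name dictionaries (about 120K entries each).
FIRST_NAMES = {"mary", "john", "georgia"}
LAST_NAMES = {"cassatt", "smith", "hilton"}

# Two or more consecutive words where only the first letter is capitalized.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def dictionary_candidates(text):
    """Return adjacent word pairs that look like 'First Last' by dictionary lookup."""
    # Punctuation between words is ignored here; a real parser would respect it.
    words = re.findall(r"[A-Za-z]+", text)
    pairs = []
    for first, last in zip(words, words[1:]):
        if first.lower() in FIRST_NAMES and last.lower() in LAST_NAMES:
            pairs.append(f"{first} {last}")
    return pairs


def capitalization_candidates(text):
    """Fallback: return every run of 2+ capitalized words, names or not."""
    return CAPITALIZED_RUN.findall(text)


def find_name_candidates(text):
    """Try the dictionaries first; use capitalization only if they find nothing."""
    matches = dictionary_candidates(text)
    if matches:
        return ("dictionary", matches)
    return ("capitalization", capitalization_candidates(text))


print(find_name_candidates("Oil on canvas by Mary Cassatt, 1893"))
print(find_name_candidates("Works by Zdzislaw Beksinski at Blue Door Gallery"))
```

The second call shows why the fallback is so noisy: the capitalization rule returns both the artist and "Blue Door Gallery", so something like the font and position checks would still have to pick between them.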