You could start to build a second database (or add a field to the present one) that would hold the IP addresses that requested robots.txt, or that identified themselves as Googlebot, SurveyBot, Yahoo!, ysearch, sohu-search, msnbot, RufusBot, netcraft.com, MMCrawler, Teoma, ConveraMultimediaCrawler, and whatever else seems to be reputable.
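A minimal sketch of that idea in Perl, assuming an Apache combined-format access log; the log path and the flat file good_bots.txt (standing in for the second database or extra field) are my own assumptions, and the User-Agent substrings are just the names listed above:

    #!/usr/bin/perl
    # Sketch: collect IPs that either asked for robots.txt or identified
    # themselves with a known crawler User-Agent. Paths are assumptions.
    use strict;
    use warnings;

    my $log_file  = '/var/log/apache2/access.log';   # assumed location
    my $whitelist = 'good_bots.txt';                 # stands in for the "second database"

    # User-Agent substrings considered reputable (from the list above)
    my $bot_re = qr/Googlebot|SurveyBot|Yahoo!|ysearch|sohu-search|msnbot|RufusBot|netcraft\.com|MMCrawler|Teoma|ConveraMultimediaCrawler/i;

    my %good_ip;

    open my $log, '<', $log_file or die "Can't read $log_file: $!";
    while (my $line = <$log>) {
        # combined format: IP ident user [time] "request" status bytes "referer" "agent"
        my ($ip, $request, $agent) =
            $line =~ /^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)" \d+ \S+ "[^"]*" "([^"]*)"/
            or next;

        # criterion 1: the client asked for robots.txt
        $good_ip{$ip} = 'robots.txt' if $request =~ m{^\S+ /robots\.txt\b};

        # criterion 2: the client identifies itself as a known crawler
        $good_ip{$ip} ||= 'known UA' if $agent =~ $bot_re;
    }
    close $log;

    open my $out, '>', $whitelist or die "Can't write $whitelist: $!";
    print {$out} "$_\t$good_ip{$_}\n" for sort keys %good_ip;
    close $out;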
My main criterion for a bot being OK is whether it asks for robots.txt. However, this isn't 100% reliable. There's a bot out there, called WebVulnScan or WebVulnCrawl, that uses robots.txt to scrape only the forbidden directories and pages, ignoring the allowed ones. That's just plain rude.
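Building on the sketch above, one way to keep that criterion without getting fooled is to refuse to credit a robots.txt request when the User-Agent looks like one of those scanners. The exact UA substrings here are assumptions, so check what actually shows up in your own logs:

    # Treat "asked for robots.txt" as a positive signal only when the
    # User-Agent doesn't look like a vulnerability scanner.
    my $scanner_re = qr/WebVulnScan|WebVulnCrawl/i;

    sub looks_ok {
        my ($requested_robots_txt, $agent) = @_;
        return 0 if $agent =~ $scanner_re;   # robots.txt request, but from a scanner
        return $requested_robots_txt ? 1 : 0;
    }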
But just a thought - if a search bot is burning your bandwidth, isn't that still something you'd want to avoid?