You could start to build a second database (or add a field to the present one) of IP addresses that requested robots.txt, or that identified themselves as Googlebot, SurveyBot, Yahoo!, ysearch, sohu-search, msnbot, RufusBot, netcraft.com, MMCrawler, Teoma, ConveraMultimediaCrawler, or whatever else seems reputable.
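A minimal sketch of how that harvesting might look, assuming an Apache combined-format access log; the log path and the good_bots.txt output file are placeholders for whatever your setup actually uses:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder paths -- adjust for your server.
my $log_file  = '/var/log/apache2/access.log';
my $list_file = 'good_bots.txt';

# User-agent substrings from the list above; extend as needed.
my @reputable = qw(
    Googlebot SurveyBot Yahoo! ysearch sohu-search msnbot
    RufusBot netcraft.com MMCrawler Teoma ConveraMultimediaCrawler
);
my $ua_re = join '|', map { quotemeta } @reputable;

my %good_ip;
open my $log, '<', $log_file or die "Can't read $log_file: $!";
while (<$log>) {
    # Assumes Apache combined log format:
    # IP - - [date] "GET /path HTTP/1.x" status bytes "referer" "agent"
    my ($ip, $path, $agent) =
        m{^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"}
        or next;
    $good_ip{$ip} = 1 if $path =~ m{^/robots\.txt} or $agent =~ /$ua_re/o;
}
close $log;

# Write one IP per line; this could just as well be a DB table.
open my $out, '>', $list_file or die "Can't write $list_file: $!";
print {$out} "$_\n" for sort keys %good_ip;
close $out;
```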
My main criterion for a bot being OK is whether it asks for robots.txt. However, this isn't 100% reliable: there's a bot out there, called WebVulnScan or WebVulnCrawl, that uses robots.txt to scrape only the forbidden directories and pages, ignoring the allowed ones. That's just plain rude.
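At request time you could then consult that list; again just a sketch, assuming the good_bots.txt file from above and a plain CGI environment:

```perl
use strict;
use warnings;

# Load the allowlist written by the harvesting script above.
my $list_file = 'good_bots.txt';
my %good_ip;
open my $in, '<', $list_file or die "Can't read $list_file: $!";
while (my $line = <$in>) {
    chomp $line;
    $good_ip{$line} = 1;
}
close $in;

# REMOTE_ADDR holds the client IP under CGI; adjust for mod_perl etc.
my $ip = $ENV{REMOTE_ADDR} || '';

unless ($good_ip{$ip}) {
    # Not a known-good bot: apply whatever throttling or blocking
    # policy you use for suspected scrapers here.
}
```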
But just a thought: if a search bot is burning your bandwidth, isn't that still something you'd want to avoid?