Re^3: blocking site scrapers
by chargrill (Parson) on Feb 07, 2006 at 04:48 UTC
Well, that certainly makes more sense than, say, dynamically altering firewall rules (yes, I've seen that). :)
A well-behaved search engine bot SHOULD be discernible by its user agent string (I doubt the script kiddies bother to change theirs), and you may want to note whether a client requests, or has previously requested, /robots.txt...
Granted, none of this is a sure thing, but a combination of "tests" may get you close enough to what you want without restricting others...
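Something along these lines is roughly what I have in mind - an untested sketch, where the "reputable" UA patterns and the way robots.txt fetches get remembered are just placeholders:

use strict;
use warnings;

# user agents we treat as probably-legitimate crawlers (placeholder list)
my @good_ua = ( qr/Googlebot/i, qr/msnbot/i, qr/Yahoo! Slurp/i );

# IPs we have already seen fetch /robots.txt (persist this however you like)
my %asked_for_robots;

sub bot_score {
    my ( $ip, $ua, $uri ) = @_;

    # remember anyone who asks for robots.txt
    $asked_for_robots{$ip} = 1 if $uri eq '/robots.txt';

    my $score = 0;
    $score++ if grep { $ua =~ $_ } @good_ua;
    $score++ if $asked_for_robots{$ip};
    return $score;    # 0 = suspicious, 2 = probably a well-behaved bot
}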
--chargrill
$/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } sig
map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu" );
What's wrong with dynamically altering firewall rules? Before answering, you should perhaps consider that firewalls can be used for tarpitting (i.e., slowing connections down to the point of unusability) or for rate-limiting individual addresses or address ranges, as well as for simple blocking. In fact, if you have to resort to an IP-based policy (generally a bad idea), a well-implemented firewall solution is usually a better idea than server-side request mangling.
To answer the OP's question: if you're on Linux, you may want to look at the "recent" iptables extension. This article provides an introduction to using it. If you're on a different OS, have a look at that OS's firewall documentation.
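For what it's worth, a minimal sketch of what that can look like (the port, the list name "HTTP" and the thresholds are arbitrary): a source that opens more than 20 new connections to port 80 within 60 seconds gets dropped.

# check-and-drop first, then record the source in the "HTTP" list
iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
         -m recent --name HTTP --update --seconds 60 --hitcount 20 -j DROP
iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
         -m recent --name HTTP --set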
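# Apache httpd.conf: tag requests that look like worm probes with the
# "worm" env var, then have CustomLog feed a route(8) command for each
# tagged request's source address (%a) to a shell, blackhole-routing it: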
SetEnvIf Request_URI "winnt/system32/cmd\.exe" worm
# etc ...
CustomLog "|exec sh" "/sbin/route -nq add -host %a 127.0.0.1 -blackhole" env=worm
... so I guess to answer your question: nothing is wrong with it per se. This was a somewhat popular way to keep Nimda, Code Red, sadmind, etc. from doing too much damage to web servers a few years ago. More can be read here: log monitors and here: securityfocus. Those links even suggest that local or upstream firewalling would indeed be more efficient.
--chargrill
$/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } sig
map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu" );
Re^3: blocking site scrapers
by spiritway (Vicar) on Feb 07, 2006 at 05:46 UTC
You could start to build a second database (or add a field in the present one) that would include IP numbers that requested robots.txt, or that identified themselves as Googlebot, SurveyBot, Yahoo!, ysearch, sohu-search, msnbot, RufusBot, netcraft.com, MMCrawler, Teoma, ConveraMultimediaCrawler, and whatever else seems to be reputable.
My main criterion for a bot being OK is whether it asks for robots.txt. However, this isn't 100% reliable. There's a bot out there (called WebVulnScan or WebVulnCrawl) that uses robots.txt to scrape only the forbidden directories and pages, ignoring the allowed ones. That's just plain rude.
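A rough sketch of that extra field, with DBI - the clients(ip, is_known_bot) table and the use of SQLite are made up for illustration:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=clients.db', '', '',
    { RaiseError => 1 } );

my @reputable = qw( Googlebot SurveyBot msnbot RufusBot Teoma MMCrawler );

# flag an IP as a known bot when it fetches robots.txt or sends a reputable UA
sub note_request {
    my ( $ip, $uri, $ua ) = @_;
    if ( $uri eq '/robots.txt' or grep { index( $ua, $_ ) >= 0 } @reputable ) {
        $dbh->do( 'UPDATE clients SET is_known_bot = 1 WHERE ip = ?',
            undef, $ip );
    }
}

sub is_known_bot {
    my ($ip) = @_;
    my ($flag) = $dbh->selectrow_array(
        'SELECT is_known_bot FROM clients WHERE ip = ?', undef, $ip );
    return $flag;
}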
But just a thought - if a search bot is burning your bandwidth, isn't that still something you'd want to avoid?
Well, the request they send would contain that text. For example, they'd say "GET /robots.txt...". Their request usually contains other information, such as the IP they're using (or claiming to use), the name of the browser or user agent, and so on. A user agent might show up as "Mozilla/2.0(compatible; Ask Jeeves/Teoma;+http://sp.ask.com/docs/about/tech_crawling.html)". This is a polite bot whose user agent string includes an address where you can get more information about it.
Of course, someone could fake most of that (maybe all of it), but they usually don't. And anyway, even if it's MotherTeresaBot, if it's hogging your bandwidth, it's still causing you problems.
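If it helps, here's roughly how you'd pull those pieces out of an access log line - the sample line is made up, and the pattern assumes the common "combined" log format:

use strict;
use warnings;

my $line = '66.249.66.1 - - [07/Feb/2006:04:48:00 -0500] '
         . '"GET /robots.txt HTTP/1.0" 200 68 "-" '
         . '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"';

if ( $line =~ m/^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)" \d+ \S+ "[^"]*" "([^"]*)"/ ) {
    my ( $ip, $request, $ua ) = ( $1, $2, $3 );
    print "robots.txt fetch from $ip ($ua)\n" if $request =~ m{^GET /robots\.txt};
}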
Re^3: blocking site scrapers
by mkirank (Chaplain) on Feb 09, 2006 at 07:10 UTC
If you have mod_perl installed on your server, you could use the technique given in the mod_perl book:
Blocking Greedy Clients
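The general shape of that technique is a PerlAccessHandler that counts recent requests per client and returns FORBIDDEN past some threshold. Here's a hand-rolled sketch of the idea (not the book's code) - counts are kept in memory per Apache child, and the limits are arbitrary:

package My::BlockGreedy;
use strict;
use Apache::Constants qw(OK FORBIDDEN);

my $MAX    = 100;    # max requests allowed per client ...
my $WINDOW = 60;     # ... within this many seconds
my %hits;            # ip address => list of request timestamps

sub handler {
    my $r   = shift;
    my $ip  = $r->connection->remote_ip;
    my $now = time;

    # drop timestamps that have fallen out of the window, then add this request
    my $list = $hits{$ip} ||= [];
    @$list = grep { $_ > $now - $WINDOW } @$list;
    push @$list, $now;

    return @$list > $MAX ? FORBIDDEN : OK;
}

1;

Then in httpd.conf:

PerlAccessHandler My::BlockGreedy

A real implementation would want to share state between Apache children (a DBM file, shared memory, etc.) so a greedy client can't spread its requests across child processes.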