Re^3: blocking site scrapers
by chargrill (Parson) on Feb 07, 2006 at 04:48 UTC
Well, that certainly makes more sense than, say, dynamically altering firewall rules (yes, I've seen that). :)
A well-behaved search engine bot SHOULD be discernible by its user agent string (I doubt the script kiddies bother to change theirs), and you may want to note whether a client requests, or has previously requested, /robots.txt...
Granted, none of this is a sure thing, but a combination of "tests" may get you close enough to what you want without restricting others...
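Something along these lines is roughly what I have in mind - an untested sketch, where the "reputable" UA patterns and the way robots.txt fetches get remembered are just placeholders:

use strict;
use warnings;

# user agents we treat as probably-legitimate crawlers (placeholder list)
my @good_ua = ( qr/Googlebot/i, qr/msnbot/i, qr/Yahoo! Slurp/i );

# IPs we have already seen fetch /robots.txt (persist this however you like)
my %asked_for_robots;

sub bot_score {
    my ( $ip, $ua, $uri ) = @_;

    # remember anyone who asks for robots.txt
    $asked_for_robots{$ip} = 1 if $uri eq '/robots.txt';

    my $score = 0;
    $score++ if grep { $ua =~ $_ } @good_ua;
    $score++ if $asked_for_robots{$ip};
    return $score;    # 0 = suspicious, 2 = probably a well-behaved bot
}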
--chargrill
$/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } sig
map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu" );
What's wrong with dynamically altering firewall rules? Before answering, you should perhaps consider that firewalls can be used for tarpitting (i.e., slowing connections down to the point of unusability) or for rate-limiting individual addresses or address ranges, as well as for simple blocking. In fact, if you have to resort to an IP-based policy (generally a bad idea), a well-implemented firewall solution is usually a better idea than server-side request mangling.
To answer the OP's question: if you're on Linux, you may want to look at the "recent" iptables extension. This article provides an introduction to using it. If you're on a different OS, have a look at that OS's firewall documentation.
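For what it's worth, a minimal sketch of what that can look like (the port, the list name "HTTP" and the thresholds are arbitrary): a source that opens more than 20 new connections to port 80 within 60 seconds gets dropped.

# check-and-drop first, then record the source in the "HTTP" list
iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
         -m recent --name HTTP --update --seconds 60 --hitcount 20 -j DROP
iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
         -m recent --name HTTP --set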
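# Apache httpd.conf: tag requests that look like worm probes with the
# "worm" env var, then have CustomLog feed a route(8) command for each
# tagged request's source address (%a) to a shell, blackhole-routing it: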
SetEnvIf Request_URI "winnt/system32/cmd\.exe" worm
# etc ...
CustomLog "|exec sh" "/sbin/route -nq add -host %a 127.0.0.1 -blackhole" env=worm
... so I guess to answer your question: nothing is wrong with it per se. This was a somewhat popular way to keep Nimda, Code Red, sadmind, etc. from doing too much damage to web servers a few years ago. More can be read here: log monitors and here: securityfocus. Those links even suggest that local or upstream firewalling would indeed be more efficient.
--chargrill
$/ = q#(\w)# ; sub sig { print scalar reverse join ' ', @_ } sig
map { s$\$/\$/$\$2\$1$g && $_ } split( ' ', ",erckha rlPe erthnoa stJu" );
Re^3: blocking site scrapers
by spiritway (Vicar) on Feb 07, 2006 at 05:46 UTC
You could start to build a second database (or add a field in the present one) that would include IP numbers that requested robots.txt, or that identified themselves as Googlebot, SurveyBot, Yahoo!, ysearch, sohu-search, msnbot, RufusBot, netcraft.com, MMCrawler, Teoma, ConveraMultimediaCrawler, and whatever else seems to be reputable.
My main criterion for a bot being OK is whether it asks for robots.txt. However, this isn't 100% reliable. There's a bot out there (called WebVulnScan or WebVulnCrawl) that uses robots.txt to scrape only the forbidden directories and pages, ignoring the allowed ones. That's just plain rude.
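A rough sketch of that extra field, with DBI - the clients(ip, is_known_bot) table and the use of SQLite are made up for illustration:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=clients.db', '', '',
    { RaiseError => 1 } );

my @reputable = qw( Googlebot SurveyBot msnbot RufusBot Teoma MMCrawler );

# flag an IP as a known bot when it fetches robots.txt or sends a reputable UA
sub note_request {
    my ( $ip, $uri, $ua ) = @_;
    if ( $uri eq '/robots.txt' or grep { index( $ua, $_ ) >= 0 } @reputable ) {
        $dbh->do( 'UPDATE clients SET is_known_bot = 1 WHERE ip = ?',
            undef, $ip );
    }
}

sub is_known_bot {
    my ($ip) = @_;
    my ($flag) = $dbh->selectrow_array(
        'SELECT is_known_bot FROM clients WHERE ip = ?', undef, $ip );
    return $flag;
}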
But just a thought - if a search bot is burning your bandwidth, isn't that still something you'd want to avoid?
Well, the request they send would contain that text. For example, they'd say "GET /robots.txt...". Their request usually contains other information, such as the IP they're using (or claiming to use), the name of the browser or user agent, and so on. A user agent might show up as "Mozilla/2.0(compatible; Ask Jeeves/Teoma;+http://sp.ask.com/docs/about/tech_crawling.html)". This is a polite bot whose user agent string includes an address where you can get more information about it.
Of course, someone could fake most of that (maybe all of it), but they usually don't. And anyway, even if it's MotherTeresaBot, if it's hogging your bandwidth, it's still causing you problems.
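If it helps, here's roughly how you'd pull those pieces out of an access log line - the sample line is made up, and the pattern assumes the common "combined" log format:

use strict;
use warnings;

my $line = '66.249.66.1 - - [07/Feb/2006:04:48:00 -0500] '
         . '"GET /robots.txt HTTP/1.0" 200 68 "-" '
         . '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"';

if ( $line =~ m/^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)" \d+ \S+ "[^"]*" "([^"]*)"/ ) {
    my ( $ip, $request, $ua ) = ( $1, $2, $3 );
    print "robots.txt fetch from $ip ($ua)\n" if $request =~ m{^GET /robots\.txt};
}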
Re^3: blocking site scrapers
by mkirank (Chaplain) on Feb 09, 2006 at 07:10 UTC
If you have mod_perl installed on your server, you could use the technique given in the mod_perl book:
Blocking Greedy Clients
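The general shape of that technique is a PerlAccessHandler that counts recent requests per client and returns FORBIDDEN past some threshold. Here's a hand-rolled sketch of the idea (not the book's code) - counts are kept in memory per Apache child, and the limits are arbitrary:

package My::BlockGreedy;
use strict;
use Apache::Constants qw(OK FORBIDDEN);

my $MAX    = 100;    # max requests allowed per client ...
my $WINDOW = 60;     # ... within this many seconds
my %hits;            # ip address => list of request timestamps

sub handler {
    my $r   = shift;
    my $ip  = $r->connection->remote_ip;
    my $now = time;

    # drop timestamps that have fallen out of the window, then add this request
    my $list = $hits{$ip} ||= [];
    @$list = grep { $_ > $now - $WINDOW } @$list;
    push @$list, $now;

    return @$list > $MAX ? FORBIDDEN : OK;
}

1;

Then in httpd.conf:

PerlAccessHandler My::BlockGreedy

A real implementation would want to share state between Apache children (a DBM file, shared memory, etc.) so a greedy client can't spread its requests across child processes.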