in reply to Re: perl regex or module that identifies bots/crawlers
in thread perl regex or module that identifies bots/crawlers

Let's put it this way: when I don't block at all, my load average can peak well over 200, leaving me unable to even log in via ssh. It has also caused my system to crash. I run a script that monitors the load average, and when it goes over 20, it reports the most active processes. At those points, it's always the search scripts responding to crawlers. (Research suggests others have had the same problem; see http://www.jensense.com/archives/2006/06/yahoo_search_ma.html.)
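
The watchdog is nothing fancy; something along these lines gives the idea (this is only a rough sketch assuming Linux's /proc/loadavg and a made-up log path, not the exact script I run):

    #!/usr/bin/perl
    # Rough sketch of a load watchdog: when the 1-minute load average
    # crosses the threshold, append the busiest processes to a log.
    # The threshold and log path are illustrative.
    use strict;
    use warnings;

    my $threshold = 20;
    my $logfile   = '/var/log/load-watch.log';   # made-up path

    while (1) {
        open my $fh, '<', '/proc/loadavg' or die "loadavg: $!";
        my ($one_min) = split ' ', scalar <$fh>;
        close $fh;

        if ($one_min > $threshold) {
            # Top CPU consumers, highest first.
            my @top = `ps -eo pcpu,pid,user,args --sort=-pcpu | head -n 15`;
            open my $log, '>>', $logfile or die "$logfile: $!";
            print $log scalar(localtime), "  load=$one_min\n", @top, "\n";
            close $log;
        }
        sleep 60;
    }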

Since I installed the simplistic test, my load has never gone over 1.0, even with my site's usual traffic of over 25,000 unique visitors a day. I have anywhere from 20 to 50 users running searches at any given moment, according to my runtime logs.
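
The test itself is nothing elaborate: just a user-agent match in front of the search handler, roughly this shape (the pattern and the 403 response here are illustrative, not a copy of my actual code):

    use strict;
    use warnings;
    use CGI;

    my $q  = CGI->new;
    my $ua = $q->user_agent || '';

    # Common crawler signatures; extend the alternation as new ones show up.
    if ($ua =~ /(?:bot|crawl|spider|slurp|archiver|mediapartners)/i) {
        print $q->header(-status => '403 Forbidden', -type => 'text/plain');
        print "Search is not available to crawlers.\n";
        exit;
    }

    # ...otherwise fall through to the normal search handling...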

The remnant bots that my check doesn't catch aren't hurting anything, per se, but they are polluting my stats on what people search for. (I really want better data on what people are coming to my site FOR. The crawlers seem to be searching on random words.)

I also want to offer more search options, but because those would burn even more CPU cycles, I'd rather wait until I can really block out the cruft from these remnant bots.

I'm not yet concerned about blocking bots that try to mask themselves as normal users; they haven't been too much of a problem so far. I can spot illicit activity by watching for searches that come in a short timeframe (within a second of the last one, say). That's a sure sign of a non-human, but I'd rather nip the problem in the bud if I can.
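
Automating that timing check would be straightforward; a minimal sketch (the DB_File store, its path, and the one-second cutoff are placeholders for illustration) might look like this:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    # Remember when each client last ran a search; anything faster than
    # the cutoff gets flagged. DB_File persists the data across requests.
    my $cutoff = 1;                         # seconds
    my $dbfile = '/tmp/last_search.db';     # placeholder path

    tie my %last_search, 'DB_File', $dbfile, O_RDWR | O_CREAT, 0644
        or die "Cannot tie $dbfile: $!";

    sub too_fast {
        my ($client_ip) = @_;
        my $now      = time;
        my $previous = $last_search{$client_ip} || 0;
        $last_search{$client_ip} = $now;
        return ( $now - $previous ) <= $cutoff;   # true = not human-paced
    }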


Re^3: perl regex or module that identifies bots/crawlers
by Sartan (Pilgrim) on Mar 20, 2007 at 23:40 UTC

    The company I work for gets over 9 million hits per day. Many are from bots, but bots typically just act like regular users in order to crawl your site for caching, search indexing, and the like.

    I would take some time to look at your code to see what is going on. 25K visitors causing a load of over 200 points to something other than just spiders crawling your site. What would happen if you had 25K valid users? Your site crashes? That's probably not what you want.

    Where I work, I use a combination of IP address and user agent string to identify search engines. We don't block them, per se (we still want to show up in Google searches); we just don't give them shopping carts or do other tracking-type work with them.
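
    In outline it is nothing more than this (the patterns, address prefixes, and helper names below are examples only, not our production code):

        use strict;
        use warnings;

        # Example crawler signatures and address prefixes only; real
        # lists come from the search engines' published documentation.
        my @bot_ua_patterns = ( qr/Googlebot/i, qr/Slurp/i, qr/msnbot/i );
        my @bot_ip_prefixes = ( '66.249.', '72.30.', '65.55.' );

        sub is_known_search_engine {
            my ( $ip, $ua ) = @_;
            my $ua_match = grep { $ua =~ $_ } @bot_ua_patterns;
            my $ip_match = grep { index( $ip, $_ ) == 0 } @bot_ip_prefixes;
            return $ua_match && $ip_match;    # require both to be safe
        }

        # In the page handler: serve the page, but skip the cart and the
        # tracking for identified crawlers (helper names are placeholders).
        # unless ( is_known_search_engine( $ip, $ua ) ) {
        #     create_shopping_cart();
        #     record_visit();
        # }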

    D
Re^3: perl regex or module that identifies bots/crawlers
by CountZero (Bishop) on Mar 20, 2007 at 22:58 UTC
    I never realised that robots could have such an impact on a website.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re^3: perl regex or module that identifies bots/crawlers
by UnderMine (Friar) on Mar 23, 2007 at 10:24 UTC

    I had a similar problem a few years ago when I was using session tokens embedded in the URL. Due to the nature of the site, the token related to versioned session data, so the token could branch.

    This acted as a wonderful spider trap, as the URLs were always different whenever a spider tried to retrace its steps or take another path it had already tested.

    I got around this in the end by analysing the speed at which sessions were being updated and using that as a bot detector.
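
    In code terms it boiled down to counting how often a session had been touched in the last minute or so; roughly like this (the threshold and the in-memory store are illustrative, not the original implementation):

        use strict;
        use warnings;

        my %session_hits;    # session id => list of recent request times

        # Flag a session once its request rate is clearly non-human.
        sub session_looks_like_bot {
            my ( $session_id, $max_per_minute ) = @_;
            $max_per_minute ||= 30;
            my $now = time;

            my $hits = $session_hits{$session_id} ||= [];
            push @$hits, $now;

            # Keep only the last minute's worth of timestamps.
            @$hits = grep { $now - $_ <= 60 } @$hits;

            return @$hits > $max_per_minute;
        }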

    Generally it is only the impolite bots that are an issue. One request a minute is OK, but 50 a second is not. If you can't do it on a session basis, you might want to look at Apache throttling. That is not ideal, as you may well end up throttling everyone, not just the bots.

    The trouble is that a bot can use rotating IPs, disguised or changing user agents, or anonymous proxy servers to hide what it is.

    UnderMine
    OS History - Operating system history