argv has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to protect my CGI scripts from wasting CPU cycles on bots. My cgi-bin directory is listed in my robots.txt file, but apparently many bots don't honor it. So, for added protection, I'm calling

$ENV{HTTP_USER_AGENT} =~ /googlebot|slurp/i

to catch the ones from Google and Yahoo. This is quick and easy, but incomplete. The site www.botsvsbrowsers.com seems to have a pretty comprehensive list of known bot names, which I could easily dump into a file and use to build my own regex, but rather than doing work that surely others have already done, I figured I'd ask whether anyone can point me to an existing solution.
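Roughly, the check sits at the top of each script as an early exit (a simplified sketch; the exact 403 response below is illustrative, not my production code):

my $ua = $ENV{HTTP_USER_AGENT} || '';
if ($ua =~ /googlebot|slurp/i) {
    # Known crawler: refuse before doing any expensive work.
    print "Status: 403 Forbidden\r\n",
          "Content-Type: text/plain\r\n\r\n",
          "Robots are not permitted to use this script.\n";
    exit;
}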

Re: perl regex or module that identifies bots/crawlers
by shigetsu (Hermit) on Mar 20, 2007 at 19:18 UTC
      Perhaps HTTP::BrowserDetect's robot() method?
      While I retain my enthusiasm for this module, and while it does precisely what I wanted it to do -- namely, provide a simple, generic set of regexes that can determine whether a client is a robot -- it suffers from a problem that plagues everyone who ventures into this area: it's impossible to keep up with the robots. I've found numerous databases of known robot names, and all of them stipulate that no such list is complete. It is an unsolvable problem, which is the primary reason for the CAPTCHAs (those cryptic glyphs that make you type something to prove you're a human) you see on pages. That said, the robot() method does a good enough job for now, and it was certainly worth not having to spend more time dealing with this problem myself. Great bang for the buck. perlmonks rescued me once again...
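      For anyone landing here later, a minimal usage sketch of the robot() check suggested above (the 403 response is my own illustration, not part of the module):

      use strict;
      use warnings;
      use HTTP::BrowserDetect;

      my $browser = HTTP::BrowserDetect->new($ENV{HTTP_USER_AGENT});
      if ($browser->robot) {
          # robot() is true for user agents the module recognises as crawlers.
          print "Status: 403 Forbidden\r\n",
                "Content-Type: text/plain\r\n\r\n",
                "No robots, please.\n";
          exit;
      }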
Re: perl regex or module that identifies bots/crawlers
by Fletch (Bishop) on Mar 20, 2007 at 20:06 UTC

    Keep in mind that, as that information is provided by the client, it's not to be trusted. Blocking based on it will keep out the ones that are honest, but there's no guarantee that the "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)" client isn't really Nefarious J. Spammer's Goodtime Spamsalot Webcrawler.

    If you're really worried look into throttling over-active clients as well (I want to say merlyn had a Web Techniques column or three on doing this).

    Update: Ahh, yup: "Throttling your web server"; written for mod_perl 1.x and possibly getting long in the tooth, but the underlying concept is still sound even if you couldn't directly use the code.

Re: perl regex or module that identifies bots/crawlers
by duff (Parson) on Mar 20, 2007 at 19:17 UTC

    I don't think anyone has done this already. Check CPAN though.

    However, if you do create a bot matcher, I suggest you use Regexp::Assemble to turn that list of bots into an efficient regular expression and more importantly, release your work as a module on CPAN :-)
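    A sketch of what that could look like, assuming a plain text file (bot_names.txt here is hypothetical) with one bot-name fragment per line:

    use strict;
    use warnings;
    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new(flags => 'i');
    open my $fh, '<', 'bot_names.txt' or die "Cannot open bot list: $!";
    while (my $name = <$fh>) {
        chomp $name;
        $ra->add(quotemeta $name) if length $name;   # treat each name as a literal string
    }
    close $fh;

    my $bot_re = $ra->re;    # one assembled regex covering every name
    print "Looks like a bot\n"
        if ($ENV{HTTP_USER_AGENT} || '') =~ $bot_re;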

      if you do create a bot matcher, I suggest you use Regexp::Assemble to turn that list of bots into an efficient regular expression
      Yow--this is great... I wish I could vote twice for that posting.

      Also check out the Regex::PreSuf module, which can build a regex from a plain word list rather than from a list of regexes (a quick sketch follows below).


      --
      Rohan
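      A quick sketch of the Regex::PreSuf approach mentioned above; the word list is illustrative only, and presuf() returns a regex string with shared prefixes and suffixes factored out:

      use strict;
      use warnings;
      use Regex::PreSuf;

      my @bots = qw(googlebot msnbot slurp);   # in practice, read these from your bot-name file
      my $re   = presuf(@bots);                # a single regex string covering the whole list
      print "bot\n" if ($ENV{HTTP_USER_AGENT} || '') =~ /$re/i;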

Re: perl regex or module that identifies bots/crawlers
by rhesa (Vicar) on Mar 21, 2007 at 01:25 UTC
    Why not do this in Apache? Use mod_rewrite, and you won't need to spawn new processes.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^BadBot.*
    RewriteRule .* - [F]

    Another solution could be to use a rewrite map that contains all the bad bots you know:

    RewriteMap robotmap dbm:/path/to/file/map.db
    RewriteCond %{HTTP_USER_AGENT} !=""
    RewriteCond ${robotmap:%{HTTP_USER_AGENT}|NOT-FOUND} !=NOT-FOUND
    RewriteRule .* - [F]
    This does require the use of a dbm map, since the USER_AGENT can contain spaces. It also requires exact matches, so you need to keep your map up to date. It should be very fast though.
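    For completeness, a rough Perl sketch of how such a map could be rebuilt from a plain list of bad-bot user-agent strings (the file names and the SDBM flavour are assumptions; the dbm type must match what your Apache/APR build expects, so adjust the tie class, e.g. DB_File for Berkeley DB maps):

    #!/usr/bin/perl
    # Rebuild the dbm map from a list of bad-bot user-agent strings,
    # one full UA string per line.
    use strict;
    use warnings;
    use SDBM_File;
    use Fcntl;

    tie my %map, 'SDBM_File', '/path/to/file/map', O_RDWR|O_CREAT, 0644
        or die "Cannot tie map: $!";

    open my $fh, '<', 'bad_bots.txt' or die "Cannot open bot list: $!";
    while (my $ua = <$fh>) {
        chomp $ua;
        next unless length $ua;
        $map{$ua} = 'BAD';    # any value other than NOT-FOUND satisfies the RewriteCond
    }
    close $fh;
    untie %map;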

    See also the mod_rewrite manual.

Re: perl regex or module that identifies bots/crawlers
by sgifford (Prior) on Mar 20, 2007 at 21:15 UTC
    Google and Yahoo should certainly be honoring your robots.txt file. You might want to take a closer look, to see what IP address these requests are coming from and what URLs they are fetching; perhaps there is another path to your cgi-bin directory that isn't being protected by your robots file, or maybe there is an error that's preventing your robots file from being processed correctly.
      I agree that the real Google and Yahoo, and the other big ones, will certainly honor robots.txt. If bots bearing their names invade a server, that may only indicate that these are popular fake names for rogue bots. It would make sense to look like a legit bot rather than, for instance, a browser.

      That said, it is certainly a good idea to check if robots.txt is working as it should.

      Anno

Re: perl regex or module that identifies bots/crawlers
by gloryhack (Deacon) on Mar 21, 2007 at 02:02 UTC
    I've found that some really rotten bots provide user agent strings claiming they're googlebot or slurp, but their IP addresses are nowhere near either of those two companies. The ones I've found being most abusive don't seem to move around much in IP space, so I just drop their packets via iptables and let it go at that. Because I have some traps set up on my web site, I check my server logs regularly to see who's wandered into globally excluded directories, and every agent claiming to be googlebot or slurp who's gone there has been an imposter.

    I do reject a few user agent strings, though. "PHP Script", "MSRBOT", and "Java*" are denied via (my) Apache's configuration because they're commonly found attempting to abuse my web-to-mail gateway.

    To answer the question you asked: I'm not aware of a ready-made module that will do what you want done.

      Thus my earlier suggestion of matching IPs (here's one range I use for Slurp; there are more: /^72\.30\.\d+\.\d+$/) against user agents (/slurp/); a sketch of the combined check follows below.

      It tends to do us justice.

      --D
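      For the curious, a minimal sketch of that combined check (the 72.30.* range is the one quoted above; real crawler ranges change, so verify them independently before relying on this):

      my $ua = $ENV{HTTP_USER_AGENT} || '';
      my $ip = $ENV{REMOTE_ADDR}     || '';

      my $claims_slurp     = $ua =~ /slurp/i;
      my $from_slurp_range = $ip =~ /^72\.30\.\d+\.\d+$/;

      if ($claims_slurp && !$from_slurp_range) {
          # The UA says Slurp but the address doesn't match: likely an impostor.
          print "Status: 403 Forbidden\r\n",
                "Content-Type: text/plain\r\n\r\n",
                "Go away.\n";
          exit;
      }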
Re: perl regex or module that identifies bots/crawlers
by CountZero (Bishop) on Mar 20, 2007 at 20:33 UTC
    Do you have any statistics on how many cycles you lose to robots' activity?

    In the long run you may be losing more cycles avoiding robots than the robots would have cost you.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Let's put it this way: when I don't block at all, my load average can peak well over 200, leaving me unable even to log in via ssh. It has also caused my system to crash. I ran a script that monitors load averages, and when the load goes over 20 it reports the top active programs. At those points, it's always the search scripts responding to crawlers. (Others seem to have run into the same problem; see http://www.jensense.com/archives/2006/06/yahoo_search_ma.html)

      Since I installed the simplistic test, my load has never gone over 1.0, even with my site's usual traffic of over 25,000 unique visitors a day. I have anywhere from 20 to 50 users doing searches at any given moment, according to my runtime logs.

      The remnant bots that I don't check are not hurting, per se, but they are polluting my stats on what people search for. (I really want better data on what people are coming to my site FOR. The crawlers seem to be doing searches on random words.)

      I also want to provide more options to searches, but because those would spin even more cpu cycles, I'd rather wait till I can really block out the cruft of these remnant bots.

      I'm not yet concerned about blocking bots that try to mask themselves as normal users--they haven't been much of a problem so far. I can sense illicit activity by monitoring for searches that come in a short timeframe (like within a second of the last one). That's a sure sign of a non-human, but I'd rather nip the problem in the bud if I can.
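      A minimal sketch (mine, simplified) of that within-a-second test: remember each client's last search time in a tied SDBM file. The file name and the one-second threshold are assumptions, and a production version would need file locking:

      use strict;
      use warnings;
      use SDBM_File;
      use Fcntl;

      my $ip = $ENV{REMOTE_ADDR} || 'unknown';
      tie my %last_search, 'SDBM_File', '/tmp/search_throttle', O_RDWR|O_CREAT, 0644
          or die "Cannot tie throttle db: $!";

      my $now  = time;
      my $prev = $last_search{$ip} || 0;
      $last_search{$ip} = $now;
      untie %last_search;

      if ($now - $prev < 1) {
          # Two searches within a second from the same address: treat as a bot.
          print "Status: 503 Service Unavailable\r\n",
                "Content-Type: text/plain\r\n\r\n",
                "Please slow down.\n";
          exit;
      }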

        The company I work for gets over 9 million hits per day. Many are from bots but bots typically just act like regular users in order to crawl your site for caching/search/things like that.

        I would take some time to look at your code to see what is going on. 25K visitors causing a load of over 200 points to something other than just spiders crawling your site. What would happen if you had 25K valid users? Your site crashes? That's probably not what you want.

        Where I work, I use a combination of IP address and user agent string to identify search engines. We don't block them per se (we still want to show up in Google searches); we just don't give them shopping carts or do other tracking-type work with them.

        D
        I never realised that robots could have such an impact on a web-site.

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        I had a similar problem a few years ago when I was using session tokens embedded in the URL. Due to the nature of the site, the token was tied to versioned session data, so the token could branch.

        This acted as a wonderful spider trap, since the URLs were always different whenever a bot tried to retrace its steps or follow another path it had already tested.

        Got round this in the end by analysing the speed at which sessions were being updated and using that as a bot detector.

        Generally it is only impolite bots that are an issue. One request a minute is OK, but 50 a second is not. If you can't do it on a session basis, you might want to look at Apache throttling. This is not ideal, as you may well end up throttling everyone, not just the bots.

        The trouble is that a bot can use rotating IPs, disguised or changing user agents, or anonymous proxy servers to hide what it is.

        UnderMine
        OS History - Operating system history

Re: perl regex or module that identifies bots/crawlers
by Moron (Curate) on Mar 22, 2007 at 17:12 UTC
    What about encrypting the originating IP address plus a timestamp and putting the result at the end of the URL for all valid navigation? Bots and crawlers won't be able to obey your protocol. Your CGIs can run a routine (which you would write) to decrypt and check the relevant part of whatever URL they were invoked with, and abusers can be detected automatically that way.
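    A minimal sketch of that idea, using an HMAC signature rather than real encryption (forgery-proofing is all that matters here); the secret and the one-day validity window are assumptions:

    use strict;
    use warnings;
    use Digest::SHA qw(hmac_sha1_hex);

    my $secret = 'change-me';    # hypothetical shared secret, kept server-side

    sub make_token {
        my ($ip) = @_;
        my $ts = time;
        return "$ts-" . hmac_sha1_hex("$ip:$ts", $secret);
    }

    sub check_token {
        my ($ip, $token) = @_;
        my ($ts, $sig) = split /-/, $token, 2;
        return 0 unless defined $sig && $ts =~ /^\d+$/;
        return 0 if time - $ts > 86_400;                      # stale token
        return $sig eq hmac_sha1_hex("$ip:$ts", $secret);     # forged token?
    }

    # Append "?t=" . make_token($ENV{REMOTE_ADDR}) to internal links, and
    # have each CGI reject requests whose token fails check_token().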

    As for retaliation, immediate blocking is just too nice. Having set a daily quota on how often an IP address may hammer your CGIs, it is better to redirect those who exceed it to the website of your least favourite government organisation - that way they can go spin each other ;) But do block above some other, higher threshold that you can now afford. Careful selection of the redirect (e.g. to an FTP variant of some government agency URL) will keep the bot busy with the other 'enemy' machine (whatever you define that to be) before it reruns against your URLs, so you can afford a higher threshold for outright blocking rather than redirecting under those circumstances. You could also use this middle threshold to log the activity, as has already been suggested as a general response.

    -M

    Free your mind

Re: perl regex or module that identifies bots/crawlers
by warkruid (Novice) on Mar 22, 2007 at 15:57 UTC
    Mmmm.. started wondering. Couldn't you use something like wpoison (www.monkeys.com/wpoison) to generate an IP blacklist from the crawlers that ignore the robots exclusion protocol, and use that blacklist to dynamically update your firewall rules? wpoison generates pages that are clearly marked as off limits to crawlers, so anything that follows a wpoison-generated page for more than (say) 2 levels would be a valid candidate for blacklisting.

    Interesting..
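    A rough sketch of how that could be wired up (the log path, the /trap/ URL prefix, the Common Log Format assumption and the two-level threshold are all mine, not wpoison's):

    #!/usr/bin/perl
    # Scan the access log for clients that kept following links under the
    # wpoison trap directory, then drop their packets with iptables.
    use strict;
    use warnings;

    my %trapped;
    open my $log, '<', '/var/log/apache/access.log' or die "Cannot open log: $!";
    while (my $line = <$log>) {
        # Common Log Format: client address is the first field.
        next unless $line =~ m{^(\S+) .* "GET /trap/};
        $trapped{$1}++;
    }
    close $log;

    for my $ip (sort keys %trapped) {
        next if $trapped{$ip} < 3;    # only crawlers that went more than a couple of levels deep
        system('iptables', '-A', 'INPUT', '-s', $ip, '-j', 'DROP') == 0
            or warn "iptables failed for $ip\n";
    }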

    Trying to identify crawlers by signatures is ultimately a losing battle. I've been down that road with spam.
    Blocking them when they trespass seems a better alternative to me.