in reply to perl regex or module that identifies bots/crawlers

I've found that some really rotten bots send user agent strings claiming to be Googlebot or Slurp, even though their IP addresses are nowhere near either company's networks. The most abusive ones I've seen don't move around much in IP space, so I just drop their packets via iptables and leave it at that. I have some traps set up on my web site, so I check my server logs regularly to see who has wandered into globally excluded directories. Every agent claiming to be Googlebot or Slurp that has gone there has been an imposter.
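The log check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the log is assumed to be in Apache's "combined" format, and `/trap/` is a made-up stand-in for whatever directories are globally excluded. The function names `parse_combined` and `trap_impostors` are hypothetical too.

```perl
use strict;
use warnings;

# Hypothetical trap path; substitute the directories your robots.txt excludes.
my $TRAP_PREFIX = '/trap/';

# Parse one line of an Apache "combined" format log (an assumption about the
# server's LogFormat). Returns (ip, path, agent), or an empty list on no match.
sub parse_combined {
    my ($line) = @_;
    return unless $line =~ m{
        ^(\S+)\s+\S+\s+\S+\s+      # client IP, identd, user
        \[[^\]]*\]\s+              # timestamp
        "\S+\s+(\S+)[^"]*"\s+      # request line: method, path, protocol
        \d{3}\s+\S+\s+             # status, bytes sent
        "[^"]*"\s+                 # referer
        "([^"]*)"                  # user agent
    }x;
    return ($1, $2, $3);
}

# Count trap hits per IP for clients whose agent string claims to be
# googlebot or slurp. Returns a hash reference of ip => hit count.
sub trap_impostors {
    my (@lines) = @_;
    my %hits;
    for my $line (@lines) {
        my ($ip, $path, $agent) = parse_combined($line) or next;
        next unless index($path, $TRAP_PREFIX) == 0;    # only trap requests
        next unless $agent =~ /googlebot|slurp/i;       # only claimed crawlers
        $hits{$ip}++;
    }
    return \%hits;
}
```

The addresses it reports are only candidates; whether to drop them at the firewall is still a judgment call made after looking at where they actually come from.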

I do reject a few user agent strings, though. "PHP Script", "MSRBOT", and "Java*" are denied in my Apache configuration because they're commonly found attempting to abuse my web-to-mail gateway.
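Since the original question was about Perl, here is a rough Perl equivalent of that deny list, in case the check has to live in a script rather than in the server configuration. It is not my Apache setup, just a sketch: the name `agent_denied` is made up, and reading "Java*" as "starts with Java" is an assumption about how that pattern should be anchored.

```perl
use strict;
use warnings;

# Patterns modeled on the three strings named above. The anchoring is an
# assumption: "Java*" is read here as "begins with Java".
my @DENIED_AGENTS = (
    qr/PHP Script/,
    qr/MSRBOT/,
    qr/^Java/,
);

# Returns 1 if the user agent string matches any denied pattern, else 0.
sub agent_denied {
    my ($agent) = @_;
    return 0 unless defined $agent;
    for my $re (@DENIED_AGENTS) {
        return 1 if $agent =~ $re;
    }
    return 0;
}
```

A CGI script could call `agent_denied($ENV{HTTP_USER_AGENT})` and return a 403 when it is true.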

To answer the question you asked: I'm not aware of a ready-made module that will do what you want done.

Re^2: perl regex or module that identifies bots/crawlers
by Sartan (Pilgrim) on Mar 21, 2007 at 22:26 UTC

    Thus my earlier suggestion of matching IPs against user agents (/slurp/). Here's one range I use for slurp; there are more: /^72\.30\.\d+\.\d+$/

    It tends to do us justice.
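A minimal sketch of that IP-plus-user-agent check might look like the following. Only the 72.30.x.x pattern comes from the post; the table layout and the name `classify_client` are hypothetical. The agent pattern is made case-insensitive here, which is an assumption, so that it also matches agent strings that capitalize "Slurp".

```perl
use strict;
use warnings;

# Each entry pairs a user agent pattern with the IP patterns that agent is
# expected to come from. Only the 72.30.x.x range is taken from the post,
# which notes that more ranges exist, so this table is deliberately incomplete.
my @CRAWLERS = (
    { agent => qr/slurp/i, ips => [ qr/^72\.30\.\d+\.\d+$/ ] },
);

# Returns 'genuine' when the agent claim and the IP agree, 'impostor' when
# the agent claims a listed crawler but the IP is outside the listed ranges,
# and 'unknown' when the agent matches no listed crawler.
sub classify_client {
    my ($ip, $agent) = @_;
    for my $crawler (@CRAWLERS) {
        next unless $agent =~ $crawler->{agent};
        my $ip_ok = grep { $ip =~ $_ } @{ $crawler->{ips} };
        return $ip_ok ? 'genuine' : 'impostor';
    }
    return 'unknown';
}
```

With an incomplete range list, a real crawler on an unlisted range would be reported as an impostor, so the table needs every range you trust before acting on the result.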

    --D