norobotlog

by quartertone (Initiate)
on Sep 06, 2004 at 01:42 UTC ( [id://388697] )
Category: Utility Scripts/Text Processing/Miscellaneous
Author/Contact Info Gary C. Wang (gary at
Description: I always look at my Apache server log files from the command line. It always bothered me to see "GET /robots.txt" contaminating the logs, and it was frustrating trying to visually determine which requests came from crawlers and which from actual users. So I wrote this little utility, which filters out requests made from IP addresses that grab "robots.txt". I suspect there are GUI log parsers that provide the same functionality, but 1) I don't need something that heavy, 2) I like to code, 3) imageekwhaddyawant.
use strict;
use warnings;
# Apache logs robots filter-outer
# Author: Gary C. Wang
# Contact:
# Website:
# Filename: norobotlog
# Usage: norobotlog [logfile_name]
# This script parses Apache log files and 
#   filters out entries from IP addresses 
#   that request "robots.txt" file, commonly
#   associated with webcrawlers and site indexers.
# Prior to usage, check regexp to make sure it matches your log format
# My log format is something like:
#  192.168.0.xx - - [11/Jul/2004:22:25:22 -0400] "GET /robots.txt HTTP/1.0" 200 78

my %robots;
my $ip_ptn = '((\d{1,3}\.){3}\d{1,3})'; # this regexp matches IP addresses
my @file = <>; # slurp the whole log from the named file(s) or STDIN

# First, find out which IPs are associated with crawlers
foreach (@file) {
    # ----- Adjust this pattern to match your log file -----
    $robots{$1}++ if m/^$ip_ptn .+?robots\.txt/;
}

# Then weed those out, printing only the lines from IPs
#   that never requested robots.txt
foreach (@file) {
    if (m/$ip_ptn /) {
        print if ! defined $robots{$1};
    }
}
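A quick way to try it out (the script and log file names here are just for illustration; use whatever yours are called):

```shell
# Hypothetical invocation -- adjust names to taste.
chmod +x norobotlog
./norobotlog access.log | less

# Since it reads from STDIN too, you can pipe into it:
tail -n 500 access.log | ./norobotlog
```

Note that because the script slurps the whole log before filtering, piping a live `tail -f` into it won't produce output until EOF.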
Replies are listed 'Best First'.
Re: norobotlog
by sintadil (Pilgrim) on Sep 11, 2004 at 13:46 UTC

    It may be a good idea to include other bot patterns, like the Googlebot and other search engine bots. Otherwise, this can be simplified to an egrep command, which is what I'd use anyway.
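
    For the record, a rough sketch of the grep-based approach being alluded to might look like this (same two-pass idea as the Perl above; the file names are made up):

    ```shell
    # Pass 1: collect the unique IPs that ever fetched robots.txt.
    grep 'robots\.txt' access.log | awk '{ print $1 }' | sort -u > bot_ips.txt

    # Pass 2: drop every log line whose IP is on that list.
    # -F treats the IPs as fixed strings so the dots aren't wildcards.
    grep -v -F -f bot_ips.txt access.log
    ```

    One caveat: `-F` matches the IP anywhere on the line, so a bot's address appearing in, say, a request path could knock out an unrelated entry, which the anchored Perl regexp avoids.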
