ciryon has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow Monks!

I'm trying to go through some large (2.8 GB) log files from IIS. Each line looks like:

80.129.152.192 - - [09/Dec/2001:00:23:26 +0100] "GET /informationRoot/foo/bar/foo.gif HTTP/1.1" 200 631

I want to find out the number of unique visits for each log file. My code takes the IP from each line, counts it, and adds it to a list if it's not already there (if it is, it just skips the line). This is not a good solution and it's extremely time consuming.

Here's the code anyway:

if ($ARGV[0]) {
    go();
}
else {
    die "\nUsage: stats.pl [filename] [filename] ...\n\n";
}

sub go() {
    foreach $filename (@ARGV) {
        $file = $filename;
        open (FILE, $file);
        $i = 0;
        my @list = "";
        while (<FILE>) {
            /(.*)\s-\s-/;
            $ip = $1;
            if (notInList($ip)) {
                $i++;
                addToList($ip);
            }
        }
        print "\nVisits for $file is $i\n";
    }
}

# Subfunctions
sub addToList($ip) {
    push @list, $ip;
}

sub notInList($ip) {
    foreach $tmpip (@list) {
        if ($tmpip eq $ip) { return 0; last; }
    }
    return 1;
}

Anyone have suggestions for improving this?


Replies are listed 'Best First'.
•Re: Unique visits - Webserver log parser
by merlyn (Sage) on Feb 27, 2002 at 08:59 UTC
    Anyone have suggestions for improving this?
    Yes, give up on this idea:
    I want to find out the number of unique visits for each log file.
    An IP address is not a "visitor". Many users hide behind a single IP when they share a proxy, and a single user can come from multiple IPs behind a large proxy farm like AOL's.

    There are no visitors, only hits. See one of Alan Flavell's messages for a fuller explanation. Just search for "Flavell visitors" for various takes by this legendary Web Expert on the futility of your task.

    You might as well use Perl's rand function instead.

    -- Randal L. Schwartz, Perl hacker

      Sorry, hits is what I meant.
      ciryon, I'd argue that if you're regularly generating multi-gig log files you need a more high-powered solution than simple IP address analysis. Consider using a service like WebTrends, or rolling your own. If you can create a fast, secure and accurate WebTrends clone for your local site, you'll have done something impressive (a little futile, perhaps, since WebTrends is cheap, but it will be fun).

      Merlyn, while I can't argue with you about code (and your enum example below is nice), I think you're exaggerating Alan Flavell's views as he expressed them. He didn't say (in that message or anything else Google could find for me) that "there are no visitors" or that "IPs are meaningless."

      Your assertion that "there are no visitors, only hits" is wrong on its face. The vast majority of web users accept 3rd-party cookies, and services like WebTrends do a spectacular job of tracking first-time, returning and unique visitors.

      Can you determine exact unique visitors from log files using IP addresses only? No. Should you use IP addresses to identify users or sessions, or as part of a security process? No. These tasks are either futile, dangerous, or both.

      But can you use IP addresses to get a "pretty good" idea of first-time, returning and unique visitors? Yes. There are better methods, but they're much more complex. As long as you know your results won't be very accurate, munging a log file with Perl can be a good, cheap solution. Plus it can be a good exercise, especially for a self-described newbie. So what if AOL users are proxied? They don't all use the same proxy at the same time; you can time sessions out after X minutes and improve your accuracy a bit.
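      For what it's worth, here's a rough sketch of that timeout idea. It is only a sketch: it assumes NCSA-style lines like the one ciryon posted, that the log is already in chronological order, and an arbitrary 30-minute idle window; the %last_seen hash and the hard-coded window are just illustrations, not a recommendation.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Time::Local;

      # map month abbreviations to the 0-based month numbers timegm() wants
      my %months = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
                    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

      my $timeout = 30 * 60;    # arbitrary 30-minute idle window
      my %last_seen;
      my $visits = 0;

      while (<>) {
          # pull the IP and the bracketed timestamp out of an NCSA-style line
          next unless /^(\d+\.\d+\.\d+\.\d+) .*?\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+)/;
          my ($ip, $mday, $mon, $year, $h, $m, $s) = ($1, $2, $3, $4, $5, $6, $7);

          # good enough for bucketing; the timezone offset is ignored
          my $time = timegm($s, $m, $h, $mday, $months{$mon}, $year);

          # count a new "visit" if this IP hasn't been seen within the window
          $visits++ if !exists $last_seen{$ip}
                    or $time - $last_seen{$ip} > $timeout;
          $last_seen{$ip} = $time;
      }

      print "Approximate visits: $visits\n";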
      You might as well use Perl's rand function instead.
      I know this is just hyperbole on your part, but I think it's a disservice to ciryon. He's a novice asking for advice, and I think we owe him honesty.
      --
      man with no legs, inc.
Re: Unique visits - Webserver log parser
by demerphq (Chancellor) on Feb 27, 2002 at 09:58 UTC
    Well, since merlyn has given you an outline of the futility of the task you have set yourself, I'll throw you the solution to making your futility possible. ;-)

    You need a hash. A hash is an associative array, which means it associates a key with a value. Also, there are some minor booboos in your code, such as no strictures or warnings, no variable declarations, and using .* when you shouldn't. Please read


    my %ips;    # This is a hash
    while (<FILE>) {
        # Read em in
        $ips{$1}++ if (/^(\d+\.\d+\.\d+\.\d+)\s-\s-/);  # no dot star
        # I make no promises that the above is a valid
        # IP matcher. It will match IPs and other
        # things too..
    }
    foreach my $ip (keys %ips) {
        # Print em out
        print "IP $ip $ips{$ip}\n";
    }

    HTH (at least on a functional level, on an intentional level I think you should listen to merlyn....)

    Yves / DeMerphq
    --
    When to use Prototypes?

Re: Unique visits - Webserver log parser
by shotgunefx (Parson) on Feb 27, 2002 at 09:57 UTC
    Any problem that basically asks "have I seen this element?" or "how many times have I seen it?" is ideally suited to a hash.
    # Change
    while (<FILE>) {
        /(.*)\s-\s-/;
        $ip = $1;
        if (notInList($ip)) {
            $i++;
            addToList($ip);
        }
    }

    # To
    my %IPs_Seen = ();
    while (<FILE>) {
        # Add 1 to the key $1 (the match)
        $IPs_Seen{$1}++ if m/^(.*)\s-\s-/;
    }

    # Later
    print "\nVisits for $file is ", scalar(keys %IPs_Seen), "\n";

    # To access the IPs found
    my @ips = keys %IPs_Seen;
    This leaves out a host of other issues, but it's a first step. For one, visitors and IPs don't have a one-to-one mapping (proxies, AOL, etc.). It will probably save you a LOT of subroutine calls and repeated scans of the IP list on a file that big, though.
    What if you have a line that doesn't start with a valid IP? There are probably many other issues as well. That aside, there are many log-parsing packages already out there. No need to reinvent the wheel unless, of course, you want the experience or want something different. Reinventing wheels (or trying to) can be a great learning experience.
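    If it matters, a slightly tighter match can skip (and report) lines that don't start with something IP-shaped. This is only a sketch built on the hash approach above, and it still doesn't check that each octet is in the 0-255 range:

    my %IPs_Seen;
    while (<FILE>) {
        if (/^(\d{1,3}(?:\.\d{1,3}){3})\s-\s-/) {
            $IPs_Seen{$1}++;
        }
        else {
            # $. is the current line number of FILE
            warn "Skipping line $. (no leading IP): $_";
        }
    }
    print "\nVisits for $file is ", scalar(keys %IPs_Seen), "\n";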

    -Lee

    "To be civilized is to deny one's nature."
Re: Unique visits - Webserver log parser
by Matts (Deacon) on Feb 27, 2002 at 12:18 UTC
    2.8GB huh? Thems some pretty monstrous log files. Gonna take you a while to churn through them...

    Well, I suspect what you'll find is that this isn't the only statistic you'll need, so why not shove the data into a database instead? Yes, this is going to be more time consuming in the short run, but it's fun to play with alternative suggestions...

    So here's some code that shoves a Combined log format into an SQLite table. Of course you'll double your hard disk requirements, and it'll take probably a good couple of hours to create the database, but once you've done that you can pull off some pretty neat queries.

    use strict;
    use DBI;
    use Time::Piece;
    use Fatal qw(open close);

    # parsing:
    # 213.20.65.52 - - [01/Jan/2002:09:30:38 +0000] "GET /img/bg.gif HTTP/1.1" 200 268
    # "http://www.axkit.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)"

    use constant IP      => 0;
    use constant AUTH    => 1;
    use constant DATE    => 2;
    use constant REQUEST => 3;
    use constant STATUS  => 4;
    use constant BYTES   => 5;
    use constant REFERER => 6;
    use constant UA      => 7;
    use constant METHOD  => 8;
    use constant URI     => 9;

    my $logfile = $ARGV[0] || die "Usage: $0 filename\n";
    open(LOG, $logfile);

    my $dbh = DBI->connect("dbi:SQLite:$logfile.db", "", "",
        { AutoCommit => 0, RaiseError => 1 });

    print "Dropping old table...\n";
    eval { $dbh->do("DROP TABLE access_log"); };
    print "Done\n";

    $dbh->do(<<EOT);
CREATE TABLE access_log (
    when    datetime     not null,
    host    varchar(255) not null,
    method  varchar(10)  not null,
    url     varchar(500) not null,
    auth,
    browser,
    referer,
    status  integer default 0,
    bytes   integer
)
EOT

    my $sth = $dbh->prepare(<<EOT);
INSERT INTO access_log
VALUES ( ?,?,?,?,?,?,?,?,? )
EOT

    my $line = 0;
    while (<LOG>) {
        chomp;  # superfluous, but we do it anyway
        $line++;
        my @vals;
        # adjust the regexp depending on your log format
        if (/^([^ ]*) [^ ]*? ([^ ]*?) \[([^\]]*?)\] "(.*?)" ([^ ]*?) ([^ ]*?) "(.*?)" "(.*?)"$/) {
            @vals = ($1, $2, $3, $4, $5, $6, $7, $8);
        }
        else {
            warn("Corrupt log line: $_\n");
            next;
        }

        eval {
            $vals[DATE] = Time::Piece->strptime($vals[DATE],
                '%d/%b/%Y:%H:%M:%S +0000')->datetime;
        };
        if ($@) {
            die "Failed to parse $vals[DATE] on line $line\n";
        }

        if ($vals[REQUEST] =~ /^(\w+) ([^ ]+)/) {
            $vals[METHOD] = $1;
            $vals[URI]    = $2;
        }
        else {
            warn "Couldn't parse: $vals[REQUEST] on line $line\n"
                if $vals[REQUEST] ne '-';
            $vals[METHOD] = "INVALID_METHOD";
            $vals[URI]    = "-";
        }

        # print join(':', @vals), "\n";

        $sth->execute(@vals[DATE, IP, METHOD, URI, AUTH, UA, REFERER, STATUS, BYTES]);

        unless ($line % 1000) {
            print "Completed $line lines, committing.\n";
            $dbh->commit;
        }
    }
    close LOG;

    $dbh->commit;   # commit any rows added since the last checkpoint
    $sth->finish;
    $dbh->disconnect;
    Comments on the code welcome.
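    For example, once the loader has run, pulling numbers back out is just DBI plus a little SQL. This is a sketch that assumes the table layout above; the access_log.db file name is made up and depends on what you passed to the loader:

    use strict;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:access_log.db", "", "",
                           { RaiseError => 1 });

    # how many distinct client hosts hit the server
    my ($hosts) = $dbh->selectrow_array(
        "SELECT COUNT(DISTINCT host) FROM access_log");
    print "Distinct hosts: $hosts\n";

    # the ten most requested URLs
    my $rows = $dbh->selectall_arrayref(
        "SELECT url, COUNT(*) FROM access_log
          GROUP BY url ORDER BY COUNT(*) DESC LIMIT 10");
    printf "%7d  %s\n", $_->[1], $_->[0] for @$rows;

    $dbh->disconnect;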
      In place of
      use constant IP      => 0;
      use constant AUTH    => 1;
      use constant DATE    => 2;
      use constant REQUEST => 3;
      use constant STATUS  => 4;
      use constant BYTES   => 5;
      use constant REFERER => 6;
      use constant UA      => 7;
      use constant METHOD  => 8;
      use constant URI     => 9;
      you can put this (after installing enum):
      use enum qw(IP AUTH DATE REQUEST STATUS BYTES REFERER UA METHOD URI);
      Much nicer and less chance of making a mistake.

      -- Randal L. Schwartz, Perl hacker

        True enough... Why isn't enum.pm shipping with Perl 5.8??? It seems an ideal candidate to me. I didn't use it because I don't have it installed, though I have been looking at it lately because the code I'm refactoring has a lot of enum-like constants in it.

        Time to fire off an email to jarkko methinks.

        Update: cool, Jarkko has put it on the TODO list for 5.9.

Re: Unique visits - Webserver log parser
by rinceWind (Monsignor) on Feb 27, 2002 at 12:02 UTC
    nms has a text counter.