ciryon has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow Monks!

I'm trying to go through some large (2.8 GB) log files from IIS. Each line looks like:

80.129.152.192 - - [09/Dec/2001:00:23:26 +0100] "GET /informationRoot/foo/bar/foo.gif HTTP/1.1" 200 631

I want to find out the number of unique visits for each log file. My code takes the IP from each line, counts it, and adds it to a list if it's not already there (if it is, it just skips the line). This is not a good solution and it's extremely time consuming.

Here's the code anyway:

if ($ARGV[0]) {
    go();
}
else {
    die "\nUsage: stats.pl [filename] [filename] ...\n\n";
}

sub go() {
    foreach $filename (@ARGV) {
        $file = $filename;
        open (FILE, $file);
        $i = 0;
        my @list = "";
        while (<FILE>) {
            /(.*)\s-\s-/;
            $ip = $1;
            if (notInList($ip)) {
                $i++;
                addToList($ip);
            }
        }
        print "\nVisits for $file is $i\n";
    }
}

# Subfunctions
sub addToList($ip) {
    push @list, $ip;
}

sub notInList($ip) {
    foreach $tmpip (@list) {
        if ($tmpip eq $ip) { return 0; last; }
    }
    return 1;
}

Anyone have suggestions for improving this?


Replies are listed 'Best First'.
•Re: Unique visits - Webserver log parser
by merlyn (Sage) on Feb 27, 2002 at 08:59 UTC
    Anyone have suggestions for improving this?
    Yes, give up on this idea:
    I want to find out the number of unique visits for each log file.
    An IP address is not a "visitor". Many users hide behind a single IP when they share a proxy, and a single user can come from multiple IPs behind a large proxy farm like AOL's.

    There are no visitors, only hits. See one of Alan Flavell's messages for a fuller explanation. Just search for "Flavell visitors" for various takes by this legendary Web Expert on the futility of your task.

    You might as well use Perl's rand function instead.

    -- Randal L. Schwartz, Perl hacker

      Sorry, hits is what I meant.
      ciryon, I'd argue that if you're regularly generating multi-gig log files you need a more high-powered solution than simple IP address analysis. Consider using a service like WebTrends, or rolling your own. If you can create a fast, secure and accurate WebTrends clone for your local site, you'll have done something impressive (a little futile, perhaps, since WebTrends is cheap, but it will be fun).

      Merlyn, while I can't argue with you about code (and your enum example below is nice), I think you're exaggerating Alan Flavell's views as he expressed them. He didn't say (in that message or anything else Google could find for me) that "there are no visitors" or that "IPs are meaningless."

      Your assertion that "there are no visitors, only hits" is wrong on its face. The vast majority of web users accept 3rd-party cookies, and services like WebTrends do a spectacular job of tracking first-time, returning and unique visitors.

      Can you determine exact unique visitors from log files using IP addresses only? No. Should you use IP addresses to identify users or sessions, or as part of a security process? No. These tasks are either futile, dangerous, or both.

      But can you use IP addresses to get a "pretty good" idea of first-time, returning and unique visitors? Yes. There are better methods, but they're much more complex. As long as you know your results won't be very accurate, munging a log file with Perl can be a good, cheap solution. Plus it can be a good exercise, especially for a self-described newbie. So what if AOL users are proxied? They don't all use the same proxy at the same time; you can time sessions out after X minutes and improve your accuracy a bit.
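      For what it's worth, here's a rough sketch of that timeout idea. It is only a sketch: it assumes NCSA-style lines like the one ciryon posted, that the log is already in chronological order, and an arbitrary 30-minute idle window; the %last_seen hash and the hard-coded window are just illustrations, not a recommendation.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Time::Local;

      # map month abbreviations to the 0-based month numbers timegm() wants
      my %months = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
                    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

      my $timeout = 30 * 60;    # arbitrary 30-minute idle window
      my %last_seen;
      my $visits = 0;

      while (<>) {
          # pull the IP and the bracketed timestamp out of an NCSA-style line
          next unless /^(\d+\.\d+\.\d+\.\d+) .*?\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+)/;
          my ($ip, $mday, $mon, $year, $h, $m, $s) = ($1, $2, $3, $4, $5, $6, $7);

          # good enough for bucketing; the timezone offset is ignored
          my $time = timegm($s, $m, $h, $mday, $months{$mon}, $year);

          # count a new "visit" if this IP hasn't been seen within the window
          $visits++ if !exists $last_seen{$ip}
                    or $time - $last_seen{$ip} > $timeout;
          $last_seen{$ip} = $time;
      }

      print "Approximate visits: $visits\n";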
      You might as well use Perl's rand function instead.
      I know this is just hyperbole on your part, but I think it's a disservice to ciryon. He's a novice asking for advice, and I think we owe him honesty.
      --
      man with no legs, inc.
Re: Unique visits - Webserver log parser
by demerphq (Chancellor) on Feb 27, 2002 at 09:58 UTC
    Well, since merlyn has given you an outline of the futility of the task you have set yourself, I'll throw you the solution to making your futility possible. ;-)

    You need a hash. A hash is an associative array, which means it associates a key with a value. Also, there are some minor booboos in your code, such as no strictures or warnings, no variable declarations, and using .* when you shouldn't. Please read


    my %ips;    # This is a hash
    while (<FILE>) {
        # Read em in
        $ips{$1}++ if (/^(\d+\.\d+\.\d+\.\d+)\s-\s-/);  # no dot star
        # I make no promises that the above is a valid
        # IP matcher. It will match IPs and other
        # things too..
    }
    foreach my $ip (keys %ips) {
        # Print em out
        print "IP $ip $ips{$ip}\n";
    }

    HTH (at least on a functional level, on an intentional level I think you should listen to merlyn....)

    Yves / DeMerphq
    --
    When to use Prototypes?

Re: Unique visits - Webserver log parser
by shotgunefx (Parson) on Feb 27, 2002 at 09:57 UTC
    Any problem that basically asks "have I seen this element?" or "how many times have I seen it?" is ideally suited to a hash.
    # Change
    while (<FILE>) {
        /(.*)\s-\s-/;
        $ip = $1;
        if (notInList($ip)) {
            $i++;
            addToList($ip);
        }
    }

    # To
    my %IPs_Seen = ();
    while (<FILE>) {
        # Add 1 to the key $1 (the match)
        $IPs_Seen{$1}++ if m/^(.*)\s-\s-/;
    }

    # Later
    print "\nVisits for $file is ", scalar(keys %IPs_Seen), "\n";

    # To access the IPs found
    my @ips = keys %IPs_Seen;
    This leaves out a host of other issues, but it's a first step. For one, visitors and IPs don't have a one-to-one mapping (proxies, AOL, etc.). It will probably save you a LOT of subroutine calls and repeated scans of the IP list on a file that big, though.
    What if you have a line that doesn't start with a valid IP? There are probably many other issues as well. That aside, there are many log-parsing packages already out there. No need to reinvent the wheel unless, of course, you want the experience or want something different. Reinventing wheels (or trying to) can be a great learning experience.
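    If it matters, a slightly tighter match can skip (and report) lines that don't start with something IP-shaped. This is only a sketch built on the hash approach above, and it still doesn't check that each octet is in the 0-255 range:

    my %IPs_Seen;
    while (<FILE>) {
        if (/^(\d{1,3}(?:\.\d{1,3}){3})\s-\s-/) {
            $IPs_Seen{$1}++;
        }
        else {
            # $. is the current line number of FILE
            warn "Skipping line $. (no leading IP): $_";
        }
    }
    print "\nVisits for $file is ", scalar(keys %IPs_Seen), "\n";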

    -Lee

    "To be civilized is to deny one's nature."
Re: Unique visits - Webserver log parser
by Matts (Deacon) on Feb 27, 2002 at 12:18 UTC
    2.8GB huh? Thems some pretty monstrous log files. Gonna take you a while to churn through them...

    Well, I suspect what you'll find is that this isn't the only statistic you'll need, so why not shove the data into a database instead? Yes, this is going to be more time consuming in the short run, but it's fun to play with alternative suggestions...

    So here's some code that shoves a Combined log format into an SQLite table. Of course you'll double your hard disk requirements, and it'll take probably a good couple of hours to create the database, but once you've done that you can pull off some pretty neat queries.

    use strict;
    use DBI;
    use Time::Piece;
    use Fatal qw(open close);

    # parsing:
    # 213.20.65.52 - - [01/Jan/2002:09:30:38 +0000] "GET /img/bg.gif HTTP/1.1" 200 268
    # "http://www.axkit.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)"

    use constant IP      => 0;
    use constant AUTH    => 1;
    use constant DATE    => 2;
    use constant REQUEST => 3;
    use constant STATUS  => 4;
    use constant BYTES   => 5;
    use constant REFERER => 6;
    use constant UA      => 7;
    use constant METHOD  => 8;
    use constant URI     => 9;

    my $logfile = $ARGV[0] || die "Usage: $0 filename\n";
    open(LOG, $logfile);

    my $dbh = DBI->connect("dbi:SQLite:$logfile.db", "", "",
        { AutoCommit => 0, RaiseError => 1 });

    print "Dropping old table...\n";
    eval { $dbh->do("DROP TABLE access_log"); };
    print "Done\n";

    $dbh->do(<<EOT);
CREATE TABLE access_log (
    when    datetime     not null,
    host    varchar(255) not null,
    method  varchar(10)  not null,
    url     varchar(500) not null,
    auth,
    browser,
    referer,
    status  integer default 0,
    bytes   integer
)
EOT

    my $sth = $dbh->prepare(<<EOT);
INSERT INTO access_log
VALUES ( ?,?,?,?,?,?,?,?,? )
EOT

    my $line = 0;
    while (<LOG>) {
        chomp;  # superfluous, but we do it anyway
        $line++;
        my @vals;
        # adjust the regexp depending on your log format
        if (/^([^ ]*) [^ ]*? ([^ ]*?) \[([^\]]*?)\] "(.*?)" ([^ ]*?) ([^ ]*?) "(.*?)" "(.*?)"$/) {
            @vals = ($1, $2, $3, $4, $5, $6, $7, $8);
        }
        else {
            warn("Corrupt log line: $_\n");
            next;
        }

        eval {
            $vals[DATE] = Time::Piece->strptime($vals[DATE],
                '%d/%b/%Y:%H:%M:%S +0000')->datetime;
        };
        if ($@) {
            die "Failed to parse $vals[DATE] on line $line\n";
        }

        if ($vals[REQUEST] =~ /^(\w+) ([^ ]+)/) {
            $vals[METHOD] = $1;
            $vals[URI]    = $2;
        }
        else {
            warn "Couldn't parse: $vals[REQUEST] on line $line\n"
                if $vals[REQUEST] ne '-';
            $vals[METHOD] = "INVALID_METHOD";
            $vals[URI]    = "-";
        }

        # print join(':', @vals), "\n";

        $sth->execute(@vals[DATE, IP, METHOD, URI, AUTH, UA, REFERER, STATUS, BYTES]);

        unless ($line % 1000) {
            print "Completed $line lines, committing.\n";
            $dbh->commit;
        }
    }
    close LOG;

    $dbh->commit;   # commit any rows added since the last checkpoint
    $sth->finish;
    $dbh->disconnect;
    Comments on the code welcome.
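    For example, once the loader has run, pulling numbers back out is just DBI plus a little SQL. This is a sketch that assumes the table layout above; the access_log.db file name is made up and depends on what you passed to the loader:

    use strict;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:access_log.db", "", "",
                           { RaiseError => 1 });

    # how many distinct client hosts hit the server
    my ($hosts) = $dbh->selectrow_array(
        "SELECT COUNT(DISTINCT host) FROM access_log");
    print "Distinct hosts: $hosts\n";

    # the ten most requested URLs
    my $rows = $dbh->selectall_arrayref(
        "SELECT url, COUNT(*) FROM access_log
          GROUP BY url ORDER BY COUNT(*) DESC LIMIT 10");
    printf "%7d  %s\n", $_->[1], $_->[0] for @$rows;

    $dbh->disconnect;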
      In place of
      use constant IP      => 0;
      use constant AUTH    => 1;
      use constant DATE    => 2;
      use constant REQUEST => 3;
      use constant STATUS  => 4;
      use constant BYTES   => 5;
      use constant REFERER => 6;
      use constant UA      => 7;
      use constant METHOD  => 8;
      use constant URI     => 9;
      you can put this (after installing enum):
      use enum qw(IP AUTH DATE REQUEST STATUS BYTES REFERER UA METHOD URI);
      Much nicer and less chance of making a mistake.

      -- Randal L. Schwartz, Perl hacker

        True enough... Why isn't enum.pm shipping with Perl 5.8??? It seems an ideal candidate to me. I didn't use it because I don't have it installed, though I have been looking at it lately because the code I'm refactoring has a lot of enum-like constants in it.

        Time to fire off an email to jarkko methinks.

        Update: cool, Jarkko has put it on the TODO list for 5.9.

Re: Unique visits - Webserver log parser
by rinceWind (Monsignor) on Feb 27, 2002 at 12:02 UTC
    nms has a text counter.