Earindil has asked for the wisdom of the Perl Monks concerning the following question:

I have a custom monitor that runs every five minutes on several web servers. The very first time it runs, it gets the timestamp from the last line of the access log. Five minutes later, it opens the log file, reads it line by line until it gets to that timestamp, and then tallies up the unique users from that point forward.
Some of these logs are very large and the process takes close to 10 seconds to run on them. I'd like to rewrite it and make it smarter. I started to do this using seek and tell but then hit a snag.
What happens if the log rotates? I'll have to watch for that. I was thinking I could do this by keeping the first line of the log file and if it doesn't match on the next run then I know it's a new log.
My questions: Is seek and tell significantly faster than simply walking through the file with a while loop until I get to where I need to be?
Could someone give me a sample of how I would rewrite this part to do this? I'd appreciate it. I hate creating a bigger than necessary CPU hit on the server.
I was thinking at a minimum, my "next if" statement could be a lot better.
Thanks bunches,
Derek
$FLAG = 0;

# Get last timestamp
if (-e "$OutputDir/$NAME/last_timestamp") {
    open LAST, "<$OutputDir/$NAME/last_timestamp";
    $last_timestamp = <LAST>;
    chomp $last_timestamp;
    close LAST;
} else {
    # Prime the Pump
    $data = `tail -1 $LOG`;
    @data = split(/\ /, $data);
    open LAST, ">$OutputDir/$NAME/last_timestamp";
    print LAST "$data[3]";
    close LAST;
    exit;
}

if (-e "$LOG") {
    open LOG, "<$LOG";
    while (<LOG>) {
        @data = split(/\ /);
        next if (($data[3] ne $last_timestamp) && ($last_timestamp) && (!$FLAG));
        $FLAG = 1;
        next if (($skip) && ($_ =~ m/$skip/));
        $last_timestamp = $data[3];
        $login_id{$data[2]}++;
    }
    close LOG;
    @Result = keys %login_id;
    print "USERS=" . scalar(@Result) . "\n";
    open LAST, ">$OutputDir/$NAME/last_timestamp";
    print LAST "$last_timestamp";
    close LAST;
} else {
    print "LOG FILE NOT FOUND\n";
}

•Re: tracking unique users in a weblog
by merlyn (Sage) on Jan 22, 2004 at 17:05 UTC
    Two points.

    First, File::Tail rocks. Get it, use it.

    Second, you keep using the word "user", but I don't know how you can know that, unless every user is being tracked by a unique cookie or something. I'd certainly hope you aren't confusing the words "IP address" with the word "user", because they are not the same. Many users share the same IP addresses. Many users have multiple IP addresses. An extreme case of that is the largest single group of internet users in the world: AOL. A single page hit from an AOL user can appear to come from many proxy servers.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.
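
    merlyn doesn't include code, but a minimal sketch of the File::Tail approach might look like the following (the log path is hypothetical, and the module must be installed from CPAN):

```perl
use strict;
use warnings;
use File::Tail;

# Hypothetical path; adjust for your server. File::Tail notices
# when the file stops growing and reopens it, which catches rotation.
my $tail = File::Tail->new(
    name        => "/var/log/httpd/access_log",
    maxinterval => 300,   # check for new data at least every 5 minutes
    resetafter  => 600,   # reopen the file if it goes quiet
);

my %login_id;
while (defined(my $line = $tail->read)) {   # blocks until a new line arrives
    my $id = (split / /, $line)[2];
    $login_id{$id}++;
}
```

    Note that read blocks, so this runs as a persistent process (a daemon) rather than a cron-style job that exits every five minutes.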

      I'm using the login ids stored in the access log; this is a secure website. I'll take a look at File::Tail. I glanced at it earlier but thought it was for keeping a process running and simulating a tail -f. I need to be able to kick this process off every 5 minutes and can't keep it running.
Re: tracking unique users in a weblog
by Abigail-II (Bishop) on Jan 22, 2004 at 17:12 UTC
    Is seek and tell significantly faster than simply walking through the file with a while loop until I get to where I need to be?
    Uhm, yes. For the same reason that it's faster to do:
    my $element = $array [$big_num];
    than to do:
    my ($element) = grep {$_ == $big_num} 0 .. $#array;
    With a seek, you do a simple operation once. With a line-by-line scan, you do a complex operation many, many times.
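
    To make the seek approach concrete, here is a minimal, self-contained sketch (it uses a throwaway temp file rather than a real access log, and the log-line format is made up): the second pass jumps straight to the saved byte offset instead of re-reading old lines.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small demo "log" with three old entries
my ($fh, $log) = tempfile();
print $fh "10.0.0.$_ - user$_ [ts]\n" for 1 .. 3;
close $fh;

# First pass: remember where the file currently ends
open my $in, '<', $log or die "Can't read $log: $!";
seek $in, 0, 2;           # whence 2 = SEEK_END
my $offset = tell $in;    # byte position to resume from next run
close $in;

# Simulate new traffic arriving between runs
open my $app, '>>', $log or die "Can't append to $log: $!";
print $app "10.0.0.9 - user9 [ts]\n";
close $app;

# Second pass: jump past the old data in one step, no line-by-line scan
open $in, '<', $log or die "Can't read $log: $!";
seek $in, $offset, 0;     # whence 0 = SEEK_SET
my %login_id;
while (<$in>) {
    my $id = (split / /)[2];
    $login_id{$id}++;
}
close $in;
print "USERS=" . scalar(keys %login_id) . "\n";
```

    Only the one line appended after the saved offset is ever read, no matter how large the log has grown.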

    Personally, I'd have the webserver log to a database.

    Abigail

(z) Re: tracking unique users in a weblog
by zigdon (Deacon) on Jan 22, 2004 at 17:49 UTC

    First, I'd try to write this as a daemon, as merlyn hints. If that isn't possible, then using seek will definitely be a lot faster than reading each line in the log, splitting it, and comparing it with the timestamp you need. Something like this (untested) code:

    # get last known position of EOF
    if (open(SEEK, "<seek")) {
        $seek = <SEEK>;
        chomp $seek;
        close SEEK;
    } else {
        # no EOF recorded, find it
        warn "Failed to read seek: $!\nCreating\n";
        # always a good idea to check the result of your open!
        open(LOG, "<$log") or die "Can't read $log: $!";
        seek(LOG, 0, 2);    # jump to the EOF
        $seek = tell(LOG);
        close LOG;
        &write_seek($seek);
        exit;
    }

    open(LOG, "<$log") or die "Can't read $log: $!";
    seek(LOG, $seek, 0);    # jump to the last EOF
    while (<LOG>) {
        $id = (split / /)[2];
        $login_id{$id}++;
    }
    $seek = tell(LOG);
    &write_seek($seek);
    close LOG;

    sub write_seek {
        my $seek = shift;
        open(SEEK, ">seek") or die "Can't write to seek: $!";
        print SEEK $seek;
        close SEEK;
    }

    Remember, this code has not been even compiled.

    -- zigdon

      Works great. Fixed a minor bug in your first open and added code to detect whether the file has been rotated. I think this will work well, and it's much better than my original idea of comparing the first lines of the files: I simply test whether the old end-of-file offset is larger than the new one.

      thanks much for the assist

      {
          # get end of log file
          open(LOG, "<$LOG") or die "Can't read $LOG: $!";
          seek(LOG, 0, 2);    # jump to the EOF
          $new_seek = tell(LOG);
          close LOG;

          # get last known position of EOF
          if (open(SEEK, "<$OutputDir/$NAME/last_seek")) {
              $seek = <SEEK>;
              chomp $seek;
              close SEEK;
          } else {
              # no EOF recorded, find it
              # warn "Failed to read seek: $!\nCreating\n";
              &write_seek($new_seek);
              exit;
          }

          if ($seek > $new_seek) {
              # New Log File
              $seek = 0;
          }

          open(LOG, "<$LOG") or die "Can't read $LOG: $!";
          seek(LOG, $seek, 0);    # jump to the last EOF
          while (<LOG>) {
              $id = (split / /)[2];
              $login_id{$id}++;
          }
          $seek = tell(LOG);
          &write_seek($seek);
          close LOG;

          @Result = keys %login_id;
          print "USERS=" . scalar(@Result) . "\n";
      }

      sub write_seek {
          my $seek = shift;
          open(SEEK, ">$OutputDir/$NAME/last_seek") or die "Can't write to seek: $!";
          print SEEK $seek;
          close SEEK;
      }
      Thanks, I'll try to implement this on a test box and see how it goes. As for checking whether the log file has rotated, any suggestions other than what I mentioned earlier (checking whether the first line of the file is still the same)?
Re: tracking unique users in a weblog
by CountZero (Bishop) on Jan 22, 2004 at 22:33 UTC
    To see if we are dealing with a newer file, I would check the creation date of the file, which I can get on my Windows XP with scalar localtime((stat 'myfile')[10]).

    Of course I'm assuming that

    1. the log-rotate function makes a new file with a new creation date and
    2. your file system allows you to check the creation date of the file.
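
    On Unix-like filesystems there is usually no true creation date (field 10 of stat is the inode-change time, ctime), so another common rotation check is to compare the file's inode number, field 1 of stat, between runs: rotation replaces the file, and the replacement gets a new inode. A minimal, self-contained sketch using a throwaway temp directory:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $log = "$dir/access_log";

# Create the "current" log and note its inode
open my $fh, '>', $log or die "Can't write $log: $!";
print $fh "old entry\n";
close $fh;
my $old_inode = (stat $log)[1];   # field 1 of stat is the inode number

# Simulate a rotation: the old file is renamed aside, a fresh one created
rename $log, "$log.1" or die "Can't rename $log: $!";
open $fh, '>', $log or die "Can't write $log: $!";
close $fh;

my $new_inode = (stat $log)[1];
print $old_inode == $new_inode ? "same file\n" : "rotated\n";
```

    If the stored inode no longer matches, the saved seek offset belongs to the old file and should be reset to 0.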

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law