in reply to log parsing very slow

Anonymous Monk,
It seems to me that there should be a module on CPAN that already knows how to parse whatever web log format you are dealing with. If you can't find one, or it doesn't do what you need, it is fairly straightforward to fix your problem: add a %seen cache, which will also serve as a counter. This prevents having to search through every previous entry to determine whether the current entry is unique.
    use strict;
    use warnings;
    use File::Basename;

    my (@count, %seen);
    my @mask = qw(/some/path/ some/other/path another/path);

    open(FH, '<', "access.log")
        or die "Unable to open 'access.log' for reading: $!";
    while (<FH>) {
        chomp;
        for my $m (@mask) {
            my $regex = "GET.*" . $m . ".*HTTP/1.1\" 200 [0-9].*";
            if (/$regex/) {
                # Strip the line down to just the requested path
                s/.*GET //;
                s/ HTTP.*//;
                my $bn = basename($_);
                # Record first-seen order; %seen doubles as the counter
                push @count, $bn if !$seen{$bn}++;
            }
        }
    }
    print "$_ = $seen{$_}\n" for @count;
This code is untested, but it should work. If you didn't care about preserving the order of the entries, you could do away with the array altogether, as in the sketch below.
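For example, printing straight from %seen (sorted, since hash order is arbitrary):

    # No @count array: the %seen hash already holds the tallies
    print "$_ = $seen{$_}\n" for sort keys %seen;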

Cheers - L~R

Update: Following the suggestion of others, compile the regexes with qr// outside the loop, which will increase the performance of this solution even more. If you combine the regexes using Regexp::Assemble, there will be an additional boost. A sketch of both ideas follows.
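A minimal sketch of both ideas, reusing the @mask list from above; like the original, this is untested:

    use Regexp::Assemble;

    my @mask = qw(/some/path/ some/other/path another/path);

    # Compile each mask once, outside the read loop, so the pattern
    # is not rebuilt and recompiled for every log line.
    my @regexes = map { qr{GET.*\Q$_\E.*HTTP/1\.1" 200 [0-9]} } @mask;

    # Or assemble all the masks into one combined pattern so each
    # line is scanned only once instead of once per mask.
    my $ra = Regexp::Assemble->new;
    $ra->add( quotemeta $_ ) for @mask;
    my $masks = $ra->re;
    my $match = qr{GET.*$masks.*HTTP/1\.1" 200 [0-9]};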

Re^2: log parsing very slow
by ChemBoy (Priest) on Oct 05, 2005 at 15:27 UTC

    HTTPD::Log::Filter is such a module, though I'm not sure it does much for the specific problems the OP was after. On the other hand, those seem to have been adequately addressed elsewhere, so I'll just throw in a plug for the module (which has worked well for me in the past) and leave it at that. :-)
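    In case it helps anyone, a minimal sketch of using it to keep only lines that parse as valid entries; the parameter names here are from memory, so verify them against the module's CPAN docs:

        use strict;
        use warnings;
        use HTTPD::Log::Filter;

        # 'CLF' (Common Log Format) is an assumption about the log;
        # the module also understands other formats.
        my $filter = HTTPD::Log::Filter->new( format => 'CLF' );

        open my $fh, '<', 'access.log' or die "Can't open access.log: $!";
        while ( my $line = <$fh> ) {
            # filter() returns the line if it matches the format
            print $line if $filter->filter( $line );
        }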



    If God had meant us to fly, he would *never* have given us the railroads.
        --Michael Flanders
