Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

hi there, i am a new perl person, but did a lot of php.

i am writing a web log parsing thingie. it is looking for a couple of regexes and counts some stuff, really simple. on a 200M logfile, it takes around 8.35s real 8.20s user 0.14s system with ONE $mask. as soon as i put 2 or more, the time goes over 100s... i tried to profile, but -d:DProf does not profile builtins.

the script is "learning" the filenames in a particular path and pushing them onto an array. so before pushing, it checks whether there is already a file with the same name.

$ cat test.pl
use strict;
use File::Basename;

my @count = ();
my @mask  = (
    '/some/path/',
    'some/other/path',
    'another/path',
);

open F, "access.log";
while (<F>) {
    chomp;
    foreach my $m (@mask) {
        my $regex = "GET.*" . $m . ".*HTTP/1.1\" 200 [0-9].*";
        if (/$regex/) {
            s/.*GET //;
            s/ HTTP.*//;
            my $bn    = basename($_);
            my $found = 0;
            foreach (@count) {
                if ($_->[0] eq $bn) {
                    $_->[1]++;
                    $found = 1;
                    last;
                }
            }
            if ($found == 0) {
                push @count, [ $bn, 1 ];
            }
        }
    }
}
foreach (@count) {
    print "$_->[0] = $_->[1]\n";
}
close F;
(all this on red hat ES3)

Replies are listed 'Best First'.
Re: log parsing very slow
by japhy (Canon) on Oct 05, 2005 at 13:35 UTC
    I'd suggest using a hash instead of (or in addition to) an array; you'll be able to keep track of things a whole lot faster. The other problem is that for each line of the file, you're compiling X regexes, where X is the number of elements in @mask. I would suggest creating regexes beforehand:
    my @regexes = map qr{GET (.*$_.*) HTTP/1.1" 200 [0-9]}, @mask;
    Using qr// gives us pre-compiled regexes, and I've added capturing parens to the regex so that the part in between "GET " and " HTTP" is captured to $1. Read on:
    my (%count, @order);
    while (<F>) {
        chomp;
        for my $rx (@regexes) {
            if (/$rx/) {
                my $bn = basename($1);
                $count{$bn}++ or push @order, $bn;
            }
        }
    }
    The %count hash tells you how many times a particular basename was seen, and the @order array keeps them in the order they were found. The only really tricky line is $count{$bn}++ or push @order, $bn which means: "If $count{$bn} is zero before we increment it, push $bn to the @order array." This means you won't get the same element in @order twice.
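    For instance, with some made-up basenames:
    my (%count, @order);
    for my $bn (qw(a.html b.html a.html c.html b.html a.html)) {
        $count{$bn}++ or push @order, $bn;
    }
    # @order is ('a.html', 'b.html', 'c.html')
    # %count  is (a.html => 3, b.html => 2, c.html => 1)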

    Update: in retrospect, there's probably no harm in producing a single regex, since looping over the regexes can't provide any additional hits. That is, if regex 1 finds a match and regex 2 finds the same match, then regex 1+2 together will provide the same results.

    my $alts = join '|', map quotemeta, @mask;
    my $rx   = qr{GET (.*(?:$alts).*) HTTP/1.1" 200 [0-9]};

    my (%count, @order);
    while (<F>) {
        chomp;
        if (/$rx/) {
            my $bn = basename($1);
            $count{$bn}++ or push @order, $bn;
        }
    }

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      holy mother of perl, this is truly jawdropping.

      the script finishes in a mere 12s... and you axed half of my script ;-)
      as i don't really care about the order of the entries, another array axed. brilliant.

      it is now clear to me that compiling regexes is more expensive than flying to paris...

      i would like to thank the others as well for their time and effort.

        The two problems worked in tandem to screw you over bigtime. With more than one keyword being used, you had to compile a regex for EVERY line of the file and EVERY keyword. 200 lines and 2 keywords means 400 regex compilations (although you only really need TWO regexes). And using an array instead of a hash to keep track of uniqueness is, well, the road to madness.
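        A quick way to see the cost (a hypothetical Benchmark comparison; cycling two masks through one match op forces a recompile on every call, while the qr// objects are compiled just once):
        use Benchmark qw(cmpthese);

        my @mask     = ('/some/path/', 'some/other/path');
        my @compiled = map qr{GET .*$_.* HTTP}, @mask;
        my $line     = 'GET /some/path/index.html HTTP/1.1" 200 1234';

        cmpthese(-2, {
            interpolated => sub {
                for my $m (@mask) {    # pattern string changes every call
                    () = $line =~ /GET .*$m.* HTTP/;
                }
            },
            precompiled => sub {       # qr// objects, compiled once above
                () = $line =~ $_ for @compiled;
            },
        });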

        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: log parsing very slow
by Skeeve (Parson) on Oct 05, 2005 at 13:39 UTC
    Update: Corrected the RE.

    First thing to note: Use a hash for counting.
    foreach (@count) {
        if ($_->[0] eq $bn) {
            $_->[1]++;
            $found = 1;
            last;
        }
    }
    if ($found == 0) {
        push @count, [ $bn, 1 ];
    }
    becomes
    ++$count{$bn};
    Next thing: Precompile your RE and do it for all masks:
    my $regex = join '|', @mask;   # Now you have a list of alternatives
    $regex = qr<GET (.*\b(?:$regex)\b.*) HTTP/1.1" 200 [0-9].*>;
                                   # now the regex is precompiled
    (Untested!)

    Next thing: Why not replace while searching? So
    if (/$regex/) {
        s/.*GET //;
        s/ HTTP.*//;
        :
        :
    is reduced to
    if (s/$regex/$1/) {
        :
        :
    Putting it all together and eliminating the unnecessary replaces and the chomp gives you this script, which should (UNTESTED) produce the same result as yours, except that the sequence of filenames will be sorted:
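    A sketch along those lines, reassembled from the pieces above (the \b guards from the earlier regex are dropped here, since masks that begin or end with '/' would defeat them):
    use strict;
    use File::Basename;

    my %count;
    my @mask  = ( '/some/path/', 'some/other/path', 'another/path' );
    my $alts  = join '|', @mask;
    my $regex = qr<^.*GET (.*(?:$alts).*) HTTP/1.1" 200 [0-9].*\n?$>;

    open F, '<', 'access.log' or die "Can't read access.log: $!";
    while (<F>) {
        if (s/$regex/$1/) {               # replace while searching;
            ++$count{ basename($_) };     # \n? soaks up the newline, so no chomp
        }
    }
    close F;

    print "$_ = $count{$_}\n" for sort keys %count;   # sorted filenames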

    $\=~s;s*.*;q^|D9JYJ^^qq^\//\\\///^;ex;print
Re: log parsing very slow
by Limbic~Region (Chancellor) on Oct 05, 2005 at 13:46 UTC
    Anonymous Monk,
    It seems to me that there should be a module on CPAN that already knows how to parse whatever web log format you are dealing with. If you can't find one, or it doesn't do what you need, it is fairly straightforward to fix your problem: add a %seen cache, which will also serve as a counter. This prevents having to search through every previous entry to determine whether the new entry is unique.
    use strict;
    use warnings;
    use File::Basename;

    my (@count, %seen);
    my @mask = qw(/some/path/ some/other/path another/path);

    open (FH, '<', "access.log")
        or die "Unable to open 'access.log' for reading: $!";
    while (<FH>) {
        chomp;
        for my $m (@mask) {
            my $regex = "GET.*" . $m . ".*HTTP/1.1\" 200 [0-9].*";
            if (/$regex/) {
                s/.*GET //;
                s/ HTTP.*//;
                my $bn = basename($_);
                push @count, $bn if ! $seen{$bn}++;
            }
        }
    }
    print "$_ = $seen{$_}\n" for @count;
    This code is untested but it should work. If you didn't care about preserving the order of the entries, you could do away with the array altogether.

    Cheers - L~R

    Update: Following the suggestion of others, use qr// outside the loop, which will increase the performance of this solution even more. If you can combine the regexes using Regexp::Assemble, there will be an additional boost.
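    A sketch of the Regexp::Assemble idea (using the module's new/add/re methods; masks quotemeta'd so they match literally):
    use Regexp::Assemble;

    my @mask = qw(/some/path/ some/other/path another/path);
    my $ra   = Regexp::Assemble->new;
    $ra->add(quotemeta $_) for @mask;
    my $rx   = $ra->re;    # one merged, precompiled regex for all masks

    # then a single match per line instead of one match per mask:
    # if (/GET .*$rx.* HTTP/) { ... }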

      HTTPD::Log::Filter is such a module, though I'm not sure it does much for the specific problems the OP was after. On the other hand, those seem to have been adequately addressed elsewhere, so I'll just throw in a plug for the module (which has worked well for me in the past) and leave it at that. :-)



      If God had meant us to fly, he would *never* have given us the railroads.
          --Michael Flanders

Re: log parsing very slow
by Perl Mouse (Chaplain) on Oct 05, 2005 at 13:39 UTC
    By looping over your masks inside a loop over the lines of your log file, you are forcing Perl to recompile a regex for each line, for each mask. So, if you have 5 masks and 100,000 lines in your log file, you are forcing Perl to compile 500,000 regular expressions.

    If you have one mask, Perl knows the regex hasn't changed, so it won't recompile; hence you will compile only one regex, despite your 100,000 log lines.

    You can do two things - either switch the order of the loops, that is, for each mask, create a regex, then loop over the log file (first seeking back to the start of the file!), or create an array of compiled regular expressions, and loop over them.
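    A sketch of the first option (untested; \Q...\E added so the mask is treated literally):
    open F, '<', 'access.log' or die $!;
    for my $m (@mask) {
        my $rx = qr/GET .*\Q$m\E.* HTTP/;   # compiled once per mask
        seek F, 0, 0;                       # rewind to the start of the file
        while (<F>) {
            # ... match against $rx and count ...
        }
    }
    close F;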

    Perl --((8:>*