xorl has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a very large Apache weblog. What I want to do is extract all the hits for each department's directory and put them in their own file. I thought of a couple of ways to do this, but neither seems all that great. My first idea was:
my @dir_list = qw(dept1 dept2 dept3); # actually there are over 40 departments in this list

foreach my $dept (@dir_list) {
    open (DEPTLOG, "+>/data/logs/" . $dept . "current.log");
    open (LOGFILE, "/data/logs/access.log");
    while (<LOGFILE>) {
        if (/$dept/) {
            print DEPTLOG $_;
        }
    }
    close(LOGFILE);
    close(DEPTLOG);
}
The other way I thought of was to store each department's log lines in a hash first, then write them to DEPTLOG:
my @dir_list = qw(dept1 dept2 dept3); # actually there are over 40 departments in this list
my %logs;

open (LOGFILE, "/data/logs/access.log");
while (<LOGFILE>) {
    foreach my $dept (@dir_list) {
        if (/$dept/) {
            $logs{$dept} .= $_;
        }
    }
}
close(LOGFILE);

foreach my $dept (@dir_list) {
    open (DEPTLOG, "+>/data/logs/" . $dept . "current.log");
    print DEPTLOG $logs{$dept};
    close (DEPTLOG);
}
As I said, neither of these seems very good; they both take forever to process. Is there a quicker way to handle this? Thanks.

Re: Break up weblogs
by BrowserUk (Patriarch) on Aug 09, 2004 at 15:37 UTC

    In the first case you are processing the log 40+ times.

    In the second case you are accumulating a *lot* of data in memory.

    The third option is to open the 40 output files and process the log once, writing to the appropriate file as you determine it. Something like this (untested).

    #! perl -slw
    use strict;

    my @deptids = qw[ dept1 dept2 dept3 ];

    my %fh;
    open $fh{ $_ }, '+>', "/data/logs/${_}current.log" or die "$_: $!"
        for @deptids;

    open (LOGFILE, "/data/logs/access.log") or die $!;

    while( defined( my $line = <LOGFILE> ) ) {
        my $match;
        for( @deptids ) {
            $match = $_ and last if $line =~ m[\Q$_];
        }

        if( $match ) {
            ## Note: The {}s around the file handle are required.
            print { $fh{ $match } } $line;
        }
        else {
            print STDERR "'$line' didn't match any dept";
        }
    }

    close $_ for values %fh;
    close LOGFILE;
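    A further tweak (a sketch added here, not from the reply above; untested): if the department names are fixed strings, the per-line loop over 40+ names can be replaced with one precompiled alternation, so each line costs a single regex match. This assumes each name appears literally in its lines.

    #! perl -w
    use strict;

    my @deptids = qw[ dept1 dept2 dept3 ];

    ## Quote each name and join into one pattern. Longest-first ordering
    ## keeps a name like 'dept1' from shadowing 'dept10' in the alternation.
    my $alt = join '|', map quotemeta, sort { length $b <=> length $a } @deptids;
    my $dept_re = qr/($alt)/;

    my %fh;
    open $fh{ $_ }, '>', "/data/logs/${_}current.log" or die "$_: $!"
        for @deptids;

    open my $log, '<', '/data/logs/access.log' or die $!;
    while ( my $line = <$log> ) {
        if ( $line =~ $dept_re ) {
            print { $fh{ $1 } } $line;    ## $1 is the department that matched
        }
        else {
            print STDERR "'$line' didn't match any dept";
        }
    }
    close $_ for values %fh;
    close $log;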

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Break up weblogs
by Fletch (Bishop) on Aug 09, 2004 at 15:19 UTC

    Create a hash of department => IO::File object for that department's log. Read through the main log, determining which department each entry belongs to and printing to the corresponding object.

    my %handles;

    for my $dept ( @departments ) {
        $handles{ $dept } = IO::File->new;
        $handles{ $dept }->open( "> $logdir/$dept.log" ) or die "Open failed: $!\n";
    }

    while( <INLOG> ) {
        my $dept = divine_department( $_ );
        $handles{ $dept }->print( $_ );
    }
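    divine_department is left undefined above; here is a minimal sketch of one possible implementation (the helper body and the log format are assumptions, not from the thread), assuming common-format lines where the department is the first component of the request path:

    ## Hypothetical helper: pull the department out of an Apache access
    ## log line whose request looks like: "GET /dept1/index.html HTTP/1.0"
    sub divine_department {
        my ($line) = @_;
        my ($dept) = $line =~ m{"(?:GET|POST|HEAD)\s+/([^/\s]+)};
        return $dept;    ## undef if the line doesn't match
    }

    A real version should also handle lines that match no known department, e.g. by checking exists $handles{$dept} before printing, or by routing unknowns to a catch-all file.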
Re: Break up weblogs
by xorl (Deacon) on Aug 09, 2004 at 16:11 UTC
    Thanks everyone for such quick responses. It was exactly what I was looking for.
Re: Break up weblogs
by Old_Gray_Bear (Bishop) on Aug 09, 2004 at 15:43 UTC
    In your first example you are scanning the Apache log in its entirety for each department in your list. That's 40+ passes. Fletch's solution reduces that to a single pass; it **will** be faster.

    Your second solution gets the log processing down to a single pass, but then uses memory to hold all the data. For "small" logs this will work, but you will run out of memory as the log grows. The real solution, as Fletch pointed out, is to write each line to its extract file (one per department) as soon as you have determined where it belongs. The code Fletch proposes will also scale nicely as you add more departments (another bonus).

    As to the amount of time it takes, you said "I have a very large Apache log...". There is a basic Principle of Science to bear in mind here:

    TTT -- Things Take Time.

    ----
    I Go Back to Sleep, Now.

    OGB
