rlb3 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to figure out a way to sort large Apache log
files in Perl. I've got something worked out, but I
think it can be made better. Here is what I have so far:
#!/usr/bin/perl -w

use Date::Calc qw|Date_to_Time Decode_Month|;

foreach $file (@ARGV) {
    die("This is not a file: $!") unless (-f $file);
    open(FILE, $file);
    push(@logs, <FILE>);
}

@sorted = map  { $_->[0] }
          sort { $a->[1] <=> $b->[1] }
          map  { chomp; my $stamp = get_timestamp($_); [$_, $stamp] } @logs;

foreach (@sorted) {
    print "$_\n";
}

sub get_timestamp {
    my $line = shift;
    $line =~ /\[(.*) -\d+\]/;
    my $tempdate = $1;
    my ($day,$mon,$year,$hour,$min,$sec) =
        $tempdate =~ /(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+)/;
    $mon = Decode_Month($mon);
    my $stamp = Date_to_Time($year,$mon,$day,$hour,$min,$sec);
}
Anything you have to offer would be great. rlb3

Replies are listed 'Best First'.
Re: Sorting apache log files
by joealba (Hermit) on Sep 30, 2002 at 21:22 UTC
    Well, either dws is right on the money (which is my guess too), or somehow you are really stuck in this ugly situation where you have a big, unsorted Apache log file. Say, perhaps, you get tossed a big concatenated log file from a cluster of 5 machines. Ugh.

    After kicking the sysadmin who gave you such an ugly problem, I would first split the file down into smaller chunks (files, in this case) that your machine can handle. Here's a way to quickly split the file into separate files by month.
    # untested. copy/paste errors from working code are possible. :)
    use FileHandle;

    my %files = ();

    while (<INPUT>) {   # you can open the file by yourself :)

        # Get date of log line
        my $date;
        if (m|^[^\[]+\[\d+/(\w+)/(\d+)|) {
            $date = "$1\_$2.log";
        }
        else {
            next;       # reject bad log line
        }

        if (! defined $files{$date}) {
            $files{$date} = new FileHandle;
            open $files{$date}, ">$date"
                or die "Couldn't open $date: $!\n";
        }

        print { $files{$date} } $_;
    }
Re: Sorting apache log files
by dws (Chancellor) on Sep 30, 2002 at 20:28 UTC
    I've got something worked out but I think it can be made better.

    Better, in this case, might be to do nothing. Apache log files are already sorted by date/time.

    Assuming that you picked a poor example, and are thinking about sorting in general, consider how big the logfiles will get, and whether you might be better off preprocessing them so that a stand-alone sort program (one that knows how to cope with things that can't fit into available virtual memory) can be employed.
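    One way that preprocessing could look (an untested sketch; decorate.pl and the file names below are just placeholders): decorate each line with a numeric key, let sort(1) do the heavy lifting on disk, then strip the key back off.
    #!/usr/bin/perl -w
    # decorate.pl - prepend a sortable epoch key to each Apache log line,
    # so an external sort(1) can order files too big to sort in memory.
    use strict;
    use Date::Calc qw|Date_to_Time Decode_Month|;

    while (my $line = <>) {
        chomp $line;
        # Common log format timestamp: [01/Oct/2002:00:42:00 -0400]
        if ($line =~ m{\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)}) {
            my ($day, $mon, $year, $hour, $min, $sec) = ($1, $2, $3, $4, $5, $6);
            my $m = Decode_Month($mon) or next;   # unknown month name
            my $stamp = Date_to_Time($year, $m, $day, $hour, $min, $sec);
            print "$stamp $line\n";
        }
        # lines without a parsable timestamp are silently dropped
    }

    # Then, from the shell (the key is field 1, the original line is the rest):
    #   perl decorate.pl access_log.* | sort -n | cut -d' ' -f2- > sorted_log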

      Incidentally, I was just thinking that if you really need to do your own sorting and you have issues with available memory, you might consider checking out a radix sort. The variation that saves memory is the one where you write each slot (or perhaps a group of slots) out to an external file. If a partition is unsorted, just sort *that* and combine your partitions in order. Does anyone have a good reference on how a real person might implement a radix sort? I'd just refer back to Knuth's TAoCP vol. 2, but that's not for everyone.
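      Here is roughly what I have in mind (untested, and really a bucket-style first pass on the date rather than a textbook radix sort): partition the lines by the most significant digits of the key into temporary files, then sort each small partition in memory and concatenate the partitions in order.
      #!/usr/bin/perl -w
      # Sketch of a disk-backed, radix-style sort: partition on the most
      # significant part of the key (the date), then sort each partition
      # in memory and emit the partitions in key order.
      use strict;
      use FileHandle;

      my %mon;
      @mon{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)}
          = map { sprintf "%02d", $_ } 1 .. 12;

      # Turn "[01/Oct/2002:00:42:00 ..." into a lexically sortable key
      # like "20021001004200"; returns nothing if the line doesn't match.
      sub stamp_key {
          my ($line) = @_;
          return unless $line =~ m{\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)};
          return "$3$mon{$2}$1$4$5$6";
      }

      my %bucket;    # "YYYYMMDD" => FileHandle for that partition's temp file

      while (my $line = <>) {
          my $key  = stamp_key($line) or next;   # skip unparsable lines
          my $slot = substr($key, 0, 8);         # partition on the date only
          if (!$bucket{$slot}) {
              $bucket{$slot} = new FileHandle;
              open $bucket{$slot}, ">partition.$slot"
                  or die "Couldn't open partition.$slot: $!\n";
          }
          print { $bucket{$slot} } $line;
      }
      close $_ for values %bucket;

      # Each partition is small enough to sort in memory; emitting the
      # partitions in slot order gives a fully sorted stream.
      foreach my $slot (sort keys %bucket) {
          open my $in, "<partition.$slot"
              or die "Couldn't reopen partition.$slot: $!\n";
          print map  { $_->[1] }
                sort { $a->[0] cmp $b->[0] }
                map  { [ stamp_key($_), $_ ] } <$in>;
          close $in;
          unlink "partition.$slot";    # clean up the temporary files
      }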

      Better, in this case, might be to do nothing. Apache log files are already sorted by date/time.

      Not always. For example, one might choose to have different virtual hosts split out into different logs. rlb3 might simply be trying to concatenate different log files and then sort them. Not every person uses the same configuration, you know. :)

Re: Sorting apache log files
by PodMaster (Abbot) on Oct 01, 2002 at 00:42 UTC
    What I'd do is simply use DB_File like so (untested, but hey, I wrote it ;))
    #!/usr/bin/perl -w
    use strict;
    use Fcntl;
    use DB_File;
    use Date::Calc qw|Date_to_Time Decode_Month|;

    my $filename = __FILE__ . ".${$}.db";

    # $DB_BTREE is exported by DB_File
    # it is a sorted balanced binary tree
    # we enable duplicate keys
    $DB_BTREE->{'flags'} = R_DUP;

    # since default sorting is lexical, and we want numeric
    $DB_BTREE->{'compare'} = sub {
        my( $keyA, $keyB ) = @_;
        return $keyA <=> $keyB;
    };

    my $X = tie my(%H), "DB_File", $filename, O_RDWR|O_CREAT|O_TRUNC, 0666, $DB_BTREE
        or die "Cannot open $filename: $!\n";

    foreach my $file (@ARGV) {
        die("This is not a file: $!") unless (-f $file);
        open(FILE, $file) or die "WTF? couldn't open $file cause $!";
        while (<FILE>) {
            chomp;
            my $date = get_timestamp($_);
            $H{$date} = $_;
        }
        close(FILE);
    }

    # iterate through the btree using seq
    # and print each key/value pair.
    my $key    = 0;
    my $value  = 0;
    my $status = "";
    for( $status = $X->seq($key, $value, R_FIRST) ;
         $status == 0 ;
         $status = $X->seq($key, $value, R_NEXT) )
    {
        print "$value\n";
    }

    undef $X;
    untie %H;

    # delete our temporary database
    unlink $filename or die "couldn't delete $filename cause $!";

    exit;

    ## subland
    sub get_timestamp {
        my $line = shift;
        $line =~ /\[(.*) -\d+\]/;
        my $tempdate = $1;
        my ($day,$mon,$year,$hour,$min,$sec) =
            $tempdate =~ /(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+)/;
        $mon = Decode_Month($mon);
        my $stamp = Date_to_Time($year,$mon,$day,$hour,$min,$sec);
    }
    I personally wouldn't bother with all that unnecessary, time-consuming Date::Calc nonsense; and if you skip it, you wouldn't have to bother with the custom compare routine either.
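    For instance (an untested sketch), you can build a key that already sorts correctly as a plain string, straight from the log's timestamp; then DB_File's default lexical ordering does the right thing and the compare callback can go away:
    # Sketch: a sortable key without Date::Calc. "01/Oct/2002:00:42:00"
    # becomes "20021001004200", which sorts correctly as a plain string,
    # so $DB_BTREE's default lexical ordering can be left alone.
    my %mon;
    @mon{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)}
        = map { sprintf "%02d", $_ } 1 .. 12;

    sub get_timestamp {
        my $line = shift;
        if ($line =~ m{\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)}) {
            return "$3$mon{$2}$1$4$5$6";    # YYYYMMDDHHMMSS
        }
        return;                             # no match, no key
    }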

    On another note, your get_timestamp sub has a classic logic flaw: you assign $1 to something without ever knowing whether there was a match. You should write if (/(match)/) { $foo = $1; ... }
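    Applied to the original get_timestamp, that guard might look like this (untested; it keeps the Date::Calc calls):
    sub get_timestamp {
        my $line = shift;
        # Bail out if either match fails, instead of quietly reusing a
        # stale (or undefined) $1 from some earlier pattern match.
        return unless $line =~ /\[(.*?) -\d+\]/;
        my ($day, $mon, $year, $hour, $min, $sec) =
            $1 =~ m{(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)}
            or return;
        return Date_to_Time($year, Decode_Month($mon), $day, $hour, $min, $sec);
    }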

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.

      I would like to thank everyone for their input. All the
      comments were helpful. This was more a programming
      exercise than anything else, but I have run into Apache
      logs that were in a mess and needed to be sorted. It
      was just an idea I had and I wanted to flesh it out.
      Again, thanks to everyone.

      rlb3