in reply to Parsing of the web log file, access_log

Given the DATA at the end of the program, repeated below...

127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:06:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:08:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:15:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:18:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:25:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:35:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:50:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906

...output is...

Time                  Total pages   Avg pages  Min pages  Max pages
-------------------------------------------------------------------
15/Jun/2003:13:05:00            6        1.2          1          2
15/Jun/2003:13:20:00            2        1.0          1          1
15/Jun/2003:13:35:00            1        1.0          1          1
15/Jun/2003:13:50:00            3        1.3          1          2
15/Jun/2003:14:20:00            1        1.0          1          1

I decided to create hash keys based on the desired interval size. That way, there is no need to restructure the hash, or to do any other similar processing, per interval size. That also saves lugging around an array ref for each and every time event in the interval in the meantime.

my ($start , $old , %count);
while ( <LOGFILE> ) {
    my ($time , $file) = (split / /)[3,6] or next;
    ...
    $old = (0 == ($time - $start) % $period) ? $time : $old;
    push @{ $count{$old}->{$time} }, 1;
}
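As a quick illustration (my toy example, not from the original program), here is that loop run over small epoch offsets standing in for parsed log times, showing the two-level structure it builds: one inner hash per interval, keyed by the interval's opening time, with an array of hits per distinct second:

use strict;
use warnings;
use Data::Dumper;

# Toy demonstration of the bucketing idea; 900 seconds = one
# 15-minute period, and the offsets stand in for parsed timestamps.
my $period = 900;
my ($start , $old , %count);
for my $time (0, 0, 60, 300, 900, 960) {
    $start = $time unless defined $start;
    $old = (0 == ($time - $start) % $period) ? $time : $old;
    push @{ $count{$old}->{$time} }, 1;
}
print Dumper(\%count);
# Roughly:
#   0   => { 0 => [1, 1], 60 => [1], 300 => [1] },
#   900 => { 900 => [1], 960 => [1] },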

Time::CTime::strftime() and Time::ParseDate::parsedate() come from the Time-modules collection. Now the program (Jun 22 2003 1810: podified and somewhat restructured)...

#!/usr/local/bin/perl -w
use strict;

=head1 Black Box Model

Given the data...

  127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:13:06:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:14:08:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:13:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:13:15:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:13:18:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:13:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:13:25:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:13:35:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:13:50:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:14:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
  127.0.0.1 - - [15/Jun/2003:14:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906

...with 15 minute interval (period size of I<15>, unit of I<minute>),
output is...

  Time                  Total pages   Avg pages  Min pages  Max pages
  -------------------------------------------------------------------
  15/Jun/2003:13:05:00            6        1.2          1          2
  15/Jun/2003:13:20:00            2        1.0          1          1
  15/Jun/2003:13:35:00            1        1.0          1          1
  15/Jun/2003:13:50:00            3        1.3          1          2
  15/Jun/2003:14:20:00            1        1.0          1          1

=cut

use Time::CTime;
use Time::ParseDate;

use constant SEC_PER_MINUTE => 60;
use constant MIN_PER_HOUR   => 60;
use constant SEC_PER_HOUR   => SEC_PER_MINUTE * MIN_PER_HOUR;

# options
my ($period_size , $unit , $log) = (15 , 'minute' , 'access_log_modified');

# skip unwanted files
my $filter = sub {
    my $re = qr/ [.] (?: js | css | gif ) $/x;
    return ($_[0] =~ m/$re/) ? 1 : 0;
};

my $period = period_in_seconds($period_size , $unit);
show_stat( collect_count($log , $period , $filter) );

=head1 C<show_stat($hash_ref_of_array_ref)>

Given a hash reference of array references with keys as the time (in
seconds), prints the time in human parsable time, and basic statistics
for each array reference.

=cut

sub show_stat {
    my %parsed = %{ +shift };
    my @keys = sort { $a <=> $b } keys %parsed;

    printf "%-20s %11s %10s %9s %9s\n%s\n"
        , 'Time' , 'Total pages' , 'Avg pages' , 'Min pages' , 'Max pages'
        , '-' x (20 + 11 + 10 + 9 + 9 + (2 * 4));

    foreach my $k (@keys) {
        printf "%20s %11d %9.1f %9d %9d\n"
            , strftime( "%d/%b/%Y:%H:%M:%S" , localtime $k)
            , @{ basic_stat( $parsed{$k} ) };
    }
}

=head1 C<$hash_of_array_ref = collect_count($file_name , $period , $code_ref)>

Given a file name and time period (in seconds), returns hash reference
with time in seconds as keys and array reference containing hits for
each time value in the given period.

Optional third parameter, a code reference (that takes file name and
returns true), will be used to filter out the unwanted files if given.

=cut

sub collect_count {
    my ($log , $period , $filter) = @_;

    open(LOGFILE, '<' , $log) || die "Cannot read from $log: $!\n";

    my ($start , $old , %count);
    $filter = sub { 0; } unless $filter;
    while ( <LOGFILE> ) {
        my ($time , $file) = (split / /)[3,6] or next;

        next if $filter->($file);
        next if $time !~ m/ \[ (.+?) \] /x;
        $time = parsedate($1);

        $start = $time unless defined $start;
        $old = (0 == ($time - $start) % $period) ? $time : $old;
        push @{ $count{$old}->{$time} }, 1;
    }
    close(LOGFILE) || die "Could not close $log: $!\n";

    return \%count;
}

=head1 C<$array_ref = basic_stat($hash_of_array_ref)>

Given a hash reference of array references, returns an array reference
composed of size, average, minimum, maximum based on the sizes of each
array reference passed. It may return C<undef> values if passed hash
is empty.

=cut

sub basic_stat {
    my $collection = shift;
    my @raw = map scalar @{ $_ } , values %{$collection};

    my ($size , $avg , $min , $max) = (scalar @raw);
    return [ $size , $avg , $min , $max ] unless $size;

    $avg = sub { my $sum; $sum += $_ foreach @raw; return $sum; }->() / $size;

    $min = $max = $raw[0];
    foreach ( @raw ) {
        $min = $_ if $min > $_;
        $max = $_ if $max < $_;
    }

    return [ $size , $avg , $min , $max ];
}

=head1 C<$period = period_in_seconds($period_size , $unit)>

Given period size and time unit, basically matching...

  m/hour | minute | second/xi

...returns the period in seconds. If period size is not I<true>,
returns 1.

=cut

sub period_in_seconds {
    my ($size , $unit) = @_;
    return 1 unless $size;

    $size = abs($size);
    my $multiplier = $unit =~ m/^ hour | hr /ix ? SEC_PER_HOUR
                   : $unit =~ m/^min/i          ? SEC_PER_MINUTE
                   :                              1
                   ;
    return $size * $multiplier;
}

__DATA__
127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:06:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:08:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:15:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:18:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:25:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:35:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:50:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906

Re: Re: Parsing of the web log file, access_log
by parv (Parson) on Jun 22, 2003 at 22:30 UTC

    Hey, there is a bug. Two lines seem to be missing from the output. The bug is in...

    $old = (0 == ($time - $start) % $period) ? $time : $old;
    ...
    my ($size , ... ) = (scalar @raw);

    ...which should have been...

    $old += ($time - $old >= $period) ? $period : 0;
    ...
    my ($size , ...);
    $size += $_ foreach @raw;
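
    To spell the bug out (my gloss, not part of the original reply): the modulus test only opens a new bucket when a record lands exactly on a multiple of $period after $start, so any interval whose boundary second has no hit gets merged into the previous one. A minimal demonstration:

    use strict;
    use warnings;

    my $period = 900;                 # 15 minutes
    my ($start, $old) = (0, 0);

    # A record arrives 1000 seconds in; nothing hit the 900-second mark.
    my $time = 1000;

    # Buggy test: (1000 - 0) % 900 == 100, so the bucket stays at 0 and
    # the record is filed under the first interval.
    my $buggy = (0 == ($time - $start) % $period) ? $time : $old;

    # Fixed test: a full period has elapsed, so the bucket advances.
    $old += ($time - $old >= $period) ? $period : 0;

    print "buggy bucket: $buggy, fixed bucket: $old\n";   # 0 vs 900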

    One more try to get it right...


      Sorry for the delayed response. Thanks a lot for that code. I was also working on this and managed to get it working.

      Before I clean up the code and post it here, the requirements have changed! :-(

      The idea is to start processing the log file on the quarter hour. That is, if the first line has a timestamp like 25/Apr/2003:13:54:02, we want to throw away all such records and start processing from a record with a timestamp of 25/Apr/2003:14:00:00. Well, the question is, what if we have no such record? What if the timestamp is 14:00:01?

      Good point. Then we want to start processing as if the record's timestamp is 14:00:00. Then consider an interval of 15 min from that timestamp.

      Sorry for the change in requirements!

      Thanks once again!
      Andy

        Well then, are you going to pay me? As I am unemployed, I can really use some USD.

        . . .

        Anyway ...

        1. First, add the parse_date() sub from my earlier post. Rename it to something else, say "time_components", to avoid confusing it with Time::ParseDate::parsedate().
        2. Set $period_size and $unit as appropriate.
        3. When the $time is retrieved in the while loop, check whether the minute is evenly divisible by ($period / SEC_PER_MINUTE).
          • If not, move to the next line.
          • Otherwise, let it be processed as usual.
        4. Giving you something like...

          while ( <FH> ) {
              ...
              next if $time !~ m/ \[ (.+?) \] /x;
              # 2nd last item is the minute; skip lines not on a boundary
              next if time_components($time)->[-2] % ($period / SEC_PER_MINUTE);
              ...
          }
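
        Note that the per-line test above drops every record that is not on a boundary, while the changed requirement only asks to discard records before the first quarter-hour mark and then process everything after it. A fuller sketch of that reading (mine, with the usual elisions; it assumes a whole-hour timezone offset, so that epoch seconds modulo 900 land on :00/:15/:30/:45 local time):

          use Time::ParseDate;

          my $period = 15 * SEC_PER_MINUTE;  # a quarter hour, in seconds
          my $boundary;                      # first mark at/after record one

          while ( <FH> ) {
              my ($time) = (split / /)[3] or next;
              next if $time !~ m/ \[ (.+?) \] /x;
              $time = parsedate($1);

              # Round the first record's time up to the next quarter hour;
              # a record already on the mark is kept as-is.
              unless (defined $boundary) {
                  my $rem = $time % $period;
                  $boundary = $rem ? $time - $rem + $period : $time;
              }

              # Throw away everything before the boundary; a 14:00:01
              # record passes and simply counts in the 14:00:00 window.
              next if $time < $boundary;

              ...   # bucket and count as in collect_count(), with $start
                    # pinned to $boundary instead of the first seen $time
          }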