in reply to Re: Parsing of the web log file, access_log
in thread Parsing of the web log file, access_log

Hey, there is a bug: two lines seem to be missing from the output. The bug is in...

    $old = (0 == ($time - $start) % $period) ? $time : $old;
    ...
    my ($size , ... ) = (scalar @raw);

...which should have been...

    $old += ($time - $old >= $period) ? $period : 0;
    ...
    my ($size , ...);
    $size += $_ foreach @raw;

One more try to get it right...

#!/usr/local/bin/perl -w

use strict;

=head1 Black Box Model

Given the data...

 127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:13:06:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:13:08:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:13:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:13:15:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:13:18:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:13:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:13:25:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:13:35:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:13:50:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:14:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
 127.0.0.1 - - [15/Jun/2003:14:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906

...with 10 minute interval (period size of I<10>, unit of I<minutes>),
output is...

 Time                 Total pages  Avg pages Min pages Max pages
 -------------------------------------------------------------------
 15/Jun/2003:13:05:00           5        1.0         1         2
 15/Jun/2003:13:15:00           3        1.0         1         1
 15/Jun/2003:13:25:00           1        1.0         1         1
 15/Jun/2003:13:35:00           1        1.0         1         1
 15/Jun/2003:13:45:00           1        1.0         1         1
 15/Jun/2003:13:55:00           2        1.0         2         2
 15/Jun/2003:14:05:00           1        1.0         1         1
 15/Jun/2003:14:15:00           1        1.0         1         1

=cut

use Time::CTime;
use Time::ParseDate;

use constant SEC_PER_MINUTE => 60;
use constant MIN_PER_HOUR   => 60;
use constant SEC_PER_HOUR   => SEC_PER_MINUTE * MIN_PER_HOUR;

# skip unwanted files
my $filter = sub {
    my $re = qr/ [.] (?: js | css | gif ) $/x;
    return ($_[0] =~ m/$re/) ? 1 : 0;
};

# options
my ($period_size, $unit, $log) = (10, 'minute', 'access_log_modified');

my $period = period_in_seconds($period_size, $unit);

show_stat( collect_count($log, $period, $filter) );

=head1 C<show_stat($hash_ref_of_array_ref)>

Given a hash reference of array references with keys as the time (in
seconds), prints the time in human parsable time, and basic statistics
for each array reference.

=cut

sub show_stat {
    my %parsed = %{ +shift };

    printf "%-20s %11s %10s %9s %9s\n%s\n",
        'Time', 'Total pages', 'Avg pages', 'Min pages', 'Max pages',
        '-' x (20 + 11 + 10 + 9 + 9 + (2 * 4));

    my @periods = sort { $a <=> $b } keys %parsed;
    foreach my $k (@periods) {
        printf "%20s %11d %9.1f %9d %9d\n",
            strftime("%d/%b/%Y:%H:%M:%S", localtime $k),
            @{ basic_stat( $parsed{$k} ) };
    }
}

=head1 C<$hash_of_array_ref = collect_count($file_name , $period , $code_ref)>

Given a file name and time period (in seconds), returns hash reference
with time in seconds as keys and array reference containing hits for
each time value in the given period.

Optional third parameter, a code reference (that takes file name and
returns true), will be used to filter out the unwanted files if given.

=cut

sub collect_count {
    my ($log, $period, $filter) = @_;

    open(LOGFILE, '<', $log) || die "Cannot read from $log: $!\n";

    my ($start, $old, %count);
    $filter = sub { 0; } unless $filter;
    while ( <LOGFILE> ) {
        my ($time, $file) = (split / /)[3,6] or next;

        next if $filter->($file);
        next if $time !~ m/ \[ (.+?) \] /x;

        $time = parsedate($1);
        $old = $start = $time unless defined $start;
        $old += ($time - $old >= $period) ? $period : 0;

        push @{ $count{$old}->{$time} }, 1;
    }
    close(LOGFILE) || die "Could not close $log: $!\n";

    return \%count;
}

=head1 C<$array_ref = basic_stat($hash_of_array_ref)>

Given a hash reference of array references, returns an array reference
composed of size, average, minimum, maximum based on the sizes of each
array reference passed. It may return C<undef> values if passed hash
is empty.

=cut

sub basic_stat {
    my $collection = shift;
    my @raw = map { scalar @{ $_ } } values %{$collection};

    my ($size, $avg, $min, $max);
    $size += $_ foreach @raw;
    return [ $size, $avg, $min, $max ] unless $size;

    $avg = sub { my $sum; $sum += $_ foreach @raw; return $sum; }->() / $size;

    $min = $max = $raw[0];
    foreach ( @raw ) {
        $min = $_ if $min > $_;
        $max = $_ if $max < $_;
    }

    return [ $size, $avg, $min, $max ];
}

=head1 C<$period = period_in_seconds($period_size , $unit)>

Given period size and time unit, basically matching...

 m/hour | minute | second/xi

...returns the period in seconds. If period size is not I<true>,
returns 1.

=cut

sub period_in_seconds {
    my ($size, $unit) = @_;

    return 1 unless $size;
    $size = abs($size);

    my $multiplier =
        $unit =~ m/^ hour | hr /ix ? SEC_PER_HOUR
      : $unit =~ m/^min/i          ? SEC_PER_MINUTE
      :                              1;

    return $size * $multiplier;
}

__DATA__
127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:06:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:08:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:15:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:18:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:25:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:35:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:50:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906

Re: Re: Re: Parsing of the web log file, access_log
by Andy61 (Initiate) on Jun 24, 2003 at 18:48 UTC

    Sorry for the delayed response. Thanks a lot for that code. I was also working on this and managed to get it working.

    Before I clean up the code and post it here, the requirements have changed! :-(

    The idea is to start processing the log file on the quarter hour. That is, if the first line has a timestamp such as 25/Apr/2003:13:54:02, we want to throw away all such records and begin processing from a record with a timestamp of 25/Apr/2003:14:00:00. Well, the question is: what if we have no such record? What if the timestamp is 14:00:01?

    Good point. Then we want to start processing as if that record's timestamp were 14:00:00, and then consider 15-minute intervals from that timestamp.
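    For the record, snapping an epoch timestamp to the quarter hour is plain integer arithmetic; a minimal sketch (the helper names `quarter_floor` and `quarter_ceil` are made up, not from any posted code):

```perl
#!/usr/bin/perl -w
use strict;

use constant QUARTER => 15 * 60;    # 900 seconds

# Bucket a record: a timestamp of 14:00:01 is treated as belonging
# to the 14:00:00 bucket.
sub quarter_floor { my $t = shift; return $t - ($t % QUARTER); }

# First quarter-hour boundary at or after $t: records earlier than
# this would be discarded, per the new requirement.
sub quarter_ceil {
    my $t = shift;
    my $r = $t % QUARTER;
    return $r ? $t + (QUARTER - $r) : $t;
}
```

    The same two helpers cover both halves of the requirement: `quarter_ceil` picks the starting boundary, `quarter_floor` assigns each later record to its bucket.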

    Sorry for the change in requirements!

    Thanks once again!
    Andy

      Well then, are you going to pay me? As I am unemployed, I can really use some USD.

      . . .

      Anyway ...

      1. First add the parse_date() sub from my earlier post. Rename it to something else, say "time_components", to avoid confusing it w/ Time::ParseDate::parsedate().
      2. Set $period_size and $unit as appropriate.
      3. When the $time is retrieved in the while loop, check whether the minutes are a multiple of ($period / SEC_PER_MINUTE).
        • If not, move to the next line.
        • Otherwise, let it be processed as usual.
      4. Giving you something like...
        while ( <FH> )
        {
            ...
            next if $time !~ m/ \[ (.+?) \] /x;

            # 2d last item is minute
            next if time_components($time)->[-2] % ($period / SEC_PER_MINUTE);
            ...
        }

        Hi Parv,
        Wish I could pay you! Sorry about your employment status! :-( Here is the Perl script that I created. It works for me and produces the output that I targeted.

        To start processing records on the quarter hour, for example 14:00:00, 14:15:00, ..., I check for (($Minute % 15) != 0) right at the beginning of the subroutine count_recs.
        #!/usr/bin/perl -w

        use strict;
        use Getopt::Long;
        use Time::Local;
        use Time::localtime;

        my $file = "xaa";
        my %increment = ();
        my $interval = 900;    # An interval of 15 minutes

        my %MonthToNumber = ();
        @MonthToNumber{qw (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)}
            = map { sprintf "%02d", $_; } (0..11);

        count_recs();

        sub count_recs {
            my ($Day, $Month, $Year, $Hour, $Minute, $Second);
            my ($datetime, $get_post);
            my $dmytime;
            my $dateproc;
            my @dates = ();
            my @dates_mod = ();

            open (INFILE, "<$file") || die "Cannot read from $file";
            while (<INFILE>) {
                ($datetime, $get_post) = (split / /)[3,6];
                $datetime =~ s/\[//;
                ($Day,$Month,$Year,$Hour,$Minute,$Second)
                    = $datetime =~ m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;
                next if (   $get_post =~ /\.js$/
                         || $get_post =~ /\.gif$/
                         || $get_post =~ /\.css$/
                         || ($Minute % 15) != 0);
                push (@dates, $datetime);
            } # outer while

            # Calculate the time in epoch seconds in the subroutine convert_epoch.
            foreach $dmytime (@dates) {
                my $calc_seconds = &convert_epoch($dmytime);
                push (@dates_mod, $calc_seconds);
            }

            foreach $dateproc (@dates_mod) {
                $increment{$dateproc}++;
            }

            # Print the headings in the output file
            open (OUTFILE, ">>access_log.out") || die "Cannot write to $file";
            print OUTFILE " Date        Range of Time     Visual Pages per second(VPPS)\n";
            print OUTFILE "                               Avg   Max   Min\n";
            print OUTFILE "------------ ----------------- ------------\n";

            &calculate_time();
            print OUTFILE "\n\n\n";

            $interval = 3600;
            &calculate_time();

            close(OUTFILE);
            close(INFILE);
        }

        sub convert_epoch {
            my $localdate = shift;
            my ($epoch_Day,$epoch_Month,$epoch_Year,
                $epoch_Hour,$epoch_Minute,$epoch_Second)
                = $localdate =~ m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;
            my $epoch_seconds = timelocal($epoch_Second, $epoch_Minute, $epoch_Hour,
                $epoch_Day, $MonthToNumber{$epoch_Month}, $epoch_Year - 1900);
            return $epoch_seconds;
        }

        sub calculate_time {
            my $first = 0;
            my $start_time;
            my $end_time;
            my $time;
            my ($max, $min, $avg, $count);
            my $partial = 0;
            my $col_ind;
            my $processed;
            my $tm;

            # keys are epoch seconds, so sort them numerically
            foreach $processed ( sort { $a <=> $b } keys %increment ) {
                if ($first == 0) {
                    $start_time = $processed;
                    $end_time = $start_time + $interval;
                    $time = $start_time;
                    $min = $increment{$processed};
                    $max = 0;
                    $avg = 0;
                    $first++;
                }
                if ($processed < $end_time) {
                    if ($increment{$processed} < $min) {
                        $min = $increment{$processed};
                    }
                    if ($increment{$processed} > $max) {
                        $max = $increment{$processed};
                    }
                    $avg += $increment{$processed};
                    $count++;
                    $partial = 1;
                } else {
                    $avg = $avg / $count;
                    my $tm = localtime($end_time);
                    my $start = localtime($start_time);
                    if ($interval == 900) {
                        $col_ind = 'P';
                    } else {
                        $col_ind = 'H';
                    }
                    printf OUTFILE "%01s %02d/%02d/%04d %02d:%02d:00-%02d:%02d:00 %03.2f %03d %03d\n",
                        $col_ind, ($tm->mon)+1, $tm->mday, $tm->year+1900,
                        $start->hour, $start->min, $tm->hour, $tm->min,
                        $avg, $max, $min;
                    $min = $increment{$processed};
                    $max = $increment{$processed};
                    $avg = $increment{$processed};
                    $count = 1;
                    $start_time = $processed;
                    $time = $start_time;
                    $end_time = $start_time + $interval;
                    $partial = 0;
                }
            }
            if ($partial == 1) {
                $avg = $avg / $count;
                $tm = localtime($end_time);
                my $start = localtime($start_time);
                printf OUTFILE "%01s %02d/%02d/%04d %02d:%02d:00-%02d:%02d:00 %03.2f %03d %03d\n",
                    $col_ind, ($tm->mon)+1, $tm->mday, $tm->year+1900,
                    $start->hour, $start->min, $tm->hour, $tm->min,
                    $avg, $max, $min;
            }
        }

        Please suggest any improvements you can think of. I know that I used %MonthToNumber (converting "Apr" to "04" and so on) as a global. Thanks to you for that input.

        Also, I used the hash %increment as a global. Maybe I could pass it as an argument to the subroutine calculate_time?

        I needed to process two intervals of time, 15 minutes and 1 hour. That is why I called the subroutine twice, with the interval set globally. Any thoughts?
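        For what it's worth, both globals can become arguments; a minimal sketch of the calling convention (the one-line body below is a stand-in, not the real statistics logic):

```perl
#!/usr/bin/perl -w
use strict;

# Sketch: calculate_time takes a hash reference of counts and the
# interval explicitly instead of reading globals.  The body here just
# totals the counts as a placeholder.
sub calculate_time {
    my ($increment_ref, $interval) = @_;
    my $total = 0;
    $total += $increment_ref->{$_} for keys %{ $increment_ref };
    return $total;
}

# Call it once per interval size, no global reassignment needed:
my %increment = ( 1051272000 => 2, 1051272900 => 3 );
my $t15 = calculate_time(\%increment, 900);     # 15-minute pass
my $t60 = calculate_time(\%increment, 3600);    # 1-hour pass
```

        Passing a reference avoids copying the whole hash on each call, and the two passes no longer depend on the order in which $interval is reassigned.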

        Performance-wise, a 100 MB file was processed in about 1 minute 5 seconds.

        My next step is to be able to process multiple input files. Currently the input file is hardcoded, but I would like to pass multiple log files as input and process them all.
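        A minimal sketch of taking the file names from @ARGV instead of a hardcoded $file (the helper name and its line-counting body are illustrative only; the real per-line parsing would go inside the loop):

```perl
#!/usr/bin/perl -w
use strict;

# Sketch: accept any number of log files and process each in turn.
# Here the "processing" is just counting lines across all files.
sub count_lines_in {
    my @files = @_;
    my $lines = 0;
    foreach my $file (@files) {
        open(my $fh, '<', $file) or die "Cannot read from $file: $!\n";
        $lines++ while <$fh>;    # real per-line parsing would go here
        close($fh);
    }
    return $lines;
}

# Usage:  perl count.pl access_log.1 access_log.2 ...
my $total = count_lines_in(@ARGV);
```

        Because %increment accumulates across calls, feeding every file through the same loop merges all the logs into one set of counts automatically.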
        -Andy
        Hi Parv,
        How are you doing? It has been a long time since I last emailed you. I have another question: in the output below, if there is data missing, I should still report the timestamp range, with Avg, Max, and Min reported as zeroes. Basically, my input data could look something like:
        06/14/2003 15:59:12
        06/14/2003 17:00:02
        06/14/2003 17:00:08
        ....

        Currently, my output shows that gap in the data.
         Date        Range of Time     Visual Pages per second(VPPS)
                                       Avg   Max   Min
        ------------ ----------------- ------------
        P 04/25/2003 14:00:00-14:15:00 1.28  006  001
        P 04/25/2003 14:15:00-14:30:00 4.50  010  001
        P 04/25/2003 14:30:00-14:45:00 5.19  014  001
        P 04/25/2003 14:45:00-15:00:00 5.17  013  001
        P 04/25/2003 15:00:00-15:15:00 5.60  013  001
        P 04/25/2003 15:15:00-15:30:00 5.80  014  001
        P 04/25/2003 15:30:00-15:45:00 5.90  014  001
        P 04/25/2003 15:45:00-16:00:00 5.60  011  001
        P 04/25/2003 17:00:00-17:15:00 5.85  015  001
        P 04/25/2003 17:15:00-17:30:00 5.65  012  002
        P 04/25/2003 17:30:00-17:45:00 5.95  022  001
        P 04/25/2003 17:45:00-18:00:00 6.17  016  001
        P 04/25/2003 18:00:00-18:15:00 6.05  012  001
        P 04/25/2003 18:15:00-18:30:00 5.44  015  001
        P 04/25/2003 18:30:00-18:45:00 5.67  015  001
        P 04/25/2003 18:45:00-19:00:00 6.12  014  001
        P 04/25/2003 19:00:00-19:15:00 6.16  014  001
        P 04/25/2003 19:15:00-19:30:00 5.33  016  001
        P 04/25/2003 19:30:00-19:45:00 4.83  012  001
        H 04/25/2003 14:00:00-15:00:00 4.50  014  001
        H 04/25/2003 15:00:00-16:00:00 5.73  014  001
        H 04/25/2003 17:00:00-18:00:00 5.89  022  001
        H 04/25/2003 18:00:00-19:00:00 5.84  015  001
        H 04/25/2003 19:00:00-20:00:00 5.46  016  001

        But I still want to show the missing timestamps as:
        P 04/25/2003 16:00:00-16:15:00 0.00  000  000
        P 04/25/2003 16:15:00-16:30:00 0.00  000  000
        P 04/25/2003 16:30:00-16:45:00 0.00  000  000
        P 04/25/2003 16:45:00-17:00:00 0.00  000  000
        H 04/25/2003 16:00:00-17:00:00 0.00  000  000
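        One possible approach, sketched below: instead of iterating only over the timestamps that actually occur in %increment, walk every fixed-size bucket between the first and last observed timestamps, so empty buckets still produce a row of zeroes (fill_gaps is a made-up name; the hash maps epoch seconds to hit counts, as in the script above):

```perl
#!/usr/bin/perl -w
use strict;

# Sketch: walk every $interval-wide bucket from the first observed
# timestamp to the last, summing the counts that fall in each bucket.
# An empty bucket yields a count of 0, which the printing code can
# turn into a "0.00 000 000" row.
sub fill_gaps {
    my ($increment_ref, $interval) = @_;
    my @times = sort { $a <=> $b } keys %{ $increment_ref };
    return () unless @times;

    my @rows;
    for (my $t = $times[0]; $t <= $times[-1]; $t += $interval) {
        # sum hit counts whose timestamps fall in [$t, $t + $interval)
        my $hits = 0;
        $hits += $increment_ref->{$_}
            for grep { $_ >= $t && $_ < $t + $interval } @times;
        push @rows, [ $t, $hits ];    # $hits is 0 for an empty bucket
    }
    return @rows;
}
```

        The existing Avg/Max/Min logic would then run per bucket, with empty buckets short-circuited to zeroes before the division by $count.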
        I would appreciate help on this.
        Thanks

        Andy