in reply to Re: Re: Parsing of the web log file, access_log
in thread Parsing of the web log file, access_log


Sorry for the delayed response. Thanks a lot for that code. I was also working on this and managed to get it working.

Before I clean up the code and post it here, the requirements have changed! :-(

The idea is to start the processing of the log file by the quarter of hour. This means, if the first line has a timestamp something like 25/Apr/2003:13:54:02, we want to throw away all such records and consider processing from a record having a timestamp 25/Apr/2003:14:00:00. Well, the question is, what if we have no such record?What if the timestamp is 14:00:01?

Good point. Then we want to start processing as if the record's timestamp is 14:00:00. Then consider an interval of 15 min from that timestamp.

Sorry for the change in requirements!

Thanks once again!
Andy
  • Comment on Re: Re: Re: Parsing of the web log file, access_log

Replies are listed 'Best First'.
Re: Re: Re: Re: Parsing of the web log file, access_log
by parv (Parson) on Jun 24, 2003 at 20:32 UTC

    Well then, are you going to pay me? As i am unemployed, i can really use some USD.

    . . .

    Anyway ...

    1. First add the parse_date() sub from my earlier post. Rename it to something else, say "time_components", to avoid confusing it w/ Time::ParseDate::parsedate().
    2. Set $period_size and $unit as appropriate.
    3. When the $time is retrieved in the while loop, check if the minutes are mod ($period / SEC_PER_MINUTE).
      • If not, move to the next line.
      • Otherwise, let it be processed as usual.
    4. Giving you something like...
      while ( <FH> ) { ... next if $time !~ m/ \[ (.+?) \] /x; next # 2d last item is minute unless time_components($time)->[-2] % ($period / SEC_PER_MINUTE); ... }

      Hi Parv,
      Wish I could pay you! Sorry about your employment status! :-( Here is the Perl script that I created. This works for me and produces the output that I targetted for.

      To start processing records by the quarter of the hour, for example, 14:00:00, 14:15,..., I checked for (($minute%15) !=0), right at the beginning of the subroutine count_recs.
      #!/usr/bin/perl -w use strict; use Getopt::Long; use Time::Local; use Time::localtime; my $file="xaa"; my %increment = (); my $interval = 900; #An interval of 15 minutes my %MonthToNumber = (); @MonthToNumber{qw (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec) +} = map { sprintf "%02d", $_; } (0..11); count_recs(); sub count_recs { my ($Day, $Month, $Year, $Hour, $Minute, $Second); my ($datetime, $get_post); my $dmytime; my $dateproc; my @dates = (); my @dates_mod = (); open (INFILE, "<$file") || die "Cannot read from $file"; while (<INFILE>) { ($datetime,$get_post) = (split / /) [3,6]; $datetime =~ s/\[//; ($Day,$Month,$Year,$Hour,$Minute,$Second)= $datetime =~m#^(\d\d) +/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#; next if ($get_post =~ /\.js$/ || $get_post =~ /\.gif$/ || $get_pos +t =~ /\.css$/ || ($Minute%15) != 00); push (@dates, $datetime); } #outer while # Calculate the time in epoch seconds in the subroutine convert_epoch. + foreach $dmytime (@dates){ my $calc_seconds= &convert_epoch($dmytime); push (@dates_mod, $calc_seconds); } foreach $dateproc (@dates_mod) { $increment{$dateproc}++; } # Print the headings in the output file open (OUTFILE, ">>access_log.out") || die "Cannot write to $file"; print OUTFILE " Date Range of Time Visual Pages per seco +nd(VPPS)\n"; print OUTFILE " Avg Max Min\n"; print OUTFILE "------------ ----------------- ------------\n"; &calculate_time(); print OUTFILE "\n\n\n"; $interval = 3600; &calculate_time(); close(OUTFILE); close(INFILE); } sub convert_epoch { my $localdate = shift; my ($epoch_Day,$epoch_Month,$epoch_Year,$epoch_Hour,$epoch_Minute,$ +epoch_Second)= $localdate =~m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\ +d):(\d\d)#; my $epoch_seconds = timelocal($epoch_Second, $epoch_Minute, $epoch_ +Hour, $epoch_Day, $MonthToNumber{$epoch_Month}, $epoch_Year-1900); return $epoch_seconds; } sub calculate_time { my $first = 0; my $start_time; my $end_time; my $time; my ($max, $min, $avg, $count); my $partial=0; my $col_ind; my $processed; my $tm; foreach $processed ( sort keys %increment ) { if ($first == 0 ) { $start_time = $processed; $end_time = $start_time + $interval; $time = $start_time; $min = $increment{$processed}; $max = 0; $avg = 0; $first++; } if ($processed < $end_time) { if ($increment{$processed} < $min) { $min = $increment{$processed}; } if ($increment{$processed} > $max) { $max = $increment{$processed}; } $avg+= $increment{$processed}; $count++; $partial = 1; }else { $avg=$avg/$count; my $tm = localtime($end_time); my $start = localtime($start_time); if ($interval == 900) { $col_ind = 'P'; }else { $col_ind = 'H'; } printf OUTFILE "%01s %02d/%02d/%04d %02d:%02d:00-%02d:%02d:0 +0 %03.2f %03d %03d\n", $col_ind, ($tm->mon)+1,$tm->mday, $tm->year+19 +00,$start->hour, $start->min, $tm->hour,$tm->min, $avg, $max, $min; $min = $increment{$processed}; $max = $increment{$processed}; $avg = $increment{$processed}; $count = 1; $start_time = $processed; $time = $start_time; $end_time = $start_time + $interval; $partial = 0; } } if ( $partial == 1 ) { $avg = $avg/$count; $tm = localtime($end_time); my $start = localtime($start_time); printf OUTFILE "%01s %02d/%02d/%04d %02d:%02d:00-%02d:%02d:00 %0 +3.2f %03d %03d\n", $col_ind, ($tm->mon)+1,$tm->mday, $tm->year+1900,$ +start->hour, $start->min, $tm->hour,$tm->min, $avg, $max, $min; } }

      Pl. suggest any improvements you can think of. I know that I used the MonthToNumber (convert "Apr" to "04" and so on) as a Global. Thanks to you for that input.

      Also, I used the hash %increment as a global. May be I could pass it as an argument to the subroutine calculate_time?

      I needed to process two intervals of time, 15min and 1 hour. That's why I called the subroutine twice, with the interval used globally. Any thoughts?

      Performance wise, I got a 100MB file processed in about 1min 5 secs.

      My next step is to be able to process multiple input files. Currently, I hardcoded an input file. But, I would like to pass multiple log files as input and process them all.
      -Andy

        Forgot to add the output! Here it is.
        Date Range of Time Visual Pages per second(VPPS) Avg Max Min ------------ ----------------- ------------ P 04/25/2003 14:00:00-14:15:00 1.28 006 001 P 04/25/2003 14:15:00-14:30:00 4.50 010 001 P 04/25/2003 14:30:00-14:45:00 5.19 014 001 P 04/25/2003 14:45:00-15:00:00 5.17 013 001 P 04/25/2003 15:00:00-15:15:00 5.60 013 001 P 04/25/2003 15:15:00-15:30:00 5.80 014 001 P 04/25/2003 15:30:00-15:45:00 5.90 014 001 P 04/25/2003 15:45:00-16:00:00 5.60 011 001 P 04/25/2003 16:00:00-16:15:00 5.08 012 001 P 04/25/2003 16:15:00-16:30:00 5.38 011 001 P 04/25/2003 16:30:00-16:45:00 5.48 014 001 P 04/25/2003 16:45:00-17:00:00 5.84 013 001 P 04/25/2003 17:00:00-17:15:00 5.85 015 001 P 04/25/2003 17:15:00-17:30:00 5.65 012 002 P 04/25/2003 17:30:00-17:45:00 5.95 022 001 P 04/25/2003 17:45:00-18:00:00 6.17 016 001 P 04/25/2003 18:00:00-18:15:00 6.05 012 001 P 04/25/2003 18:15:00-18:30:00 5.44 015 001 P 04/25/2003 18:30:00-18:45:00 5.67 015 001 P 04/25/2003 18:45:00-19:00:00 6.12 014 001 P 04/25/2003 19:00:00-19:15:00 6.16 014 001 P 04/25/2003 19:15:00-19:30:00 5.33 016 001 P 04/25/2003 19:30:00-19:45:00 4.83 012 001 H 04/25/2003 14:00:00-15:00:00 4.50 014 001 H 04/25/2003 15:00:00-16:00:00 5.73 014 001 H 04/25/2003 16:00:00-17:00:00 5.43 014 001 H 04/25/2003 17:00:00-18:00:00 5.89 022 001 H 04/25/2003 18:00:00-19:00:00 5.84 015 001 H 04/25/2003 19:00:00-20:00:00 5.46 016 001 ~
      Hi Parv,
      How are you doing? Long time since I emailed you again. I have another question. In the output below, if there is data missing, I should still report the timestamp range and report Avg, Max, Min as 0s (zeroes). Basically what I mean is that I could have my input data something like:
      06/14/2003 06/14/2003 15:59:12 06/14/2003 17:00:02 06/14/2003 17:00:08 ....

      Currently, my output shows that gap in the data.
      Date Range of Time Visual Pages per second(VPPS) Avg Max Min ------------ ----------------- ------------ P 04/25/2003 14:00:00-14:15:00 1.28 006 001 P 04/25/2003 14:15:00-14:30:00 4.50 010 001 P 04/25/2003 14:30:00-14:45:00 5.19 014 001 P 04/25/2003 14:45:00-15:00:00 5.17 013 001 P 04/25/2003 15:00:00-15:15:00 5.60 013 001 P 04/25/2003 15:15:00-15:30:00 5.80 014 001 P 04/25/2003 15:30:00-15:45:00 5.90 014 001 P 04/25/2003 15:45:00-16:00:00 5.60 011 001 P 04/25/2003 17:00:00-17:15:00 5.85 015 001 P 04/25/2003 17:15:00-17:30:00 5.65 012 002 P 04/25/2003 17:30:00-17:45:00 5.95 022 001 P 04/25/2003 17:45:00-18:00:00 6.17 016 001 P 04/25/2003 18:00:00-18:15:00 6.05 012 001 P 04/25/2003 18:15:00-18:30:00 5.44 015 001 P 04/25/2003 18:30:00-18:45:00 5.67 015 001 P 04/25/2003 18:45:00-19:00:00 6.12 014 001 P 04/25/2003 19:00:00-19:15:00 6.16 014 001 P 04/25/2003 19:15:00-19:30:00 5.33 016 001 P 04/25/2003 19:30:00-19:45:00 4.83 012 001 H 04/25/2003 14:00:00-15:00:00 4.50 014 001 H 04/25/2003 15:00:00-16:00:00 5.73 014 001 H 04/25/2003 17:00:00-18:00:00 5.89 022 001 H 04/25/2003 18:00:00-19:00:00 5.84 015 001 H 04/25/2003 19:00:00-20:00:00 5.46 016 001

      But I still want to show the missing timestamps as:
      P 04/25/2003 16:00:00-16:15:00 0.00 000 000 P 04/25/2003 16:15:00-16:30:00 0.00 000 000 P 04/25/2003 16:30:00-16:45:00 0.00 000 000 P 04/25/2003 16:45:00-17:00:00 0.00 000 000 H 04/25/2003 16:00:00-17:00:00 0.00 000 000
      Shall appreciate help on this.
      Thanks

      Andy