Andy61 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
I am new to this group. Found very good information!

Here is what I am looking at.

I am creating a perl script to parse the web log file, access_log.

The format of the access_log is:

127.0.0.1 - - 15/Jun/2003:13:54:02 -0100 "GET /xxxx HTTP/1.1" 200 34906

The goal is to

1. Perform a count of the pages for a given timestamp. Multiple pages can share the same timestamp (such as the timestamp shown above).

2. Within a time interval, say 15 minutes starting with the timestamp of the first line in the log file, I would like to compute the average, minimum and maximum number of pages.

3. I would like the output as below. Following is just an example.

Time                   Average Pages   Min Pages   Max Pages
--------------------   -------------   ---------   ---------
15/Jun/2003:14:09:02   6.5             3           10
15/Jun/2003:14:24:02   5.5             4           7

----------

Here is the perl script I created.

----------------
#!/usr/bin/perl
###use strict;
use Getopt::Long;
use Time::Local;

my $file = "access_log_modified";
my $begin_time = "";
my $end_time;
my @visual_pages = ();
my @final_visual_pages = ();
my %increment = ();
my ($datetime, $get_post, $Day, $Month, $Year, $Hour, $Minute, $Second);
my $interval = 60;    #An interval of 1 minute

count_recs();

sub count_recs {
    open (INFILE, "<$file") || die "Cannot read from $file";
    WHILELOOP: while (<INFILE>) {
        ($datetime,$get_post) = (split / /) [3,6];
        $datetime =~ s/\[//;
        ($Day,$Month,$Year,$Hour,$Minute,$Second) =
            $datetime =~ m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;
        next WHILELOOP if ($get_post =~ /\.js$/ || $get_post =~ /\.gif$/ || $get_post =~ /\.css$/);
        unless ($begin_time) {
            $begin_time = $datetime;
        }
        push (@dates, $datetime);
    } #outer while
    foreach $dateproc (@dates) {
        $increment{$dateproc}++;
    }
    foreach $dateproc (sort keys %increment) {
        push (@{$processed_visual_pages{$dateproc}}, $increment{$dateproc});
        print "$dateproc @{$processed_visual_pages{$dateproc}}\n";
    }
    close(INFILE);
}

----------

Here is the output I get:

---------------

25/Apr/2003:13:54:02 3

25/Apr/2003:13:54:19 2

25/Apr/2003:13:54:22 4

25/Apr/2003:13:54:34 3

25/Apr/2003:13:54:38 5

-----------------

I am able to get the count for each of the timestamps. However, I am having trouble grouping the records into the interval ranges.

I would appreciate early help in solving this problem.

Thanks in Advance

Andy

Replies are listed 'Best First'.
Re: Parsing of the web log file, access_log
by tall_man (Parson) on Jun 19, 2003 at 21:34 UTC
    Before spending a lot of time making your own log file parser, you might want to look at what Apache::ParseLog does.
      Hi, I already had a look at it, and it looks to me like it's not what I am looking for. Andy
Re: Parsing of the web log file, access_log
by tall_man (Parson) on Jun 19, 2003 at 23:21 UTC
    Ok then. You have hit counts collected by exact time stamps, and you want to average them over 15-minute intervals. Is the problem that you need to subtract dates and times in order to see if you are within an interval? Then maybe you need Date::Calc or Date::Manip (the latter has a lot of overhead).

      I have no problem subtracting dates, as I use the "localtime" and "timelocal" routines (the latter provided by the standard Perl module "Time::Local").

      Yes, I am having a problem with how to output the records within an interval.

      In fact, I wrote a subroutine which I didn't put in the code that I posted. I know that it's not complete either. This is where I need help!

      Here it is:
      sub calculate_time {
          ($begin_Day,$begin_Month,$begin_Year,$begin_Hour,$begin_Minute,$begin_Second) =
              $begin_time =~ m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;
          ($end_Day,$end_Month,$end_Year,$end_Hour,$end_Minute,$end_Second) =
              $dateproc =~ m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;
          &Initialize;
          my $begin_seconds = timelocal($begin_Second, $begin_Minute, $begin_Hour,
              $begin_Day, $MonthToNumber{$begin_Month}, $begin_Year-1900);
          my $end_seconds = timelocal($end_Second, $end_Minute, $end_Hour,
              $end_Day, $MonthToNumber{$end_Month}, $end_Year-1900);
          my $elapsed = $end_seconds - $begin_seconds;
          if ( $elapsed < $interval ){
              push (my @visual_page_values, {$processed_visual_pages{$dateproc}});
              print "The End seconds are: $dateproc @{$processed_visual_pages{$dateproc}}\n";
          }else {
              $begin_time = $dateproc;
              push (@final_visual_pages, $dateproc);
              print " Final Visual pages are: @final_visual_pages\n";
          }
      }

      sub Initialize {
          my %MonthToNumber=(
              'Jan', '01', 'Feb', '02', 'Mar', '03', 'Apr', '04',
              'May', '05', 'Jun', '06', 'Jul', '07', 'Aug', '08',
              'Sep', '09', 'Oct', '10', 'Nov', '11', 'Dec', '12',
          );
          my %NumberToMonth=(
              '01', 'Jan', '02', 'Feb', '03', 'Mar', '04', 'Apr',
              '05', 'May', '06', 'Jun', '07', 'Jul', '08', 'Aug',
              '09', 'Sep', '10', 'Oct', '11', 'Nov', '12', 'Dec',
          );
      }

        I couldn't take the sub calculate_time as written. Below is a version with the duplication reduced; it does not solve the actual problem of printing data in the intervals (see OP).

        # Jun 20 2003 - create hash w/ "map" instead of explicit creation
        my %MonthToNumber;
        @MonthToNumber{qw (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} =
            map { sprintf "%02d" , $_; } (1..12);
        my %NumberToMonth = map { $MonthToNumber{$_} => $_ } keys %MonthToNumber;

        sub calculate_time {
            my $get_sec = sub {
                my @time = reverse @{ parse_date($_[0]) };
                return
                    # second, minute, hour, day month year
                    timelocal(@time[0..(scalar @time -3)] , $time[-2] -1 , $time[-1]);
            };
            my ($begin_sec , $end_sec) =
                ( $get_sec->($begin_time) , $get_sec->($dateproc) );
            my $elapsed = $end_sec - $begin_sec;

            #printf "BEGIN: %s(%s) END: %s(%s)\nELAPSED: %s\n"
            #  , $begin_sec , $begin_time
            #  , $end_sec , $dateproc
            #  , $elapsed;

            if ( $elapsed < $interval ) {
                push (my @visual_page_values, {$processed_visual_pages{$dateproc}});
                print "The End seconds are: $dateproc @{$processed_visual_pages{$dateproc}}\n";
            }
            else {
                $begin_time = $dateproc;
                push (@final_visual_pages, $dateproc);
                print " Final Visual pages are: @final_visual_pages\n";
            }
        }

        sub parse_date {
            my $date = shift;
            return [ ] unless defined $date;
            my ($day, $month, $year, $hour, $minute, $second) = split '[/:]' , $date;
            return [ $year , $MonthToNumber{$month} , $day , $hour , $minute , $second ];
        }

        Other Notes (Jun 20 2003):

        • If parse_date() is not going to be used elsewhere, contents of the returned array reference should be reversed (to avoid reverse()-ing later for timelocal()).
        • Similarly, if %MonthToNumber is used solely to convert a month name to a number for timelocal(), one could just use the hash values 0-11 instead of 1-12, in which case there would also be no need for sprintf. More importantly, @time could then be passed as is, w/o adjusting any individual value.
        • %NumberToMonth seems unnecessary if/when it is employed few times, for some definitions of few.
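
        As a rough sketch of what those notes suggest (not the code from this reply): the month hash holds 0-11, and the parser returns its fields already in timelocal() order, so nothing needs reversing or adjusting. The names %MonthIndex, parse_date_fields() and to_epoch() are made up for illustration; only Time::Local's timelocal() is a real library call, and it is assumed here that the four-digit year is passed through unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Month names mapped straight to the 0..11 range that timelocal() expects.
my %MonthIndex;
@MonthIndex{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

# Hypothetical helper: returns (sec, min, hour, day, month, year), i.e.
# already in timelocal() argument order, or an empty list on bad input.
sub parse_date_fields {
    my ($stamp) = @_;
    return unless defined $stamp;
    my ($day, $mon, $year, $hour, $min, $sec) =
        $stamp =~ m{^\[?(\d\d)/(\w\w\w)/(\d{4}):(\d\d):(\d\d):(\d\d)}
        or return;
    return unless exists $MonthIndex{$mon};
    return ($sec, $min, $hour, $day, $MonthIndex{$mon}, $year);
}

# Hypothetical helper: timestamp string to epoch seconds (undef on bad input).
sub to_epoch {
    my @fields = parse_date_fields($_[0]) or return;
    return timelocal(@fields);
}

my $begin = to_epoch('15/Jun/2003:13:54:02');
my $end   = to_epoch('15/Jun/2003:14:09:02');
print "elapsed: ", $end - $begin, " seconds\n";
```

        With epoch seconds in hand, the "am I still inside the interval?" test is a plain numeric comparison against $interval.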

        Whew! It's very hard to get details from you about what you are doing. I'm still not sure if you've shown me the part that you're having trouble with, because I don't see any code for finding the averages.

        However, I noticed a strange line here:

        push (my @visual_page_values, {$processed_visual_pages{$dateproc}});
        That "my" is scoped inside an if block and it won't be visible elsewhere. Also, for some reason you're creating a hash reference that has only one element, not a key/value pair.

        I notice at the start of your program that you commented out "use strict;" That's a very bad idea. I doubt you will be able to untangle the uses of "my" and global variables until you turn strict back on.
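
        To illustrate the scoping point, here is a minimal sketch that runs under strict. The data and the names %count_for and @interval_counts are invented for the example; they only stand in for the variables in the posted code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up per-timestamp counts, standing in for %processed_visual_pages.
my %count_for = (
    '25/Apr/2003:13:54:02' => 3,
    '25/Apr/2003:13:54:19' => 2,
    '25/Apr/2003:13:54:22' => 4,
);

# Declared outside the loop and the if block, so it survives both.
my @interval_counts;

for my $stamp (sort keys %count_for) {
    if ($count_for{$stamp} > 0) {
        # Push the plain count, not an anonymous hash built with { ... }.
        push @interval_counts, $count_for{$stamp};
    }
}

print "counts in interval: @interval_counts\n";
```

        Had @interval_counts been declared with "my" inside the if block, a fresh empty array would be created on every pass and discarded at the closing brace.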

Re: Parsing of the web log file, access_log
by dash2 (Hermit) on Jun 20, 2003 at 00:30 UTC
    I really think you should consider using the modules other people have mentioned. You are writing your own code to parse the access log, and it looks pretty hairy. Then you are writing your own code to parse dates, and it looks pretty hairy too! Learning to use well-known modules is a price worth paying.

    Of course, you may have a great reason not to use Apache::AccessLog and Date::Manip, but if so, what is it?

    andramoiennepemousapolutropon


      Thanks for the advice. As I mentioned, the code was unfinished and makes you think it's hairy. I was short of ideas on the interval part of the code and that's where I needed help!

      I didn't see any Apache::AccessLog in the CPAN site. Is it available anywhere else? OR you mean the ParseLog Module?

      From your experience, which one is preferable, Date::Manip or Date::Calc?

      Shall appreciate your valuable advice!
      -Andy

        From your experience, which one is preferable, Date::Manip or Date::Calc?
        Date::Calc is almost certainly preferable as it is a faster and smaller module. The only drawback is that it is mostly implemented in XS, which means that you have to be able to compile C in order to install it, but that shouldn't be a problem in most places. Even the author of Date::Manip says in the documentation for that module:
        Is Date::Manip the one you should be using? In my opinion, the answer is no about 90% of the time.
        Look at the Date::Manip manpage if you want to read the reasons for that statement in full.

        /J\
        
Re: Parsing of the web log file, access_log
by parv (Parson) on Jun 22, 2003 at 08:06 UTC

    Given the DATA at the end of the program, repeated below...

    127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:06:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:14:08:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:15:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:18:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:25:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:35:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:50:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:14:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:14:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906

    ...output is...

    Time                  Total pages   Avg pages  Min pages  Max pages
    -------------------------------------------------------------------
    15/Jun/2003:13:05:00            6        1.2          1          2
    15/Jun/2003:13:20:00            2        1.0          1          1
    15/Jun/2003:13:35:00            1        1.0          1          1
    15/Jun/2003:13:50:00            3        1.3          1          2
    15/Jun/2003:14:20:00            1        1.0          1          1
    

    I decided to create hash keys based on the desired interval size. That way, there is no need to restructure the hash, or to do any similar processing, per interval afterwards. It also avoids lugging around an array ref for each and every time event in the interval in the meantime.

    my ($start , $old , %count);
    while ( <LOGFILE> ) {
        my ($time , $file) = (split / /)[3,6] or next;
        ...
        $old = (0 == ($time - $start) % $period) ? $time : $old;
        push @{ $count{$old}->{$time} }, 1;
    }

    Time::CTime::strftime() and Time::ParseDate::parsedate() come from Time-modules collection. Now the program (Jun 22 2003 1810: podified and somewhat restructured)...

      Hey there is a bug. Two lines seem to be missing from the output. Bug is in...

      $old = (0 == ($time - $start) % $period) ? $time : $old;
      ...
      my ($size , ... ) = (scalar @raw);

      ...which should have been...

      $old += ($time - $old >= $period) ? $period : 0;
      ...
      my ($size , ...);
      $size += $_ foreach @raw;

      One more try to get it right...
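
      The full listing is not reproduced here. As a rough sketch of the interval-keyed counting idea, and not the program from this post, the bucket start can also be computed directly by integer division rather than by stepping $old forward. The sub names stamp_to_epoch() and summarize() and the sample timestamps are invented for illustration. The sketch assumes it is acceptable to sort the timestamps first and to treat the logged wall-clock time as UTC (only differences matter). It uses List::Util, POSIX and Time::Local, all core modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max sum);
use POSIX qw(strftime);
use Time::Local qw(timegm);

my $period = 15 * 60;    # interval length in seconds
my %mon;
@mon{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

# Treat the logged wall-clock time as UTC; only differences matter here.
sub stamp_to_epoch {
    my ($d, $m, $y, $H, $M, $S) =
        $_[0] =~ m{(\d\d)/(\w{3})/(\d{4}):(\d\d):(\d\d):(\d\d)} or return;
    return timegm($S, $M, $H, $d, $mon{$m}, $y);
}

# Returns one [bucket_start, total, avg, min, max] row per non-empty interval.
sub summarize {
    my @epochs = sort { $a <=> $b } @_;
    return unless @epochs;
    my $start = $epochs[0];
    my %hits;    # bucket start => { epoch => hit count }
    for my $t (@epochs) {
        my $bucket = $start + int(($t - $start) / $period) * $period;
        $hits{$bucket}{$t}++;
    }
    my @rows;
    for my $bucket (sort { $a <=> $b } keys %hits) {
        my @per_stamp = values %{ $hits{$bucket} };
        my $total = sum(@per_stamp);
        push @rows, [ $bucket, $total, $total / @per_stamp,
                      min(@per_stamp), max(@per_stamp) ];
    }
    return @rows;
}

my @sample = map { stamp_to_epoch($_) } (
    '[15/Jun/2003:13:05:00]', '[15/Jun/2003:13:05:00]',
    '[15/Jun/2003:13:06:00]', '[15/Jun/2003:13:18:00]',
    '[15/Jun/2003:13:20:00]', '[15/Jun/2003:13:35:00]',
);

printf "%-22s %5s %5s %4s %4s\n", 'Time', 'Total', 'Avg', 'Min', 'Max';
for my $row (summarize(@sample)) {
    my ($bucket, @stats) = @$row;
    printf "%-22s %5d %5.1f %4d %4d\n",
        strftime('%d/%b/%Y:%H:%M:%S', gmtime $bucket), @stats;
}
```

      Here the average is the total divided by the number of distinct timestamps in the interval, which is one reading of the original question; intervals with no hits are simply not listed.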


        Sorry for the delayed response. Thanks a lot for that code. I was also working on this and managed to get it working.

        Before I clean up the code and post it here, the requirements have changed! :-(

        The idea is to start processing the log file on the quarter hour. This means that if the first line has a timestamp like 25/Apr/2003:13:54:02, we want to throw away all such records and start processing from a record having the timestamp 25/Apr/2003:14:00:00. Well, the question is, what if we have no such record? What if the timestamp is 14:00:01?

        Good point. Then we want to start processing as if the record's timestamp is 14:00:00. Then consider an interval of 15 min from that timestamp.
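
        A minimal sketch of that alignment, assuming timestamps have already been converted to epoch seconds: round the first timestamp up to the next quarter hour, skip anything earlier, and round later timestamps down to find their interval. The sub names next_quarter() and quarter_of() are made up for this example. Plain modular arithmetic on the epoch lines up with clock quarter hours as long as the time zone offset is a multiple of 15 minutes; the sample values below are built with timegm() to sidestep the local zone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

my $QUARTER = 15 * 60;    # seconds in a quarter hour

# Round an epoch value up to the next quarter-hour boundary
# (a value already on a boundary is returned unchanged).
sub next_quarter {
    my ($epoch) = @_;
    my $rem = $epoch % $QUARTER;
    return $rem ? $epoch + $QUARTER - $rem : $epoch;
}

# Round an epoch value down to the quarter hour it falls in.
sub quarter_of {
    my ($epoch) = @_;
    return $epoch - $epoch % $QUARTER;
}

# Sample records on 25 Apr 2003 (month index 3), chosen for illustration.
my @epochs = (
    timegm( 2, 54, 13, 25, 3, 2003),    # 13:54:02, before the boundary
    timegm( 1,  0, 14, 25, 3, 2003),    # 14:00:01
    timegm(30, 16, 14, 25, 3, 2003),    # 14:16:30
);

my $start = next_quarter($epochs[0]);   # 14:00:00

for my $t (@epochs) {
    next if $t < $start;                # discard records before the boundary
    printf "record at +%d s belongs to the interval starting at +%d s\n",
        $t - $start, quarter_of($t) - $start;
}
```

        A record at 14:00:01 therefore lands in the interval that starts at 14:00:00, even though no record carries exactly that timestamp.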

        Sorry for the change in requirements!

        Thanks once again!
        Andy
Re: Parsing of the web log file, access_log
by YuckFoo (Abbot) on Jun 20, 2003 at 22:33 UTC
    Andy,

    Here is how I would do it.

    - convert all times to seconds.
    - make all times relative to the base time.
    - determine a major key, the fifteen minute interval it's in relative to the base time.
    - determine a minor key, the one minute interval it's in relative to the major key.
    - save memory by processing each 15 minute interval as it completes, in the while loop.

    Hope this gets you on track.

    YuckFoo

    #!/usr/bin/perl

    use strict;
    use DateTime;

    my $MAJOR_SIZE = 15 * 60;
    my $MINOR_NUM = 15;
    my $MINOR_SIZE = $MAJOR_SIZE / $MINOR_NUM;
    my $BASETIME = 0;

    my %ABBREVS;
    @ABBREVS{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (1..12);

    my ($bucket, $oldmajor);

    while (my $line = <DATA>) {
        chomp ($line);
        my (undef, $day, $mon, $year, $hour, $min, $sec) = split(/\W/, $line);

        my $time = DateTime->new(
            year => $year, month => $ABBREVS{$mon}, day => $day,
            hour => $hour, minute => $min, second => $sec,
        );
        $time = $time->epoch();
        $BASETIME = $BASETIME || $time;

        my $relative = $time - $BASETIME;
        my $major = int($relative / $MAJOR_SIZE);
        my $minor = int(($relative - ($major * $MAJOR_SIZE)) / $MINOR_SIZE);

        if ($major != $oldmajor) {
            if (defined($bucket)) {
                process($bucket);
                $bucket = undef;
            }
        }

        if (!defined($bucket)) {
            $bucket = {};
            $bucket->{major} = $major;
            $bucket->{minors} = [];
        }

        $bucket->{minors}[$minor]++;
        $oldmajor = $major;

        print "$line $time $relative $major $minor\n";
    }

    if (defined($bucket)) { process($bucket); }

    #-----------------------------------------------------------
    sub process {
        my ($bucket) = @_;

        my $major = ($bucket->{major} * $MAJOR_SIZE) + $BASETIME;
        print "\nmajor: $major\n";

        for my $i (0..$MINOR_NUM-1) {
            my $minor = ($i * $MINOR_SIZE) + $major;
            print " minor: $minor $bucket->{minors}[$i]\n";
        }
        print "\n";
    }

    __DATA__
    [15/Jun/2003:00:02:27 -0500]
    [15/Jun/2003:00:03:44 -0500]
    [15/Jun/2003:00:03:44 -0500]
    [15/Jun/2003:00:03:44 -0500]
    [15/Jun/2003:00:07:28 -0500]
    [15/Jun/2003:00:08:44 -0500]
    [15/Jun/2003:00:08:45 -0500]
    [15/Jun/2003:00:08:45 -0500]
    [15/Jun/2003:00:12:28 -0500]
    [15/Jun/2003:00:13:45 -0500]
    [15/Jun/2003:00:13:45 -0500]
    [15/Jun/2003:00:13:46 -0500]
    [15/Jun/2003:00:17:29 -0500]
    [15/Jun/2003:00:18:46 -0500]
    [15/Jun/2003:00:18:46 -0500]
    [15/Jun/2003:00:18:47 -0500]
    [15/Jun/2003:00:22:29 -0500]
    [15/Jun/2003:00:23:47 -0500]
    [15/Jun/2003:00:23:47 -0500]
    [15/Jun/2003:00:23:48 -0500]
    [15/Jun/2003:00:27:30 -0500]
    [15/Jun/2003:00:28:48 -0500]
    [15/Jun/2003:00:28:48 -0500]
    [15/Jun/2003:00:28:49 -0500]
    [15/Jun/2003:00:32:30 -0500]
    [15/Jun/2003:00:33:49 -0500]
    [15/Jun/2003:00:33:49 -0500]
    [15/Jun/2003:00:33:49 -0500]
    [15/Jun/2003:00:37:31 -0500]
      Thanks for the post. Sure, let me try it out! However, I didn't understand why you were defining two sizes, 15 min and 1 min. Also, maybe I didn't understand it well, but with this approach, how do I determine the number of identical timestamps? For example, from your data, I could have 2 occurrences of 15/Jun/2003:00:03:44. Maybe some other timestamp has 5 occurrences, and so on?

      Regards

      Andy