Parsing of the web log file, access_log

Andy61 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Parsing of the web log file, access_log by tall_man (Parson) on Jun 19, 2003 at 21:34 UTC
Before spending a lot of time making your own log file parser, you might want to look at what Apache::ParseLog does.
Re: Re: Parsing of the web log file, access_log by Andy61 (Initiate) on Jun 19, 2003 at 21:44 UTC
Hi, I already had a look at it, and it does not look like what I am looking for. Andy
Re: Parsing of the web log file, access_log by tall_man (Parson) on Jun 19, 2003 at 23:21 UTC
Ok then. You have hit counts collected by exact time stamps, and you want to average them over 15-minute intervals. Is the problem that you need to subtract dates and times in order to see if you are within an interval? Then maybe you need Date::Calc or Date::Manip (the latter has a lot of overhead).
Re: Re: Parsing of the web log file, access_log by Andy61 (Initiate) on Jun 19, 2003 at 23:42 UTC
I have no problem subtracting dates, as I use "localtime" and the "timelocal" routine provided by the standard Perl module Time::Local. Yes, I am having a problem with how to output the records within an interval. In fact, I wrote a subroutine which I didn't put in the code that I posted. I know that it's not complete either. This is where I need help! Here it is:

    sub calculate_time {
        ($begin_Day,$begin_Month,$begin_Year,$begin_Hour,$begin_Minute,$begin_Second)=
            $begin_time =~m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;
        ($end_Day,$end_Month,$end_Year,$end_Hour,$end_Minute,$end_Second)=
            $dateproc =~m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;
        &Initialize;
        my $begin_seconds = timelocal($begin_Second, $begin_Minute, $begin_Hour, $begin_Day, $MonthToNumber{$begin_Month}, $begin_Year-1900);
        my $end_seconds = timelocal($end_Second, $end_Minute, $end_Hour, $end_Day, $MonthToNumber{$end_Month}, $end_Year-1900);
        my $elapsed = $end_seconds - $begin_seconds;
        if ( $elapsed < $interval ){
            push (my @visual_page_values, {$processed_visual_pages{$dateproc}});
            print "The End seconds are: $dateproc @{$processed_visual_pages{$dateproc}}\n";
        }else {
            $begin_time = $dateproc;
            push (@final_visual_pages, $dateproc);
            print " Final Visual pages are: @final_visual_pages\n";
        }
    }

    sub Initialize {
        my %MonthToNumber=(
            'Jan', '01', 'Feb', '02', 'Mar', '03', 'Apr', '04',
            'May', '05', 'Jun', '06', 'Jul', '07', 'Aug', '08',
            'Sep', '09', 'Oct', '10', 'Nov', '11', 'Dec', '12',
        );
        my %NumberToMonth=(
            '01', 'Jan', '02', 'Feb', '03', 'Mar', '04', 'Apr',
            '05', 'May', '06', 'Jun', '07', 'Jul', '08', 'Aug',
            '09', 'Sep', '10', 'Oct', '11', 'Nov', '12', 'Dec',
        );
    }
Re: Re: Re: Parsing of the web log file, access_log by parv (Parson) on Jun 20, 2003 at 03:44 UTC
I couldn't take the `sub calculate_time` as written. Below is a version with the duplication reduced; it does not solve the actual problem of printing data in the intervals (see OP).

    # Jun 20 2003 - create hash w/ "map" instead of explicit creation
    my %MonthToNumber;
    @MonthToNumber{qw (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} =
        map { sprintf "%02d" , $_; } (1..12);

    my %NumberToMonth = map { $MonthToNumber{$_} => $_ } keys %MonthToNumber;

    sub calculate_time {
        my $get_sec = sub {
            my @time = reverse @{ parse_date($_[0]) };
            return
                # second, minute, hour, day month year
                timelocal(@time[0..(scalar @time -3)] , $time[-2] -1 , $time[-1]);
        };

        my ($begin_sec , $end_sec) =
            ( $get_sec->($begin_time) , $get_sec->($dateproc) );

        my $elapsed = $end_sec - $begin_sec;

        #printf "BEGIN: %s(%s) END: %s(%s)\nELAPSED: %s\n"
        #  , $begin_sec , $begin_time
        #  , $end_sec , $dateproc
        #  , $elapsed;

        if ( $elapsed < $interval ) {
            push (my @visual_page_values, {$processed_visual_pages{$dateproc}});
            print "The End seconds are: $dateproc @{$processed_visual_pages{$dateproc}}\n";
        } else {
            $begin_time = $dateproc;
            push (@final_visual_pages, $dateproc);
            print " Final Visual pages are: @final_visual_pages\n";
        }
    }

    sub parse_date {
        my $date = shift;
        return [ ] unless defined $date;

        my ($day, $month, $year, $hour, $minute, $second) = split '[/:]' , $date;

        return
            [ $year , $MonthToNumber{$month} , $day , $hour , $minute , $second ];
    }

Other Notes (Jun 20 2003):

- If `parse_date()` is not going to be used elsewhere, the contents of the returned array reference should be reversed (to avoid `reverse()`-ing later for `timelocal()`).
- Similarly, if `%MonthToNumber` is used solely to convert a month name to a number for `timelocal()`, one could just use the hash values 0-11 instead of 1-12. In that case there would also be no need to use `sprintf`. More importantly, `@time` could then be passed as it is, without adjusting any individual value.
- `%NumberToMonth` seems unnecessary if/when it is employed few times, for some definitions of few.
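The first two notes can be sketched roughly as follows. This is only an illustration of the idea, not parv's code: the extra validity check in `parse_date()` and the sample timestamp are assumptions added here.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Month names mapped straight to 0..11, so no sprintf and no "- 1" later.
my %MonthToNumber;
@MonthToNumber{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

# Returns [ second, minute, hour, day, month, year ], i.e. already in the
# order timelocal() takes its arguments; an empty array ref on bad input.
sub parse_date {
    my $date = shift;
    return [] unless defined $date;

    my ($day, $month, $year, $hour, $minute, $second) = split m{[/:]}, $date;
    # Assumed extra check: reject input that does not yield all six fields.
    return [] unless defined $second && exists $MonthToNumber{$month};

    return [ $second, $minute, $hour, $day, $MonthToNumber{$month}, $year ];
}

# Sample timestamp in access_log style (without the surrounding brackets).
my $fields = parse_date('15/Jun/2003:13:05:00');
my $epoch  = timelocal(@$fields);    # passed as is, no reverse, no adjustment
print "$epoch\n";
```

With the fields already in `timelocal()` order, the `$get_sec` closure would shrink to a single call; the printed epoch value depends on the local time zone.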
Re: Re: Re: Re: Parsing of the web log file, access_log by Andy61 (Initiate) on Jun 20, 2003 at 20:42 UTC
Re^3: Parsing of the web log file, access_log by tall_man (Parson) on Jun 20, 2003 at 00:27 UTC
Whew! It's very hard to get details from you about what you are doing. I'm still not sure if you've shown me the part that you're having trouble with, because I don't see any code for finding the averages. However, I noticed a strange line here:

    push (my @visual_page_values, {$processed_visual_pages{$dateproc}});

That "my" is scoped inside an if block and it won't be visible elsewhere. Also, for some reason you're creating a hash reference that has only one element, not a key/value pair.

I notice at the start of your program that you commented out "use strict;". That's a very bad idea. I doubt you will be able to untangle the uses of "my" and global variables until you turn strict back on.
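A minimal sketch of the scoping point, under strict: the array is declared before the block that fills it, and plain values are pushed instead of being wrapped in `{ }`. The names `%pages_at` and `@in_interval` and the sample numbers are hypothetical, not taken from Andy61's program, and the values are assumed to be simple counts.

```perl
use strict;
use warnings;

my $interval = 15 * 60;    # interval length in seconds

# Hypothetical sample: seconds since the start time => pages seen then.
my %pages_at = (0 => 2, 60 => 1, 300 => 3, 900 => 1);

my @in_interval;           # declared outside the if block, so it survives it
my $begin = 0;
for my $t (sort { $a <=> $b } keys %pages_at) {
    if ($t - $begin < $interval) {
        push @in_interval, $pages_at{$t};    # plain value, no { } wrapper
    }
}
print "pages in first interval: @in_interval\n";
```

Had `@in_interval` been declared with `my` inside the `if`, a fresh empty array would be created on every pass and nothing would be left to print after the loop.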
Re: Parsing of the web log file, access_log by dash2 (Hermit) on Jun 20, 2003 at 00:30 UTC
I really think you should consider using the modules other people have mentioned. You are writing your own code to parse the access log, and it looks pretty hairy. Then you are writing your own code to parse dates, and it looks pretty hairy too! Learning to use well-known modules is a price worth paying. Of course, you may have a great reason not to use Apache::AccessLog and Date::Manip, but if so, what is it? andramoiennepemousapolutropon
Re: Re: Parsing of the web log file, access_log by Andy61 (Initiate) on Jun 20, 2003 at 02:25 UTC
Thanks for the advice. As I mentioned, the code was unfinished, which is why it looks hairy. I was short of ideas on the interval part of the code and that's where I needed help! I didn't see any Apache::AccessLog on the CPAN site. Is it available anywhere else? Or do you mean the ParseLog module? From your experience, which one is preferable, Date::Manip or Date::Calc? I shall appreciate your valuable advice! -Andy
Re: Re: Re: Parsing of the web log file, access_log by gellyfish (Monsignor) on Jun 20, 2003 at 10:02 UTC
"From your experience, which one is preferable, Date::Manip or Date::Calc?"

Date::Calc is almost certainly preferable as it is a faster and smaller module. The only drawback is that it is mostly implemented in XS, which means that you have to be able to compile C to be able to install it, but that shouldn't be a problem in most places. Even the author of Date::Manip says in the documentation for that module:

"Is Date::Manip the one you should be using? In my opinion, the answer is no about 90% of the time."

Look at the Date::Manip manpage if you want to read the reasons for that statement in full. /J\
Problem with loading and compiling the Perl module, Date::Calc by Andy61 (Initiate) on Jun 20, 2003 at 18:43 UTC
Re: Problem with loading and compiling the Perl module, Date::Calc by fglock (Vicar) on Jun 20, 2003 at 19:11 UTC
Re: Problem with loading and compiling the Perl module, Date::Calc by fglock (Vicar) on Jun 20, 2003 at 21:21 UTC
Re: Parsing of the web log file, access_log by parv (Parson) on Jun 22, 2003 at 08:06 UTC
Given the `DATA` at the end of the program, repeated below...

    127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:06:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:14:08:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:15:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:18:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:25:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:35:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:13:50:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:14:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
    127.0.0.1 - - [15/Jun/2003:14:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906

...output is...

    Time                  Total pages  Avg pages  Min pages  Max pages
    -------------------------------------------------------------------
    15/Jun/2003:13:05:00            6        1.2          1          2
    15/Jun/2003:13:20:00            2        1.0          1          1
    15/Jun/2003:13:35:00            1        1.0          1          1
    15/Jun/2003:13:50:00            3        1.3          1          2
    15/Jun/2003:14:20:00            1        1.0          1          1

I decided to create hash keys based on the interval size desired. That way, there will be no need to restructure the hash, or to do any other similar processing, by interval size. It also avoids lugging around an array ref for each and every time event in the interval in the meantime.

    my ($start , $old , %count);
    while ( <LOGFILE> )
    { my ($time , $file) = (split / /)[3,6] or next;
      ...
      $old = (0 == ($time - $start) % $period) ? $time : $old;
      push @{ $count{$old}->{$time} }, 1;
    }

`Time::CTime::strftime()` and `Time::ParseDate::parsedate()` come from the Time-modules collection.

Now the program (Jun 22 2003 1810: podified and somewhat restructured)...
Re: Re: Parsing of the web log file, access_log by parv (Parson) on Jun 22, 2003 at 22:30 UTC
Hey, there is a bug. Two lines seem to be missing from the output. The bug is in...

    $old = (0 == ($time - $start) % $period) ? $time : $old;
    ...
    my ($size , ... ) = (scalar @raw);

...which should have been...

    $old += ($time - $old >= $period) ? $period : 0;
    ...
    my ($size , ...);
    $size += $_ foreach @raw;

One more try to get it right...
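For comparison, a minimal sketch of another way to pick the interval key. It is not parv's code: instead of advancing `$old` one period at a time, it floors each record's offset from the first record to a whole number of periods. The times are the sample `DATA` timestamps reduced to seconds since midnight, and it is assumed that no record is earlier than the first one.

```perl
use strict;
use warnings;

my $period = 15 * 60;    # interval length in seconds

# Sample DATA timestamps as seconds since midnight, in file order
# (note that 14:08 appears out of order, as in the original data).
my @times = map { my ($h, $m) = split /:/; ($h * 60 + $m) * 60 }
    qw(13:05 13:05 13:06 14:08 13:10 13:15 13:18 13:20
       13:25 13:35 13:50 14:04 14:04 14:10 14:20);

my $start = $times[0];   # the first record anchors the intervals
my %count;               # interval start => number of records
for my $time (@times) {
    # Floor the offset to a whole number of periods; a gap longer than
    # one period simply skips the empty intervals in between.
    my $bucket = $start + $period * int(($time - $start) / $period);
    $count{$bucket}++;
}

for my $bucket (sort { $a <=> $b } keys %count) {
    printf "%02d:%02d  %d\n",
        int($bucket / 3600), int($bucket % 3600 / 60), $count{$bucket};
}
```

On this data the six intervals starting at 13:05, 13:20, 13:35, 13:50, 14:05 and 14:20 receive 6, 2, 1, 3, 2 and 1 records, which accounts for all 15 lines.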
Re: Re: Re: Parsing of the web log file, access_log by Andy61 (Initiate) on Jun 24, 2003 at 18:48 UTC
Sorry for the delayed response. Thanks a lot for that code. I was also working on this and managed to get it working. But before I could clean up the code and post it here, the requirements changed! :-( The idea is to start the processing of the log file on the quarter of the hour. This means that if the first line has a timestamp like 25/Apr/2003:13:54:02, we want to throw away all such records and start processing from a record having a timestamp of 25/Apr/2003:14:00:00. Well, the question is, what if we have no such record? What if the timestamp is 14:00:01? Good point. Then we want to start processing as if the record's timestamp were 14:00:00, and then consider an interval of 15 min from that timestamp. Sorry for the change in requirements! Thanks once again! Andy
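One possible reading of the changed requirement, as a minimal sketch: round the first timestamp up to the next quarter hour, drop everything before that boundary, and key the remaining records by the quarter hour they fall in. The helper names `next_quarter` and `quarter_of` are hypothetical, and the sketch assumes times are already in seconds and that quarter hours line up with multiples of 900 seconds (true for seconds since midnight, as used here).

```perl
use strict;
use warnings;

my $QUARTER = 15 * 60;

# Round a time up to the next quarter-hour boundary
# (a time already on a boundary is returned unchanged).
sub next_quarter {
    my ($secs) = @_;
    my $rem = $secs % $QUARTER;
    return $rem ? $secs + $QUARTER - $rem : $secs;
}

# Round a time down to the start of the quarter hour it falls in.
sub quarter_of {
    my ($secs) = @_;
    return $secs - $secs % $QUARTER;
}

# Hypothetical records at 13:54:02, 14:00:01 and 14:16:30,
# expressed as seconds since midnight.
my @times = (13*3600 + 54*60 + 2, 14*3600 + 1, 14*3600 + 16*60 + 30);

my $first = next_quarter($times[0]);    # 14:00:00
my %count;                              # quarter-hour start => records
for my $t (@times) {
    next if $t < $first;                # drop the partial leading interval
    $count{ quarter_of($t) }++;
}

printf "%02d:%02d  %d\n", int($_ / 3600), int($_ % 3600 / 60), $count{$_}
    for sort { $a <=> $b } keys %count;
```

A record at 14:00:01 lands in the 14:00:00 interval without any special case, because the key is the boundary rather than the first timestamp seen.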
Re: Re: Re: Re: Parsing of the web log file, access_log by parv (Parson) on Jun 24, 2003 at 20:32 UTC
Re: Re: Re: Re: Re: Parsing of the web log file, access_log by Andy61 (Initiate) on Jun 27, 2003 at 18:18 UTC
Re: Re: Re: Re: Re: Parsing of the web log file, access_log by Andy61 (Initiate) on Jul 11, 2003 at 11:16 UTC
Re: Parsing of the web log file, access_log by YuckFoo (Abbot) on Jun 20, 2003 at 22:33 UTC
Andy,

Here is how I would do it.

- convert all times to seconds.
- make all times relative to the base time.
- determine a major key, the fifteen minute interval it's in relative to the base time.
- determine a minor key, the one minute interval it's in relative to the major key.
- save memory by processing each 15 minute interval as it completes, in the while loop.

Hope this gets you on track.

YuckFoo

    #!/usr/bin/perl

    use strict;
    use DateTime;

    my $MAJOR_SIZE = 15 * 60;
    my $MINOR_NUM = 15;
    my $MINOR_SIZE = $MAJOR_SIZE / $MINOR_NUM;
    my $BASETIME = 0;

    my %ABBREVS;
    @ABBREVS{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (1..12);

    my ($bucket, $oldmajor);

    while (my $line = <DATA>) {
        chomp ($line);
        my (undef, $day, $mon, $year, $hour, $min, $sec) = split(/\W/, $line);

        my $time = DateTime->new(
            year => $year, month => $ABBREVS{$mon}, day => $day,
            hour => $hour, minute => $min, second => $sec,
        );
        $time = $time->epoch();

        $BASETIME = $BASETIME || $time;
        my $relative = $time - $BASETIME;

        my $major = int($relative / $MAJOR_SIZE);
        my $minor = int(($relative - ($major * $MAJOR_SIZE)) / $MINOR_SIZE);

        if ($major != $oldmajor) {
            if (defined($bucket)) {
                process($bucket);
                $bucket = undef;
            }
        }

        if (!defined($bucket)) {
            $bucket = {};
            $bucket->{major} = $major;
            $bucket->{minors} = [];
        }

        $bucket->{minors}[$minor]++;
        $oldmajor = $major;

        print "$line $time $relative $major $minor\n";
    }

    if (defined($bucket)) { process($bucket); }

    #-----------------------------------------------------------
    sub process {
        my ($bucket) = @_;

        my $major = ($bucket->{major} * $MAJOR_SIZE) + $BASETIME;
        print "\nmajor: $major\n";

        for my $i (0..$MINOR_NUM-1) {
            my $minor = ($i * $MINOR_SIZE) + $major;
            print " minor: $minor $bucket->{minors}[$i]\n";
        }
        print "\n";
    }

    __DATA__
    [15/Jun/2003:00:02:27 -0500]
    [15/Jun/2003:00:03:44 -0500]
    [15/Jun/2003:00:03:44 -0500]
    [15/Jun/2003:00:03:44 -0500]
    [15/Jun/2003:00:07:28 -0500]
    [15/Jun/2003:00:08:44 -0500]
    [15/Jun/2003:00:08:45 -0500]
    [15/Jun/2003:00:08:45 -0500]
    [15/Jun/2003:00:12:28 -0500]
    [15/Jun/2003:00:13:45 -0500]
    [15/Jun/2003:00:13:45 -0500]
    [15/Jun/2003:00:13:46 -0500]
    [15/Jun/2003:00:17:29 -0500]
    [15/Jun/2003:00:18:46 -0500]
    [15/Jun/2003:00:18:46 -0500]
    [15/Jun/2003:00:18:47 -0500]
    [15/Jun/2003:00:22:29 -0500]
    [15/Jun/2003:00:23:47 -0500]
    [15/Jun/2003:00:23:47 -0500]
    [15/Jun/2003:00:23:48 -0500]
    [15/Jun/2003:00:27:30 -0500]
    [15/Jun/2003:00:28:48 -0500]
    [15/Jun/2003:00:28:48 -0500]
    [15/Jun/2003:00:28:49 -0500]
    [15/Jun/2003:00:32:30 -0500]
    [15/Jun/2003:00:33:49 -0500]
    [15/Jun/2003:00:33:49 -0500]
    [15/Jun/2003:00:33:49 -0500]
    [15/Jun/2003:00:37:31 -0500]
Re: Re: Parsing of the web log file, access_log by Andy61 (Initiate) on Jun 20, 2003 at 23:08 UTC
Thanks for the post. Sure, let me try it out! However, I didn't understand why you were defining two times, 15 min and 1 min. Also, maybe I didn't understand it well, but with this approach, how do I determine the number of identical timestamps? For example, from your data, I could have 2 occurrences of 15/Jun/2003:00:03:44? Maybe some other timestamp has 5 occurrences, and so on? Regards Andy
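On counting identical timestamps, a minimal sketch that is separate from YuckFoo's program: a hash keyed on the timestamp string tallies how many log lines carry each one. The `%seen` name is hypothetical, and the sample lines are the first five entries of the data above.

```perl
use strict;
use warnings;

# First few sample lines, in the bracketed access_log timestamp form.
my @lines = (
    '[15/Jun/2003:00:02:27 -0500]',
    '[15/Jun/2003:00:03:44 -0500]',
    '[15/Jun/2003:00:03:44 -0500]',
    '[15/Jun/2003:00:03:44 -0500]',
    '[15/Jun/2003:00:07:28 -0500]',
);

my %seen;    # timestamp string => number of lines carrying it
for my $line (@lines) {
    # Capture "dd/Mon/yyyy:hh:mm:ss"; skip lines without a timestamp.
    my ($stamp) = $line =~ m{\[(\d\d/\w{3}/\d{4}:\d\d:\d\d:\d\d)} or next;
    $seen{$stamp}++;
}

# A plain string sort is enough to order timestamps within one day.
for my $stamp (sort keys %seen) {
    print "$stamp $seen{$stamp}\n";
}
```

The same counter could sit inside the while loop of the program above; the one-minute minor buckets there count hits per minute, whereas this hash counts hits per exact second.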