Re: Parsing of the web log file, access_log
by tall_man (Parson) on Jun 19, 2003 at 21:34 UTC
Before spending a lot of time making your own log file parser, you might want to look at what Apache::ParseLog does.
Hi, I already had a look at it, and it looks to me like it's not what I am looking for.
Andy
Re: Parsing of the web log file, access_log
by tall_man (Parson) on Jun 19, 2003 at 23:21 UTC
Ok then. You have hit counts collected by exact time stamps, and you want to average them over 15-minute intervals. Is the problem that you need to subtract dates and times in order to see if you are within an interval? Then maybe you need Date::Calc or Date::Manip (the latter has a lot of overhead).
I have no problem subtracting dates, as I use the "localtime" and "timelocal" routines (the latter provided by the standard Perl module "Time::Local").
Yes, I am having a problem with how to output the records within an interval.
In fact, I wrote a subroutine which I didn't include in the code that I posted. I know that it's not complete either. This is where I need help!
Here it is:
sub calculate_time {
    ($begin_Day,$begin_Month,$begin_Year,$begin_Hour,$begin_Minute,$begin_Second) =
        $begin_time =~ m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;
    ($end_Day,$end_Month,$end_Year,$end_Hour,$end_Minute,$end_Second) =
        $dateproc =~ m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;
    &Initialize;
    my $begin_seconds = timelocal($begin_Second, $begin_Minute, $begin_Hour, $begin_Day, $MonthToNumber{$begin_Month}, $begin_Year-1900);
    my $end_seconds = timelocal($end_Second, $end_Minute, $end_Hour, $end_Day, $MonthToNumber{$end_Month}, $end_Year-1900);
    my $elapsed = $end_seconds - $begin_seconds;
    if ( $elapsed < $interval ){
        push (my @visual_page_values, {$processed_visual_pages{$dateproc}});
        print "The End seconds are: $dateproc @{$processed_visual_pages{$dateproc}}\n";
    }else {
        $begin_time = $dateproc;
        push (@final_visual_pages, $dateproc);
        print " Final Visual pages are: @final_visual_pages\n";
    }
}
sub Initialize {
    my %MonthToNumber=(
        'Jan', '01',
        'Feb', '02',
        'Mar', '03',
        'Apr', '04',
        'May', '05',
        'Jun', '06',
        'Jul', '07',
        'Aug', '08',
        'Sep', '09',
        'Oct', '10',
        'Nov', '11',
        'Dec', '12',
    );
    my %NumberToMonth=(
        '01', 'Jan',
        '02', 'Feb',
        '03', 'Mar',
        '04', 'Apr',
        '05', 'May',
        '06', 'Jun',
        '07', 'Jul',
        '08', 'Aug',
        '09', 'Sep',
        '10', 'Oct',
        '11', 'Nov',
        '12', 'Dec',
    );
}
# Jun 20 2003 - create hash w/ "map" instead of explicit creation
my %MonthToNumber;
@MonthToNumber{qw (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} =
    map { sprintf "%02d" , $_; } (1..12);
my %NumberToMonth =
    map { $MonthToNumber{$_} => $_ } keys %MonthToNumber;
sub calculate_time {
    my $get_sec =
        sub {
            my @time = reverse @{ parse_date($_[0]) };
            return
                # second, minute, hour, day month year
                timelocal(@time[0..(scalar @time -3)] , $time[-2] -1 , $time[-1]);
        };
    my ($begin_sec , $end_sec) =
        ( $get_sec->($begin_time) , $get_sec->($dateproc) );
    my $elapsed = $end_sec - $begin_sec;
    #printf "BEGIN: %s(%s) END: %s(%s)\nELAPSED: %s\n"
    #  , $begin_sec , $begin_time
    #  , $end_sec , $dateproc
    #  , $elapsed;
    if ( $elapsed < $interval ) {
        push (my @visual_page_values, {$processed_visual_pages{$dateproc}});
        print "The End seconds are: $dateproc @{$processed_visual_pages{$dateproc}}\n";
    } else {
        $begin_time = $dateproc;
        push (@final_visual_pages, $dateproc);
        print " Final Visual pages are: @final_visual_pages\n";
    }
}
sub parse_date {
    my $date = shift;
    return [ ] unless defined $date;
    my ($day, $month, $year, $hour, $minute, $second) =
        split '[/:]' , $date;
    return
        [ $year , $MonthToNumber{$month} , $day
        , $hour , $minute , $second
        ];
}
Other Notes (Jun 20 2003):
- If parse_date() is not going to be used
elsewhere, contents of the returned array reference should
be reversed (to avoid reverse()-ing later for
timelocal()).
- Similarly, if %MonthToNumber is used for the
sole purpose of converting a month name to a number for
timelocal(), one could just use the hash values 0-11
instead of 1-12, in which case there would also be no need to use
sprintf. More importantly, @time could then be
passed as it is, without adjusting any individual value.
- %NumberToMonth seems unnecessary if/when
it is employed few times, for some definitions of few.
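The first two notes above can be sketched roughly as follows. This is only an illustration, not the code from the post: parse_date() here returns a plain list already in timelocal() order, the month hash holds 0-11, and timegm() stands in for timelocal() purely so that the result does not depend on the local time zone.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Month names mapped straight to the 0-11 range that timelocal()/timegm() expect.
my %MonthToNumber;
@MonthToNumber{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

# Return (second, minute, hour, day, month, year), i.e. already in the
# argument order of timelocal(), so no reverse() or month fix-up is needed.
sub parse_date {
    my $date = shift;
    return unless defined $date;
    my ($day, $month, $year, $hour, $minute, $second) = split m{[/:]}, $date;
    return ($second, $minute, $hour, $day, $MonthToNumber{$month}, $year);
}

# timegm() takes the same argument list as timelocal().
my $begin = timegm(parse_date('15/Jun/2003:13:05:00'));
my $end   = timegm(parse_date('15/Jun/2003:13:20:00'));
print $end - $begin, "\n";    # 900
```

With the four-digit year passed through unchanged there is also no need for the "- 1900" adjustment.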
push (my @visual_page_values, {$processed_visual_pages{$dateproc}});
That "my" is scoped inside an if block and it won't be visible elsewhere. Also, for some reason you're creating a hash reference that has only one element, not a key/value pair.
I notice at the start of your program that you commented out "use strict;" That's a very bad idea. I doubt you will be able to untangle the uses of "my" and global variables until you turn strict back on.
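A minimal sketch of both points, with strict turned on. The contents of %processed_visual_pages are a hypothetical stand-in (timestamp mapped to an array ref of page counts), since the original data structure was not posted in full.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the poster's %processed_visual_pages:
# timestamp => array ref of page counts.
my %processed_visual_pages = (
    '15/Jun/2003:13:05:00' => [ 2 ],
    '15/Jun/2003:13:06:00' => [ 1 ],
);

# Declared outside the loop and the if block, so it is still visible afterwards.
my @visual_page_values;

for my $dateproc (sort keys %processed_visual_pages) {
    if (@{ $processed_visual_pages{$dateproc} }) {
        # Push the array ref itself; wrapping it in { } would build a
        # one-element anonymous hash instead.
        push @visual_page_values, $processed_visual_pages{$dateproc};
    }
}

print scalar(@visual_page_values), " entries collected\n";    # 2 entries collected
```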
Re: Parsing of the web log file, access_log
by dash2 (Hermit) on Jun 20, 2003 at 00:30 UTC
I really think you should consider using the modules other people have mentioned. You are writing your own code to parse the access log, and it looks pretty hairy. Then you are writing your own code to parse dates, and it looks pretty hairy too!
Learning to use well-known modules is a price worth paying.
Of course, you may have a great reason not to use Apache::AccessLog and Date::Manip, but if so, what is it?
andramoiennepemousapolutropon
Thanks for the advice. As I mentioned, the code was unfinished, which is why it looks hairy. I was short of ideas on the interval part of the code, and that's where I needed help!
I didn't see any Apache::AccessLog on the CPAN site. Is it available anywhere else? Or do you mean the ParseLog module?
From your experience, which one is preferable, Date::Manip or Date::Calc?
I would appreciate your valuable advice!
-Andy
From your experience, which one is preferable, Date::Manip or Date::Calc?
Date::Calc is almost certainly preferable, as it is a faster and smaller module. The only drawback is that it is mostly implemented in XS, which means that you have to be able to compile C to install it, but that shouldn't be a problem in most places. Even the author of Date::Manip says in the documentation for that module:
Is Date::Manip the one you should be using? In my opinion, the answer is no about 90% of the time.
Look at the Date::Manip manpage if you want to read the reasons for that statement in full.
/J\
Re: Parsing of the web log file, access_log
by parv (Parson) on Jun 22, 2003 at 08:06 UTC
Given the DATA at the end of the program, repeated below...
127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:05:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:06:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:08:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:15:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:18:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:25:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:35:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:13:50:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:04:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:10:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
127.0.0.1 - - [15/Jun/2003:14:20:00] -0100 "GET /xxxx HTTP/1.1" 200 34906
...output is...
Time                  Total pages  Avg pages  Min pages  Max pages
------------------------------------------------------------------
15/Jun/2003:13:05:00            6        1.2          1          2
15/Jun/2003:13:20:00            2        1.0          1          1
15/Jun/2003:13:35:00            1        1.0          1          1
15/Jun/2003:13:50:00            3        1.3          1          2
15/Jun/2003:14:20:00            1        1.0          1          1
I decided to create hash keys based on the desired interval size. That
way, there is no need to restructure the hash, or to do any other
similar processing, per interval size. It also avoids lugging around an
array ref for each and every time event in the interval in the meantime.
my ($start , $old , %count);
while ( <LOGFILE> ) {
    my ($time , $file) = (split / /)[3,6] or next;
    ...
    $old = (0 == ($time - $start) % $period) ? $time : $old;
    push @{ $count{$old}->{$time} }, 1;
}
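As a rough, self-contained illustration of keying a hash by interval: the sketch below computes the interval start with integer division instead of the running $old above, so it is a variation on the idea and not the posted program. The @hits values are hypothetical epoch seconds, assumed to be in ascending order and never earlier than the first one.

```perl
use strict;
use warnings;

my $period = 15 * 60;    # interval size in seconds

# Hypothetical sample: time of each hit in seconds, relative values for brevity.
my @hits = (0, 0, 60, 300, 600, 780, 900, 1200);

my $start = $hits[0];
my %count;               # interval start => { timestamp => hits at that timestamp }
for my $time (@hits) {
    # Every time inside the same period maps to the same key.
    my $key = $start + $period * int(($time - $start) / $period);
    $count{$key}{$time}++;
}

for my $key (sort { $a <=> $b } keys %count) {
    # Hits per distinct timestamp, sorted so min and max sit at the ends.
    my @per_stamp = sort { $a <=> $b } values %{ $count{$key} };
    my $total = 0;
    $total += $_ for @per_stamp;
    printf "%6d  total=%d  avg=%.1f  min=%d  max=%d\n",
        $key, $total, $total / @per_stamp, $per_stamp[0], $per_stamp[-1];
}
```

With this sample, the first interval holds 6 hits spread over 5 distinct timestamps (average 1.2, minimum 1, maximum 2), and the second holds 2 hits over 2 timestamps.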
Time::CTime::strftime() and Time::ParseDate::parsedate() come from the
Time-modules collection. Now the program (Jun 22 2003 1810: podified and
somewhat restructured)...
$old = (0 == ($time - $start) % $period) ? $time : $old;
...
my ($size , ... ) = (scalar @raw);
...which should have been...
$old += ($time - $old >= $period) ? $period : 0;
...
my ($size , ...);
$size += $_ foreach @raw;
One more try to get it right...
Sorry for the delayed response. Thanks a lot for that code. I was also working on this and managed to get it working.
Before I could clean up the code and post it here, the requirements changed! :-(
The idea is to start processing the log file on the quarter hour. This means that if the first line has a timestamp like 25/Apr/2003:13:54:02, we want to throw away all such records and start processing from a record with the timestamp 25/Apr/2003:14:00:00. Well, the question is, what if we have no such record? What if the timestamp is 14:00:01?
Good point. Then we want to start processing as if the record's timestamp were 14:00:00, and consider an interval of 15 minutes from that timestamp.
Sorry for the change in requirements!
Thanks once again!
Andy
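One way the changed requirement could be sketched, assuming the timestamps have already been converted to epoch seconds and that the time zone offset is a whole multiple of 15 minutes (so quarter-hour boundaries coincide with multiples of 900 epoch seconds). The helper names next_boundary() and interval_start() are made up for this illustration, and timegm()/gmtime are used only to keep the example independent of the local time zone.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my $period = 15 * 60;    # quarter of an hour, in seconds

# Round an epoch value up to the next quarter-hour boundary
# (a value already on a boundary is returned unchanged).
sub next_boundary {
    my $t   = shift;
    my $rem = $t % $period;
    return $rem ? $t + $period - $rem : $t;
}

# Round an epoch value down to the start of the interval containing it.
sub interval_start {
    my $t = shift;
    return $t - $t % $period;
}

my @stamps = (
    timegm( 2, 54, 13, 25, 3, 2003),    # 25/Apr/2003:13:54:02
    timegm( 1,  0, 14, 25, 3, 2003),    # 25/Apr/2003:14:00:01
    timegm(30, 16, 14, 25, 3, 2003),    # 25/Apr/2003:14:16:30
);

my $first = next_boundary($stamps[0]);  # 14:00:00
my %hits;
for my $t (@stamps) {
    next if $t < $first;                # drop records before the first boundary
    $hits{ interval_start($t) }++;      # 14:00:01 is counted under 14:00:00
}

for my $start (sort { $a <=> $b } keys %hits) {
    my ($min, $hour) = (gmtime $start)[1, 2];
    printf "%02d:%02d  %d\n", $hour, $min, $hits{$start};
}
```

Here the 13:54:02 record is discarded, the 14:00:01 record falls into the interval starting at 14:00, and the 14:16:30 record into the one starting at 14:15.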
Re: Parsing of the web log file, access_log
by YuckFoo (Abbot) on Jun 20, 2003 at 22:33 UTC
Andy,
Here is how I would do it.
- convert all times to seconds.
- make all times relative to the base time.
- determine a major key, the fifteen minute interval it's in relative to the base time.
- determine a minor key, the one minute interval it's in relative to the major key.
- save memory by processing each 15 minute interval as it completes, in the while loop.
Hope this gets you on track.
YuckFoo
#!/usr/bin/perl
use strict;
use DateTime;
my $MAJOR_SIZE = 15 * 60;
my $MINOR_NUM = 15;
my $MINOR_SIZE = $MAJOR_SIZE / $MINOR_NUM;
my $BASETIME = 0;
my %ABBREVS;
@ABBREVS{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (1..12);
my ($bucket, $oldmajor);
while (my $line = <DATA>) {
    chomp ($line);
    my (undef, $day, $mon, $year, $hour, $min, $sec) = split(/\W/, $line);
    my $time = DateTime->new(
        year => $year,
        month => $ABBREVS{$mon},
        day => $day,
        hour => $hour,
        minute => $min,
        second => $sec,
    );
    $time = $time->epoch();
    $BASETIME = $BASETIME || $time;
    my $relative = $time - $BASETIME;
    my $major = int($relative / $MAJOR_SIZE);
    my $minor = int(($relative - ($major * $MAJOR_SIZE)) / $MINOR_SIZE);
    if ($major != $oldmajor) {
        if (defined($bucket)) {
            process($bucket);
            $bucket = undef;
        }
    }
    if (!defined($bucket)) {
        $bucket = {};
        $bucket->{major} = $major;
        $bucket->{minors} = [];
    }
    $bucket->{minors}[$minor]++;
    $oldmajor = $major;
    print "$line $time $relative $major $minor\n";
}
if (defined($bucket)) { process($bucket); }
#-----------------------------------------------------------
sub process {
    my ($bucket) = @_;
    my $major = ($bucket->{major} * $MAJOR_SIZE) + $BASETIME;
    print "\nmajor: $major\n";
    for my $i (0..$MINOR_NUM-1) {
        my $minor = ($i * $MINOR_SIZE) + $major;
        print " minor: $minor $bucket->{minors}[$i]\n";
    }
    print "\n";
}
__DATA__
[15/Jun/2003:00:02:27 -0500]
[15/Jun/2003:00:03:44 -0500]
[15/Jun/2003:00:03:44 -0500]
[15/Jun/2003:00:03:44 -0500]
[15/Jun/2003:00:07:28 -0500]
[15/Jun/2003:00:08:44 -0500]
[15/Jun/2003:00:08:45 -0500]
[15/Jun/2003:00:08:45 -0500]
[15/Jun/2003:00:12:28 -0500]
[15/Jun/2003:00:13:45 -0500]
[15/Jun/2003:00:13:45 -0500]
[15/Jun/2003:00:13:46 -0500]
[15/Jun/2003:00:17:29 -0500]
[15/Jun/2003:00:18:46 -0500]
[15/Jun/2003:00:18:46 -0500]
[15/Jun/2003:00:18:47 -0500]
[15/Jun/2003:00:22:29 -0500]
[15/Jun/2003:00:23:47 -0500]
[15/Jun/2003:00:23:47 -0500]
[15/Jun/2003:00:23:48 -0500]
[15/Jun/2003:00:27:30 -0500]
[15/Jun/2003:00:28:48 -0500]
[15/Jun/2003:00:28:48 -0500]
[15/Jun/2003:00:28:49 -0500]
[15/Jun/2003:00:32:30 -0500]
[15/Jun/2003:00:33:49 -0500]
[15/Jun/2003:00:33:49 -0500]
[15/Jun/2003:00:33:49 -0500]
[15/Jun/2003:00:37:31 -0500]
Thanks for the post. Sure, let me try it out! However, I didn't understand why you were defining two interval sizes, 15 minutes and 1 minute. Also, maybe I didn't understand it well, but with this approach, how do I determine the number of identical timestamps? For example, from your data, I could have 2 occurrences of 15/Jun/2003:00:03:44? Maybe some other timestamp has 5 occurrences, and so on?
Regards
Andy
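For the question about identical timestamps, one common approach (a sketch only, independent of the bucket code above) is to count each timestamp string in a hash. The @lines array below is a hypothetical sample copied from the first few DATA lines.

```perl
use strict;
use warnings;

# Hypothetical sample lines in the same format as the DATA section above.
my @lines = (
    '[15/Jun/2003:00:02:27 -0500]',
    '[15/Jun/2003:00:03:44 -0500]',
    '[15/Jun/2003:00:03:44 -0500]',
    '[15/Jun/2003:00:03:44 -0500]',
    '[15/Jun/2003:00:07:28 -0500]',
);

my %seen;    # timestamp string => number of log lines carrying it
for my $line (@lines) {
    # Capture dd/Mon/yyyy:hh:mm:ss; skip lines that do not match.
    my ($stamp) = $line =~ m{\[(\d\d/\w{3}/\d{4}:\d\d:\d\d:\d\d)} or next;
    $seen{$stamp}++;
}

print "$_  $seen{$_}\n" for sort keys %seen;
```

For this sample, 15/Jun/2003:00:03:44 is reported with a count of 3 and the other two timestamps with a count of 1.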