I'm (re)writing a "live" log analyzer (think webserver logs), and I want to keep track of statistical variables like the total number of hits for specific URLs, the average number of hits for reporting periods (per second, per minute, ...) and the number of hits in the last reporting period (second, minute, ...).
What I'm looking for is a good algorithm that avoids having to store and lug around all the log data. My current approach is to keep a queue (a Perl array) of counters, one per minimal reporting interval. These per-interval counters then get rolled up into the coarser reporting intervals:
use List::Util qw(reduce);    # note: List::Util, not List::Utils

my %hits  = map { $_ => [0] } qw(google msn spam);
my $start = time;
my $max_queue_size = 60;      # our window is 60 seconds

while (defined( my $line = <$log> )) {
    # process log
    my $who = get_hit($line);
    if ($who) {
        my $now    = time;
        my $bucket = $now - $start;      # index of the current second

        # Keep our queues at the appropriate length: drop buckets that
        # fell out of the window and advance the window start to match
        if ($bucket >= $max_queue_size) {
            my $excess = $bucket - $max_queue_size + 1;
            splice @{ $hits{$_} }, 0, $excess for keys %hits;
            $start  += $excess;
            $bucket -= $excess;
        }

        $hits{$who}[$bucket]++;          # count a hit
    }

    # Aggregate the buckets using reduce(), respectively update the
    # aggregates by adding the newest bucket and subtracting the bucket
    # that just fell out of the reporting window
}
This approach means that I'll have to keep 86400 array slots per hit candidate ("google", "msn", ...) if I want statistics for a day, which I consider a bit unsatisfactory. I could save quite some memory by using pack and unpack with the "c" (one byte per counter) template on a bytestring, as I expect the number of hits per second to stay well below 100. Using a sparse list seems imprudent, as I expect a hit at least every two seconds, but maybe I'm thinking of the wrong kind of sparse list.
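Roughly what I have in mind for the packed variant, as a sketch (count_hit and window_total are made-up names, and the bucket arithmetic is simplified):

# Sketch only: one signed byte per second of the day (86400 bytes per
# candidate), assuming the per-second counts always fit into a signed byte.
my %packed = map { $_ => "\0" x 86_400 } qw(google msn spam);

sub count_hit {
    my ($who, $second_of_day) = @_;
    my $n = unpack 'c', substr $packed{$who}, $second_of_day, 1;
    substr( $packed{$who}, $second_of_day, 1 ) = pack 'c', $n + 1;
}

sub window_total {
    my ($who, $from, $to) = @_;          # inclusive range of seconds
    my $total = 0;
    $total += $_ for unpack 'c*', substr $packed{$who}, $from, $to - $from + 1;
    return $total;
}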
Calculating the aggregates themselves is easy, as I just have to add the newest bucket to the running total and subtract the bucket that just fell out of the window, choosing those two buckets according to the reporting interval size.
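As a sketch of that update (update_running_totals, %interval and %running are made-up names; $buckets is one candidate's per-second queue from above):

# Sketch: one running total per reporting interval length, updated
# incrementally whenever a new per-second bucket has completed.
my %interval = ( second => 1, minute => 60 );   # interval length in seconds
my %running  = map { $_ => 0 } keys %interval;

sub update_running_totals {
    my ($buckets) = @_;     # array ref of per-second counters, newest last
    for my $name (keys %interval) {
        my $len = $interval{$name};
        $running{$name} += $buckets->[-1] // 0;          # newest bucket
        $running{$name} -= $buckets->[ -1 - $len ] // 0
            if @$buckets > $len;                         # expired bucket
    }
}

# $running{minute} is then the hit count of the last 60 seconds, and
# $running{minute} / 60 the average number of hits per second in that window.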
But maybe there is a less memory- and thought-intensive approach. The per-second resolution becomes uninteresting after 5 minutes have passed, so maybe I could keep the per-second data around for only 5 minutes and then use per-minute data to generate the coarser-grained averages. But I don't know of any convenient prior art, which is why I turn to all of you.
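Something like this two-tier arrangement is what I'm imagining, as a rough sketch (roll_up_minute is a made-up name, to be called once per minute):

# Sketch: keep 300 per-second buckets (5 minutes); once a full minute of
# per-second data has aged beyond that horizon, collapse it into a single
# per-minute bucket and keep those around for a day.
my @per_second = (0) x 300;        # 5 minutes at per-second resolution
my @per_minute;                    # one counter per minute, newest last
my $minutes_to_keep = 24 * 60;     # a day of per-minute data

sub roll_up_minute {
    # remove the oldest 60 per-second buckets and sum them up
    my @oldest = splice @per_second, 0, 60;
    my $sum = 0;
    $sum += $_ for @oldest;
    push @per_second, (0) x 60;    # make room for the next minute
    push @per_minute, $sum;
    shift @per_minute if @per_minute > $minutes_to_keep;
}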