I'm (re)writing a "live" log analyzer (think webserver logs), and I want to keep track of statistics such as the total number of hits for specific URLs, the average number of hits per reporting period (second, minute, ...), and the number of hits in the most recent reporting period (second, minute, ...).

What I'm looking for is a good algorithm that avoids storing and lugging around all the raw log data. My current approach is to keep a queue (a Perl array) of counters, one per minimal reporting interval. These per-interval counters then get rolled up into the coarser aggregates:

use List::Util qw(reduce);

my %hits  = map { $_ => [0] } qw(google msn spam);
my $start = time;
my $max_queue_size = 60;    # our window is 60 seconds

while (defined my $line = <$log>) {
    # process log
    my $who = get_hit($line);
    if ($who) {
        my $now    = time;
        my $bucket = ($now - $start) % $max_queue_size;

        # Keep our queue at the appropriate length
        my $queue_length = @{ $hits{$who} };
        if ($queue_length > $max_queue_size) {
            splice @{ $hits{$who} }, 0, $queue_length - $max_queue_size;
        }

        $hits{$who}->[$bucket]++;    # count a hit
    }

    # Aggregate the buckets using reduce(), respectively update the
    # aggregates by adding the newest bucket and subtracting the bucket
    # that just fell out of the reporting window
}
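For concreteness, the aggregation step the final comment alludes to might look like this (just a sketch that recomputes from scratch rather than updating incrementally):

# Recompute one candidate's window total from its per-second buckets
my $total = reduce { $a + $b } @{ $hits{'google'} };
my $avg   = $total / $max_queue_size;   # average hits/second over the window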

This approach means that I'll have to keep 86400 array slots per hit candidate ("google", "msn", ...) if I want statistics for a whole day. I consider that a bit unsatisfactory. I could save quite a bit of memory by using pack and unpack with the "c" template on a byte string, as I expect the number of hits per second to stay well below 100. Using a sparse list seems imprudent, as I expect a hit at least every two seconds, but maybe I'm thinking of the wrong kind of sparse list.
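A minimal sketch of that pack/unpack idea, assuming the counts really do fit into a signed byte (bump() and $counters are made-up names):

# One signed byte per second: 86400 bytes cover a whole day
my $counters = "\0" x 86400;

sub bump {
    my ($counters_ref, $slot) = @_;
    my $count = unpack 'c', substr($$counters_ref, $slot, 1);
    $count++ if $count < 127;    # clamp, though we expect < 100
    substr($$counters_ref, $slot, 1) = pack 'c', $count;
}

bump(\$counters, time % 86400);  # count a hit in the current second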

Calculating the aggregates themselves is easy: I just subtract the bucket that fell out of the window from the running total and add the newest bucket, picking those two buckets according to the reporting interval size.
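As a sketch of that update (with made-up names @window and $running_sum):

# Sliding-window total: drop the oldest bucket, add the newest one.
my @window      = (0) x 60;    # per-second buckets, 60-second window
my $running_sum = 0;

sub advance_window {
    my ($new_count) = @_;
    $running_sum += $new_count - shift @window;
    push @window, $new_count;
    return $running_sum / @window;    # average hits per second
}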

But maybe there is a less memory- and thought-intensive approach. The per-second resolution becomes uninteresting after 5 minutes have passed, so maybe I could keep the per-second data around for 5 minutes and then use per-minute data to generate the coarser-grained averages. But I don't know of any prior art here, which is why I turn to all of you.
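A rough sketch of what I have in mind (all names made up): per-second buckets for roughly the last 5 minutes, with the oldest full minute rolled up into a per-minute tier.

use List::Util qw(sum);

my @seconds;    # newest bucket last; at most ~300 entries (5 minutes)
my @minutes;    # one entry per minute; at most 1440 entries (one day)

# Called once per second with that second's hit count
sub add_second {
    my ($count) = @_;
    push @seconds, $count;
    if (@seconds > 300) {
        # Collapse the oldest 60 per-second buckets into one minute
        push @minutes, sum(splice @seconds, 0, 60);
        shift @minutes if @minutes > 1440;    # keep at most one day
    }
}

Averages for a requested period would then come from whichever tier is still fine-grained enough.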

