Corion has asked for the wisdom of the Perl Monks concerning the following question:

I'm (re)writing a "live" log analyzer (think webserver logs), and I want to keep track of statistical variables like the total number of hits for specific URLs, the average number of hits for reporting periods (per second, per minute, ...) and the number of hits in the last reporting period (second, minute, ...).

What I'm looking for is a good algorithm that avoids having to store and lug around all the log data. My current approach is to keep a queue (a Perl array) of counters, one per minimal reporting interval. These per-interval counters then get aggregated up into the higher-level aggregates:

use List::Util qw(reduce);

my %hits  = map { ($_ => [0]) } qw(google msn spam);
my $start = time;
my $max_queue_size = 60;    # our window is 60 seconds

while (defined my $line = <$log>) {
    # process log
    my $who = get_hit($line);
    if ($who) {
        my $now    = time;
        my $bucket = $max_queue_size - ($now - $start);

        # Keep our queue at the appropriate length
        my $queue_length = @{ $hits{$who} };
        if ($queue_length > $max_queue_size) {
            splice @{ $hits{$who} }, 0, $queue_length - $max_queue_size;
        }
        $hits{$who}->[$bucket]++;    # count a hit
    }

    # Aggregate the buckets using reduce()
    # respectively update the aggregates by adding the first item
    # and subtracting the item that just fell out of the reporting window
}

This approach means that I'll have to keep 86400 array slots per hit candidate ("google", "msn", ...) if I want statistics for a day. I consider that a bit unsatisfactory. I could save quite some memory by using pack and unpack with c on a bytestring, as I expect the number of hits per second to be well below 100. Using a sparse list seems imprudent as I expect a hit at least every two seconds, but maybe I'm thinking of the wrong kind of sparse list.
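The bytestring idea could look roughly like this (a sketch only; the helper name is made up and the counts are assumed to stay below 256):

    # One unsigned byte ('C') per second: 86400 bytes instead of 86400 array slots.
    my $window = 86_400;
    my $counts = "\0" x $window;

    sub bump_bucket {
        my ($counts_ref, $slot) = @_;
        my $c = unpack 'C', substr($$counts_ref, $slot, 1);
        substr($$counts_ref, $slot, 1) = pack 'C', $c + 1 if $c < 255;   # saturate at 255
    }

    # e.g. bump_bucket(\$counts, $now - $start);
    # summing a sub-range later:
    # my $sum = 0; $sum += $_ for unpack 'C*', substr($counts, $from, 60);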

Calculating the aggregates themselves is easy, as I just have to subtract the last bucket and add the first bucket to the current running average, choosing the two buckets according to the reporting interval size.
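A minimal sketch of that rolling update (the window size and names are placeholders):

    # Keep a running sum for the window instead of re-summing all buckets:
    # when a new bucket enters, add it; subtract the one that just fell out.
    my $window  = 60;
    my @buckets = (0) x $window;   # per-second counts, oldest first
    my $running = 0;               # sum over the current window

    sub slide_window {
        my ($new_count) = @_;
        $running += $new_count;        # newest bucket enters the window
        $running -= shift @buckets;    # oldest bucket leaves it
        push @buckets, $new_count;
        return $running / $window;     # current per-second average over the window
    }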

But maybe there is a less memory- and thought-intensive approach. The per-second resolution gets uninteresting after 5 minutes have passed, so maybe I could keep the per-second data around for 5 minutes and then use the per-minute data to generate the averages of higher granularity. But I don't have any convenient idea of prior art, which is why I turn to all of you.
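One way to write down that idea, purely as a sketch (the sizes and names are made up):

    # Tiered resolution: keep 300 one-second buckets and 1440 one-minute buckets.
    # Once more than 5 minutes of seconds accumulate, the oldest 60 of them are
    # collapsed into a single one-minute bucket.
    my @seconds;    # newest last, at most 300 entries
    my @minutes;    # newest last, at most 1440 entries

    sub add_second {
        my ($count) = @_;
        push @seconds, $count;
        if (@seconds > 300) {
            my $minute = 0;
            $minute += shift @seconds for 1 .. 60;   # collapse the oldest minute
            push @minutes, $minute;
            shift @minutes if @minutes > 1440;       # keep one day of minutes
        }
    }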

Replies are listed 'Best First'.
Re: Data structure for statistics
by JavaFan (Canon) on Oct 05, 2008 at 13:53 UTC
Re: Data structure for statistics
by Tanktalus (Canon) on Oct 05, 2008 at 20:03 UTC

    I'm pretty sure you're not going to go with my solution, but having just produced some live stats, I approached it a bit differently. I put the raw data into a database, and then, in another process, scanned the database for NULL fields, and calculated them. Then I just used some SQL to query counts, grouped by interesting fields as appropriate. Right before the query, I would DELETE FROM LOGS WHERE STAMP < CURRENT TIMESTAMP - 7 DAYS, to keep the logs circular.

    This meant that I figured out what was important on each line only once, and then pulled all the numbers from the db as many times as needed (~168 times - once an hour for seven days before that line got deleted).

    Performance-wise, mojotoad's approach (which seemed to parse the entire set of data each time) took about 3 seconds to produce, whereas mine is steadily hovering between 0.3 and 0.7 seconds - most of the time, around 0.4 seconds. Basically, an order of magnitude greater speed (most of the hard work is done outside the generation). Mind you, I may also be using an order of magnitude more memory, too, hard to tell ;-) though I do have more stats than he did.
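    In rough outline, that flow could look something like the following sketch (DBD::SQLite standing in for whatever database was actually used, and the table layout is invented):

        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=hits.db', '', '', { RaiseError => 1 });
        $dbh->do('CREATE TABLE IF NOT EXISTS logs (who TEXT, stamp INTEGER)');

        # one row per hit, inserted as the log is read
        my $insert = $dbh->prepare('INSERT INTO logs (who, stamp) VALUES (?, ?)');
        $insert->execute('google', time);

        # counts per candidate for the last hour, computed on demand
        my $counts = $dbh->selectall_arrayref(
            'SELECT who, COUNT(*) FROM logs WHERE stamp > ? GROUP BY who',
            undef, time - 3_600,
        );

        # keep the table circular: drop everything older than 7 days
        $dbh->do('DELETE FROM logs WHERE stamp < ?', undef, time - 7 * 86_400);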

      I've considered the approach of using an SQLite database as well. It makes formulating the statistics I want to collect much easier, and of course extending it is just a matter of coming up with the right SQL instead of having to muck around with the counters manually. But what has kept me away from this so far is that I shy away from storing all the data in memory, and that more or less a full table scan would need to be done every second. Of course, I could cheat here and only update the seconds-resolution statistics every second and update (say) the minute-resolution statistics every five seconds...

      It seems that I'll have to benchmark SQLite and its memory/disk requirements and compare them to the bytestring version.

Re: Data structure for statistics
by GrandFather (Saint) on Oct 05, 2008 at 19:59 UTC

    What do you actually want to report? Do you want to be able to view a graph of hits per second for every second of a day, or just for the last N seconds? Ditto minutes, hours, days, weeks, ...?

    For edification (and perhaps test) purposes does it make sense to write a test log generator? Can you show us a sample of the output you want to generate? Am I really asking Corion these sorts of questions? :-D


    Perl reduces RSI - it saves typing

      Heh - yes, an example will surely help. In the long run, I want to display fancy graphs of things, possibly through SVG, PNG or OpenGL, but for the start, I want simple textual output, say, like the following:

                 avg/s   cur/s  avg/min  cur/min    avg/h    cur/h  avg/day  cur/day
      google:        6      10      360      500    21600    43200      ...      ...
      yahoo:         2       1      120      ...       ..       ..      ...      ...
      ...

      Here, avg/s stands for the average number of hits per second across the whole reporting period, while cur/s stands for the current number of hits per second (or rather, the number of hits in the last second). avg/min stands for the average number of hits per minute, and cur/min for the current number of hits per minute, that is, the number of hits in the last 60 seconds. For /h it's the hour, and for /day it's the day. For applications where periods longer than 7 days are interesting, a 10 second or 1 minute resolution will be sufficient.

      Currently, I'm still looking for a "better" approach than storing all the hits for all the seconds in my time windows, but so far haven't found one.

      Update: Clarified averages and current.

        Something like:

        use strict;
        use warnings;

        my %urls = (
            google => {min => 0,  max => 20},
            yahoo  => {min => -3, max => 10},
            msn    => {min => -5, max => 5},
        );
        my @units = (
            {unit => 's',   tipAt => 60, updateAt => 20},
            {unit => 'min', tipAt => 60, updateAt => 1},
            {unit => 'h',   tipAt => 24, updateAt => 1},
            {unit => 'day', tipAt => 0,  updateAt => 0},
        );
        my %stats;

        buildStructure (\%stats, \%urls, @units);

        srand (1);
        for my $second (1 .. 24 * 60 * 60) {
            my %hits = genHits (%urls);

            for my $url (keys %hits) {
                my $hit         = $hits{$url};
                my $bump        = 1;
                my $updateHit   = 0;
                my $updateCount = 0;
                my $urlStats    = $stats{$url} ||= {};

                $urlStats->{total} += $hit;
                for my $unit (@units) {
                    my $unitStats = $urlStats->{$unit->{unit}};

                    if ($updateCount) {
                        $unitStats->{extraHits}  = $updateHit;
                        $unitStats->{extraCount} = $updateCount;
                    }

                    last unless $bump;
                    $bump = 0;

                    $unitStats->{lhits} ||= 0;
                    $hit = $unitStats->{hits} += $hit;
                    ++$unitStats->{count};

                    if ($unit->{updateAt} and ! ($unitStats->{count} % $unit->{updateAt})) {
                        $updateHit   = $unitStats->{hits};
                        $updateCount = $unitStats->{count} / $unit->{tipAt};
                    }

                    next if ! $unit->{tipAt} or $unitStats->{count} < $unit->{tipAt};

                    # Tipping time
                    $unitStats->{lhits}      = $unitStats->{hits};
                    $unitStats->{hits}       = 0;
                    $unitStats->{count}      = 0;
                    $unitStats->{extraHits}  = 0;
                    $unitStats->{extraCount} = 0;
                    $bump = 1;
                }
            }

            next if int rand (1000);

            # Gen header
            printf "%-10s" . ((' %7s %7s') x @units) . "\n", 'URL',
                map {("avg/$_->{unit}", "cur/$_->{unit}")} @units;

            # Show stats
            for my $url (keys %hits) {
                my $urlStats = $stats{$url};
                my $total    = $urlStats->{total};
                my $scale    = 1;

                printf "%-10s", $url;
                for my $unit (@units) {
                    my $unitStats = $urlStats->{$unit->{unit}};
                    my $avg = $total / $second * $scale;
                    my $cur = $unitStats->{hits} + $unitStats->{lhits} + $unitStats->{extraHits};
                    my $curScale = $unitStats->{count} + $unitStats->{extraCount};

                    $scale *= $unit->{tipAt} if $unit->{tipAt};
                    $curScale += $unit->{tipAt} if $second > $scale;
                    $cur /= $curScale if $curScale;
                    printf ' %7d %7d', $avg, $cur;
                }
                print "\n";
            }
        }

        exit;

        sub buildStructure {
            my ($stats, $urls, @units) = @_;

            for my $url (keys %urls) {
                my $urlStats = $stats->{$url} = {};
                for my $unit (@units) {
                    my $unitStats = $urlStats->{$unit->{unit}} = {};
                    $unitStats->{$_} ||= 0 for qw(count hits lhits extraCount extraHits);
                }
            }
        }

        sub genHits {
            my (%urls) = @_;
            my %hits;

            for my $url (keys %urls) {
                my $hits = rand ($urls{$url}{max} - $urls{$url}{min}) + $urls{$url}{min};
                $hits = 0 if $hits < 0;
                $hits{$url} = int $hits;
            }
            return %hits;
        }

        Prints:

        ...
        URL          avg/s   cur/s avg/min cur/min   avg/h   cur/h avg/day cur/day
        google           9       9     573     571   34407   34409  825779  826904
        yahoo            3       3     208     210   12503   12502  300079  299128
        msn              1       0      60      59    3617    3617   86809   86696
        URL          avg/s   cur/s avg/min cur/min   avg/h   cur/h avg/day cur/day
        google           9       8     572     569   34365   34384  824771  826904
        yahoo            3       3     208     211   12524   12517  300588  299128
        msn              1       0      60      58    3601    3603   86427   86696
        ...

        Perl reduces RSI - it saves typing

        I presume avg is over some modest window time and that cur is an average over a narrow window time. How do those times relate to units (s, min, h ...)?


        Perl reduces RSI - it saves typing
Re: Data structure for statistics
by NetWallah (Canon) on Oct 06, 2008 at 04:43 UTC
    I'd like to second JavaFan's rrdtool suggestion, only take it one step further - cacti includes rrdtool, and provides graphing functionality. Once you set it up, it pretty much self regulates - that is the advantage of rrd (round-robin database), and provides graphs of varying resolutions - fine for current data, and more aggregated for older data.

    Cacti is cross-platform, and free. I have no association with the authors, but have used it, and am a fan.
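    In case it helps, the round-robin idea maps pretty directly onto the rrdtool Perl bindings; a rough sketch (file name, step and RRA choices are just examples, not a recommendation):

        use RRDs;

        # One data source; fine resolution kept briefly, coarser averages kept longer.
        RRDs::create(
            'hits.rrd', '--step', '1',
            'DS:hits:ABSOLUTE:2:0:U',
            'RRA:AVERAGE:0.5:1:300',      # 1-second averages for 5 minutes
            'RRA:AVERAGE:0.5:60:1440',    # 1-minute averages for a day
            'RRA:AVERAGE:0.5:3600:168',   # 1-hour averages for a week
        );
        my $err = RRDs::error;
        warn "rrd create failed: $err" if $err;

        # one update per second with the number of hits just seen
        RRDs::update('hits.rrd', 'N:5');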

         Have you been high today? I see the nuns are gay! My brother yelled to me...I love you inside Ed - Benny Lava, by Buffalax

      Instead of cacti, which is written in PHP, I'd go for Munin, which is written in Perl.
Re: Data structure for statistics
by bart (Canon) on Oct 06, 2008 at 10:13 UTC
    Just my 2 cents:
    I could save quite some memory by using pack and unpack with c on a bytestring, as I expect the number of hits per second to be well below 100.

    Well, if you plan on doing it in plain Perl... you can use larger buckets than pack's 'C' (not 'c', there is no reason for it to be signed), for example a short 'v' or a long 'V', or their big-endian 'n'/'N' counterparts.

    And don't forget about vec, in case you want to read/update individual slots.
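    For example, a minimal sketch (the bucket index and the 8-bit slot width are just assumptions matching the "counts stay below 256" case):

        # Treat a byte string as an array of 8-bit counters via vec.
        my $buckets = "\0" x 86_400;                      # one byte per second of a day
        my $i = 12_345;                                   # some bucket index
        vec($buckets, $i, 8) = vec($buckets, $i, 8) + 1;  # bump one slot in place
        my $count = vec($buckets, $i, 8);                 # read it back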

    If you don't care about older data except when processing it in bulk, then note that pack also supports BER compressed integers ('w'), which makes byte strings of minimal length, storing up to 7 bits of useful data per byte. Unfortunately it makes looking up an individual number by array index a bit hard.
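    A quick sketch of the round trip (the values are just illustrative):

        # BER-compressed integers: small counts take a single byte each.
        my @counts = (3, 0, 17, 130, 2);
        my $packed = pack 'w*', @counts;     # variable-length encoding
        my @again  = unpack 'w*', $packed;   # back to (3, 0, 17, 130, 2)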

    And finally, you can abuse UTF8 and store this array of integers as a string of characters, which can easily hold values larger than 255 per character!
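    Something along these lines (a sketch; the values are made up):

        # Store counters as characters: code points above 255 simply upgrade
        # the string internally, so a "slot" is not limited to one byte.
        my @counts = (12, 300, 70_000);
        my $str    = join '', map { chr } @counts;
        my @back   = map { ord } split //, $str;   # (12, 300, 70000)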