matija has asked for the wisdom of the Perl Monks concerning the following question:

Given a log file with a series of values (message size, say), I need to calculate the average size over the last (some period) - this stuff is eventualy going to be shown as an mrtg graph.

I need to follow this stuff for various values, and for various periods of time, so I created the following code:

sub average_over_time { # # $ages[x,0] = time of article arrival # $ages[x,1] = age of article; # articles are thrown out of the pipe if they've arrived more than $ma +xage # ago. # my (@ages,$sumages); my $maxage=shift @_; return sub { my $limit=time-$maxage; # Squeeze out events that have timed out while (($#ages>-1) && ($ages[0][0]<$limit)) { $sumages-=$ages[0][1]; shift @ages; } if (defined($_[0])) { push(@ages,[time,$_[0]]); $sumages+=$_[0]; } $sumages=0 if $#ages == -1; return ($#ages>-1)?int($sumages/($#ages+1)):0; } } <code> That gives me the ability to say:<code> my $mail5=average_over_time(300); my $mail60=average_over_time(3600);
And later:
$db{"time-mail5"}=&$mail5($time); $db{"time-mail60"}=&$mail60($time);
I can also get out the average without putting in a value by calling it with $average=&$mail60(undef);.

Unfortunately, if you have a lot of events during the time period over which you're calculating, the array can get... a bit too large.

So I'm wondering, is there a method that I'm missing? Some way to calculate a running average over a given time period without having to keep all the values of that period in memory? And if at all possible without eating all of the CPU needed by the processes I'm monitoring?

Replies are listed 'Best First'.
Re: Average over time
by kvale (Monsignor) on Mar 24, 2004 at 17:09 UTC
    If you want to create a uniform moving average with all elements equal weight, I do not see how you cannot keep all the information to compute the average -- it needs to be there by definition.

    But if you are willing to relax the uniformity requirement, it is easily to do something real time. Consider the recursive estimate for the time between emails

    avg( i ) = (1-k)*(t[i] - t[i-1]) + k*avg( i-1 ) k < 1
    where t[i]-t[i-1] is the time interval between the current and previous email and k is a wieghting coefficient. This expression only needs the most recent interval and average to compute the current average.

    For k < 1, the previous elements will be exponentially weighted k**n for avg(i-n) and k is adjusted to vary the aproximate number of events averaged in.

    From the average interval between emails, you can easily derive the emails per time:

    rate[i] = 1/avg[i];
    Update: corrected first equation.

    -Mark

Re: Average over time
by hawtin (Prior) on Mar 25, 2004 at 01:51 UTC

    There are all sorts of reasons why averaging over a time period is a bad way to achieve what you want. The fact you have no option but to keep all the values is just one issue. There are also problems of aliasing and delayed estimates.

    As kvale said some kind of decaying average would let you work on summary data rather than keeping records.

    Have a look at this and this for a more rigerous explination of the issues.

    I think what you want to do is have a "Single Exponentially Smoothed" estimate of the average. As I recall this means you should estimate the value for each time period as:

    $a_factor = 0.1 # You have to play with different # values of the smoothing factor to # match the data you have $new_avg = $this_period_avg*$a_factor + $prev_avg * (1 - $a_factor);

    An $a_factor value of 0.1 is equivalent to a moving average window of something like 20 time slots. So this lets you do the analysis while only maininging 5% of the data (I am sure that some other more expert brother will be able to correct this obvious simplification).

Re: Average over time
by Abigail-II (Bishop) on Mar 24, 2004 at 16:05 UTC
    So I'm wondering, is there a method that I'm missing? Some way to calculate a running average over a given time period without having to keep all the values of that period in memory?
    Aren't you already keeping a running average? If not, then what's the code following Squeeze out events that have timed out doing?

    Abigail

      What I posted is one method of keeping a running average over the last (timeperiod). But it requires that I keep all the events in memory until they age out.

      I'm looking for a method that would dispense with that requirement. There may not be one for all I know, but I thought it was worth asking the other monks for ideas.

        If you needed to, you could aggregate individual entries into small blocks of time. For example, 5 second intervals instead of 1 second intervals. At the very least, if you've got multiple entries for any given second (since that seems to be the finest resolution you look at), you would want to store only the average for that second.

        The PerlMonk tr/// Advocate