Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
This is more an algorithm question than it is a perl question though I will implement it in perl. Also, I can't describe what I really need to do nor provide any data due impediments beyond my control. I will use an analogy that mostly works but my apologies in advance for when it doesn't.

Scenario

Yesterday, there were a large number of customer complaints concerning really slow response times from our corporate website. This is the 5th time this month and we can no longer get away with "The problem was too transient to identify and has resolved itself". We have a pretty good understanding of typical requests and the time to serve them (min, max, ave, standard deviation, etc). We also know the primary causes of the occassional outliers - for instance, a non-typical request that just takes longer to service. Looking at the logs around the time of the customer complaints, we notice that there are significant more outliers than there should be statistically. What's more odd is that they don't seem to fit into any of the known causes for outliers. Looking at the other complaints over the month, we notice the same clustering of outliers. We have to get to the bottom of this quickly. There is already a team set up to study the requests to determine if they have something in common. Our job is to identify all of these clusters knowing that the problem goes back further than just this month and provide the data to the other team.

Plan

We need to efficiently process a massive amount of logs. Since it is another teams responsibility to find a cause, we only need to worry about two columns - timestamp and response time. We can ignore response times that are within our accepted performance model. We can even ignore the ocassional outlier. What we need to do is identify the start and end time of these clusters and send everything in the logs between those two time stamps over to the other team. The problem is, we don't really know how to define a cluster though we can definately recognize one once we see it. They seem to be of different durations and magnitudes.

Cluster?

We know if we take take a unit of time that is too large, we might miss a cluster. For instance, if we use 10 minute windows a short duration cluster in the middle may average out over the 10 minutes. We know the same is true for too small a unit of time. For instance, we may expect any given second to be at the high end of our bell curve and ignore it but having a solid minute of those is definately a cluster. It seems like we need to scan the same area using different time units which will take too long given the amount of data that needs to be processed.

What to do, what to do?

Disclaimer: While this is a real problem, I am not really as pressed for time as the mock scenario would imply. The data also isn't really that massive and a multi-pass approach is what I will do if I don't come up with better options. Finally, I am actually working on both teams. I presented it the way I did to try and solicit ideas I haven't already thought of.

Cheers - L~R

Replies are listed 'Best First'.
Re: Identifying Outlier Clusters
by BrowserUk (Patriarch) on Feb 16, 2010 at 23:09 UTC

    It quite easy, and relatively efficient to calculate multiple-period moving averages concurrently, requiring just a single pass over the data:

    #! perl -slw use strict; use List::Util qw[ sum ]; our $LOW //= 10; our $HIGH //= 100; our $LIMIT //= 0.85; my %seqs = map{ $_ => [] } $LOW .. $HIGH; for my $i ( 1 .. 1e6 ) { my $value = rand 1; for my $period ( $LOW .. $HIGH ) { if( @{ $seqs{ $period } } < $period ) { push @{ $seqs{ $period } }, $value; } else { shift @{ $seqs{ $period } }; push @{ $seqs{ $period } }, $value; my $ave = sum( @{ $seqs{ $period } } ) / @{ $seqs{ $period + } }; if( $ave > $LIMIT ) { printf "Possible $period period cluster (ave:%.3f)". "at %d through %d\n", $ave, $i - $period, $i; } } } } __END__ C:\test>junk81.pl Possible 10 period cluster (ave:0.869)at 16470 through 16480 Possible 10 period cluster (ave:0.855)at 16471 through 16481 Possible 11 period cluster (ave:0.858)at 16470 through 16481 Possible 10 period cluster (ave:0.857)at 101204 through 101214 Possible 10 period cluster (ave:0.852)at 101205 through 101215 Possible 10 period cluster (ave:0.856)at 146170 through 146180 Possible 10 period cluster (ave:0.866)at 211311 through 211321 Possible 10 period cluster (ave:0.864)at 323774 through 323784 Possible 10 period cluster (ave:0.868)at 442882 through 442892 Possible 12 period cluster (ave:0.850)at 452199 through 452211 Possible 12 period cluster (ave:0.851)at 452200 through 452212 Possible 13 period cluster (ave:0.856)at 452199 through 452212 Possible 10 period cluster (ave:0.852)at 557989 through 557999 Possible 10 period cluster (ave:0.863)at 700401 through 700411 Possible 10 period cluster (ave:0.858)at 759150 through 759160 Possible 10 period cluster (ave:0.862)at 759151 through 759161 Possible 11 period cluster (ave:0.864)at 759150 through 759161 Possible 10 period cluster (ave:0.853)at 759152 through 759162 Possible 11 period cluster (ave:0.866)at 759151 through 759162 Possible 12 period cluster (ave:0.868)at 759150 through 759162 Possible 13 period cluster (ave:0.853)at 759149 through 759162 Possible 13 period cluster (ave:0.850)at 759151 through 759164 Possible 14 period cluster (ave:0.852)at 759150 through 759164 Possible 10 period cluster (ave:0.866)at 786191 through 786201 Possible 10 period cluster (ave:0.851)at 805682 through 805692 Possible 10 period cluster (ave:0.864)at 965272 through 965282 Possible 10 period cluster (ave:0.853)at 992279 through 992289 Possible 10 period cluster (ave:0.886)at 992281 through 992291

    That took just about 4 minutes to calculate 90 different moving averages over a million data points. There's no IO invloved, but as it only requires a single pass, that's an unavoidable constant anyway.

    You can obviously do less and non-consecutive periods.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Identifying Outlier Clusters
by GrandFather (Saint) on Feb 16, 2010 at 22:29 UTC

    Would calculating a moving standard deviation over the data and looking for a significant change in the SD do the trick?


    True laziness is hard work
Re: Identifying Outlier Clusters
by scorpio17 (Canon) on Feb 17, 2010 at 14:26 UTC

    I saved this link a long time ago:

    A New Visualization for Web Server Logs

    I'm not sure if it will help in your case, but it might give you a way to see otherwise non-obvious relationships within the log data.