Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
Yesterday, there were a large number of customer complaints about really slow response times from our corporate website. This is the fifth time this month, and we can no longer get away with "The problem was too transient to identify and has resolved itself". We have a pretty good understanding of typical requests and the time to serve them (min, max, average, standard deviation, etc.). We also know the primary causes of the occasional outliers - for instance, a non-typical request that simply takes longer to service. Looking at the logs around the time of the customer complaints, we notice significantly more outliers than there should be statistically. What's more odd is that they don't seem to fit any of the known causes for outliers. Looking at the other complaints over the month, we notice the same clustering of outliers. We have to get to the bottom of this quickly. A team has already been set up to study the requests and determine whether they have something in common. Our job is to identify all of these clusters, knowing that the problem goes back further than just this month, and provide the data to the other team.
We need to efficiently process a massive amount of logs. Since it is another team's responsibility to find a cause, we only need to worry about two columns - timestamp and response time. We can ignore response times that are within our accepted performance model. We can even ignore the occasional outlier. What we need to do is identify the start and end time of these clusters and send everything in the logs between those two timestamps over to the other team. The problem is, we don't really know how to define a cluster, though we can definitely recognize one once we see it. They seem to be of different durations and magnitudes.
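To make the shape of the problem concrete, here is one possible single-pass sketch that defines a cluster purely by the gap between consecutive outliers, rather than by fixed windows. The line format ("epoch_seconds response_ms") and every threshold here are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# All thresholds are hypothetical - tune them against your performance model
my $OUTLIER_MS = 500;   # anything slower than this is an outlier
my $MAX_GAP    = 60;    # seconds; outliers closer together than this share a cluster
my $MIN_COUNT  = 5;     # fewer outliers than this is just the occasional blip

my ( $start, $last, $count );

while (<>) {
    my ( $ts, $ms ) = split;            # assumes "epoch_seconds response_ms" lines
    next unless defined $ms && $ms > $OUTLIER_MS;

    if ( defined $last && $ts - $last <= $MAX_GAP ) {
        $last = $ts;                    # still inside the current cluster
        ++$count;
    }
    else {
        print "$start\t$last\n" if defined $count && $count >= $MIN_COUNT;
        ( $start, $last, $count ) = ( $ts, $ts, 1 );   # open a new cluster
    }
}
print "$start\t$last\n" if defined $count && $count >= $MIN_COUNT;
```

The appeal is that it is one pass and stream-friendly; the catch is that $MAX_GAP is really just the windowing problem in disguise.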
We know that if we take a unit of time that is too large, we might miss a cluster. For instance, if we use 10-minute windows, a short-duration cluster in the middle may average out over the 10 minutes. The same is true for too small a unit of time. For instance, we may expect any given second to be at the high end of our bell curve and ignore it, but a solid minute of those is definitely a cluster. It seems like we need to scan the same area using different time units, which will take too long given the amount of data that needs to be processed.
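One idea for scanning at several time units without rescanning the data: bucket every outlier at each scale during a single pass, then flag any window whose outlier count exceeds what the model allows. Again, the line format, scales, and per-window limits below are placeholders, not real numbers:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $OUTLIER_MS = 500;                                         # hypothetical threshold
my @SCALES     = ( 1, 10, 60, 600 );                          # window sizes in seconds
my %MAX_RATE   = ( 1 => 1, 10 => 3, 60 => 10, 600 => 50 );    # made-up per-window limits

my %count;    # $count{$scale}{$window} = outliers seen in that window

# One pass over the log: bucket every outlier at every scale at once
while (<>) {
    my ( $ts, $ms ) = split;                    # assumes "epoch_seconds response_ms"
    next unless defined $ms && $ms > $OUTLIER_MS;
    $count{$_}{ int( $ts / $_ ) }++ for @SCALES;
}

# Any window, at any scale, holding more outliers than chance allows is suspect
for my $scale (@SCALES) {
    for my $window ( sort { $a <=> $b } keys %{ $count{$scale} } ) {
        my $n = $count{$scale}{$window};
        next unless $n > $MAX_RATE{$scale};
        printf "scale=%4ds  window=[%d, %d)  outliers=%d\n",
            $scale, $window * $scale, ( $window + 1 ) * $scale, $n;
    }
}
```

Memory stays proportional to the number of non-empty windows rather than the size of the log, since only windows that actually contain outliers are ever stored.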
What to do, what to do?

Disclaimer: While this is a real problem, I am not really as pressed for time as the mock scenario would imply. The data also isn't really that massive, and a multi-pass approach is what I will do if I don't come up with better options. Finally, I am actually working on both teams. I presented it the way I did to try to solicit ideas I haven't already thought of.
Cheers - L~R
Replies are listed 'Best First'.
Re: Identifying Outlier Clusters
by BrowserUk (Patriarch) on Feb 16, 2010 at 23:09 UTC

Re: Identifying Outlier Clusters
by GrandFather (Saint) on Feb 16, 2010 at 22:29 UTC

Re: Identifying Outlier Clusters
by scorpio17 (Canon) on Feb 17, 2010 at 14:26 UTC