jplindstrom has asked for the wisdom of the Perl Monks concerning the following question:

I'm plotting database performance values with the excellent (excellent!) Chart::Strip module.

One problem is that I have a group of values, e.g. wait events of different types, sharing a chart. But the typical value for, say, "Take Lock" is normally an order of magnitude lower than that for "Output to network". So I put these two series on two different charts, so that the scale of each chart is meaningful.

But let's say that sometimes something happens and the values for "Take Lock" go wild. They should then be plotted on the same chart as "Output to network", because the values are now of similar magnitude.

I'd like to have two charts, and dynamically choose on which chart to plot the values. Large values go to the "High" chart, and small values go to the "Low" chart. So I need to somehow cluster the values into two groups that are fairly cohesive, so that all the values that end up on one chart share a relevant scale.

But, being mathematically challenged, I have no idea whatsoever how to do this :) What's this kind of thing called (to Google for it)? What kind of strategies can I use to try things out? Some kind of statistics thingy?

(My thought so far, which may be enough to solve the problem, is to just split the values over/under the mean. But that may be naive.)

/J

Replies are listed 'Best First'.
Re: [OT] Grouping/clustering of values
by tilly (Archbishop) on Apr 07, 2005 at 13:34 UTC
    If you want to chart things of very different magnitudes on the same graph (some of them themselves varying by orders of magnitude), then my first thought would be to use a log scale on the value axis. It doesn't look like this module supports that natively, but nothing stops you from taking logs of your values yourself (log($x)/log(10) would be traditional, giving log base 10). Now all values fit sensibly on one graph and you don't have to figure out how to split them.
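
    A minimal sketch of the log transform, assuming each series is an arrayref of { time => $epoch, value => $v } hashes, which is the shape Chart::Strip's add_data() expects (the chart title and label below are made up):

        use Chart::Strip;

        # Transform values to log10 before plotting. log() is undefined
        # for values <= 0, so skip those points.
        my @log_data = map {
            { time => $_->{time}, value => log( $_->{value} ) / log(10) }
        } grep { $_->{value} > 0 } @data;

        my $chart = Chart::Strip->new( title => 'Wait events (log10 scale)' );
        $chart->add_data( \@log_data, { label => 'Take Lock', style => 'line' } );
        print $chart->png();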

    Common examples of data that we normally scale this way are sound volume and earthquake intensity: decibels and the Richter scale are both logs of intensity.

    If you really want to take your original approach, I would take the largest dataset and group it with everything within a factor of 10 of its size; everything else goes in group 2, as in the sketch below. But then you have to handle the case where a dataset keeps jumping from one chart to the other, and you still can't get a good graph of anything whose values vary by more than a couple of orders of magnitude.
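
    A rough sketch of that split, assuming %series maps a label to an arrayref of values (the names here are hypothetical):

        use List::Util qw(max);

        # Peak value of each series, and the overall peak.
        my %peak    = map { $_ => max @{ $series{$_} } } keys %series;
        my $biggest = max values %peak;

        # Series within a factor of 10 of the biggest go on the "high"
        # chart; everything else goes on the "low" chart.
        my ( @high, @low );
        for my $label ( keys %series ) {
            if   ( $peak{$label} >= $biggest / 10 ) { push @high, $label }
            else                                    { push @low,  $label }
        }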

    I'd suggest trying the log scale first.

Re: [OT] Grouping/clustering of values
by chb (Deacon) on Apr 07, 2005 at 14:02 UTC
    You could google for k-means clustering.
      Yup, sounds like you want k-means with k=2 or (possibly) fuzzy c-means clustering with c=2 or c=3.
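
      For the one-dimensional k=2 case you don't even need a module; a tiny hand-rolled version (a sketch, with made-up sample data) looks like this:

          use List::Util qw(min max sum);

          # 1-D k-means with k=2: start the centroids at the extremes,
          # then alternate assignment and re-centering until stable.
          sub two_means {
              my @values   = @_;
              my @centroid = ( min(@values), max(@values) );
              my ( $low, $high );
              for ( 1 .. 20 ) {    # converges in a few passes in 1-D
                  ( $low, $high ) = ( [], [] );
                  for my $v (@values) {
                      if ( abs( $v - $centroid[0] ) <= abs( $v - $centroid[1] ) ) {
                          push @$low, $v;
                      }
                      else {
                          push @$high, $v;
                      }
                  }
                  last unless @$low && @$high;    # degenerate: all values alike
                  my @new = ( sum(@$low) / @$low, sum(@$high) / @$high );
                  last if $new[0] == $centroid[0] && $new[1] == $centroid[1];
                  @centroid = @new;
              }
              return ( $low, $high );
          }

          my ( $low, $high ) = two_means( 3, 5, 4, 120, 98, 4, 110 );
          # $low is [3, 5, 4, 4], $high is [120, 98, 110]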