It sounds like you may want to use K-means clustering (or a variant thereof). Algorithm::Cluster has a number of clustering routines, including K-means. I haven't used it, but it might be a good place to start.
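If you want to experiment before committing, here's a minimal, untested sketch of how a kcluster() call might look. This assumes Algorithm::Cluster's named-parameter interface, and note that K-means needs the number of groups chosen up front, which may not fit your problem:

use strict;
use warnings;
use Algorithm::Cluster qw(kcluster);

my @values = (100.20, 100.23, 100.35, 122.45, 122.55, 122.67, 122.75);

# kcluster works on a matrix, so treat each value as a 1-D point
my @data = map { [$_] } @values;
my @mask = map { [1] } @values;    # 1 == datum present

my ($clusters, $error, $found) = kcluster(
    nclusters => 2,        # K-means needs the cluster count up front
    data      => \@data,
    mask      => \@mask,
    weight    => [1.0],
    transpose => 0,
    npass     => 10,       # random restarts; the best result is kept
    method    => 'a',      # cluster centre is the arithmetic mean
    dist      => 'e',      # Euclidean distance
);

# $clusters->[$i] is the cluster index assigned to $values[$i]
print "$values[$_] => cluster $clusters->[$_]\n" for 0 .. $#values;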
HTH
Hmm, bobf, that's a very nice reference, but what he describes fits the QT-clust algorithm described just below much better than the K-means method: it's distinctly a threshold criterion he's using, and there's no fixed number of clusters.
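For what it's worth, the QT idea is simple enough to sketch directly for 1-D data. This is my own rough, untested illustration (not Algorithm::Cluster's implementation), using group diameter rather than distance-from-average as the quality threshold:

use strict;
use warnings;

# QT ('quality threshold') clustering, 1-D sketch: grow a candidate
# cluster around every point, keep the largest one whose diameter
# stays within the threshold, remove its members, and repeat.
# (Duplicate values would be swept up together; fine for a sketch.)
sub qt_cluster {
    my ($threshold, @points) = @_;
    my @clusters;
    while (@points) {
        my $best = [];
        for my $seed (@points) {
            my @cand = sort { abs($a - $seed) <=> abs($b - $seed) } @points;
            my @cluster;
            for my $p (@cand) {
                my ($min, $max) = (sort { $a <=> $b } (@cluster, $p))[0, -1];
                last if $max - $min > $threshold;
                push @cluster, $p;
            }
            $best = \@cluster if @cluster > @$best;
        }
        push @clusters, $best;
        my %taken = map { $_ => 1 } @$best;
        @points = grep { !$taken{$_} } @points;
    }
    return @clusters;
}

# diameter threshold of 0.3 ~ twice the 0.15 radius used in this thread
print join(', ', @$_), "\n" for qt_cluster(0.3, 100.20, 100.23, 122.45, 122.55);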
As bobf points out, your problem is very much like a statistical clustering task, where the object is to figure out how data is, or should be, grouped. If finding statistical clusters is your underlying task and you want something close to an optimal clustering, then Algorithm::Cluster is a good bet. It will, however, involve installing the underlying C software and investing some time learning about clustering.
On the other hand, you may just want to do what you said and be satisfied with any grouping that meets the criterion, even if it's a poor grouping. If any valid grouping is good enough, you can just walk the sorted list, maintaining an average (or a running average). Each time you look at a new point, check whether the average with that point included puts either the first point in the group, or the new point itself, more than 'X' from the average. If the new point stretches the group too far, then all points up to that one form a group, and the new point starts the next group.
my @refsOfGroups = ();
my @group;

# assumes @pointValues holds the sorted values and $x_threshold is 'X'
while (@pointValues) {
    my $point = shift @pointValues;
    my $avg   = average(@group, $point);
    if ( @group
         && (   $avg - $group[0] > $x_threshold
             || $point - $avg    > $x_threshold )
       ) {
        # the new point stretches the group too far: close off the
        # current group and start a new one
        push @refsOfGroups, [@group];
        @group = ();
    }
    push @group, $point;
} # while
push @refsOfGroups, [@group] if @group;    # include the last group

sub average {
    my $sum = 0;
    $sum += $_ for @_;
    return $sum / @_;
}
(The code is untested, but it's pretty simple.)
Note that if you have a lot of points and a lot of big groups, the computation of average() can switch to a running average, where you don't add up all the points each time; you just fold in the new value. (This also helps prevent overflow when the sums in the average get too big.) Also note that you don't have to take the absolute value of the difference from $avg in the if(), because you know $avg is at least as big as the first point in the group and no larger than the one you just included in the average.
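A minimal sketch of that running-average update (my own illustration, same assumptions as the code above):

# Running average: fold each new value into the average instead of
# re-summing the whole group every time:
#     new_avg = old_avg + (new_point - old_avg) / new_count
my ($avg, $n) = (0, 0);
for my $point (@group) {
    $n++;
    $avg += ($point - $avg) / $n;
}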
I hope this is what you need, or that it's good enough, because this is simple to do and statistical clustering isn't, even if someone else has written the routines. Good luck with your clustering.
The following code seems to do what you are after, although a module solution may be more appropriate.
use warnings;
use strict;

use constant kMaxDelta => 0.15;

my @rawData = (<DATA>);
chomp @rawData;
@rawData = sort {$a <=> $b} @rawData;

my @groups;
my @currGroup = (shift @rawData);

# undef acts as a sentinel to flush the final group
for (@rawData, undef) {
    if (defined $_
        and abs ($_ - sum (@currGroup, $_) / (@currGroup + 1)) <= kMaxDelta) {
        # Ok to add to current group
        push @currGroup, $_;
    } else {
        # time for a new group
        push @groups, [@currGroup];
        @currGroup = ($_);
    }
}

my $count = 1;
for my $group (@groups) {
    if (@$group == 1) {
        print "Outlier: $group->[0]\n\n";
    } else {
        my $avg = sum (@$group) / @$group;
        my @outliers;
        my @ok;

        abs ($avg - $_) > kMaxDelta ? push @outliers, $_ : push @ok, $_
            for @$group;

        printf "group$count: Average: %.2f\n", $avg;
        printf "$_ diff=%.2f\n", abs ($avg - $_) for (@ok);
        print "\n";
        ++$count;
        printf "Outlier: $_ since diff=%.2f>X\n\n", abs ($avg - $_)
            for @outliers;
    }
}

sub sum {
    my $sum = shift;
    $sum += $_ for @_;
    return $sum;
}
__DATA__
100.20
100.23
100.35
122.43
122.55
122.67
122.75
145.88
145.97
146.01
146.10
Prints:
group1: Average: 100.26
100.20 diff=0.06
100.23 diff=0.03
100.35 diff=0.09

group2: Average: 122.60
122.55 diff=0.05
122.67 diff=0.07
122.75 diff=0.15

Outlier: 122.43 since diff=0.17>X

group3: Average: 145.99
145.88 diff=0.11
145.97 diff=0.02
146.01 diff=0.02
146.10 diff=0.11
Note that the results you gave are inconsistent with the data you gave. I have altered 122.45 to 122.43 to give the same outlier result, but that has altered the deltas for the second group.
I've enjoyed pondering this question. There is a module on CPAN that seems appropriate. It is Set::Partition::SimilarValues. It was a little tricky to figure out how to get it to partition in a way that matched your needs, but this seems to do the trick. You may have to adjust the GroupSeparationFactor to suit your needs, but the rest is pretty much automatic. Here's a snippet:
use strict;
use warnings;
use Data::Dumper;
use Set::Partition::SimilarValues;
chomp( my( @numbers ) = <DATA> );
my $set_obj = Set::Partition::SimilarValues->new(
    GroupSeparationFactor => 1.15
);
my @sets = $set_obj->find_groups( @numbers );
print Dumper \@sets;
__DATA__
100.20
100.23
100.35
122.45
122.55
122.67
122.75
145.88
145.97
146.01
146.10
...and the output...
$VAR1 = [
          [
            '100.20',
            '100.23',
            '100.35'
          ],
          [
            '122.45',
            '122.55',
            '122.67',
            '122.75'
          ],
          [
            '145.88',
            '145.97',
            '146.01',
            '146.10'
          ]
        ];
Enjoy!
With 122.45 included, the group's average works out such that 122.45 itself breaks the 0.15 limit:

122.45 => 0.155  * greater than 0.15
122.55 => 0.055
122.67 => 0.065
122.75 => 0.145
======
490.42 / 4 = 122.605 (average)
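A quick, hypothetical snippet to double-check that arithmetic (List::Util's sum does the adding):

use strict;
use warnings;
use List::Util qw(sum);

my @vals = (122.45, 122.55, 122.67, 122.75);
my $avg  = sum(@vals) / @vals;    # 490.42 / 4 = 122.605

# print each value's distance from the average, flagging the outlier
printf "%.2f => %.3f%s\n", $_, abs($_ - $avg),
    abs($_ - $avg) > 0.15 ? ' * greater than 0.15' : ''
    for @vals;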
How optimal do your groupings need to be? And what do you consider to be optimal?
For example, in the following set of numbers, an algorithm that starts a new group whenever adding the next value would cause the current grouping to break the limit will put the first two values, 122.45 & 122.55, in a group by themselves, because adding the third value, 122.67, pushes the first value outside the lower boundary. The next five values can then form a second group.
However, isolating that first value in its own group allows the next four values to become a group, leaving the last two as a third group.
That's not well demonstrated here (coming up with a good demonstration is laborious), but it is possible to see that it will sometimes be better to drop the first value from a grouping, rather than the last one added, in order to incorporate as many values as possible into a group. (A code sketch after the table below contrasts the approaches.)
Hence the questions at the top. Possibilities for optimal might be:
- Smallest number of groups
- Biggest group sizes.
- Even group sizes.
- Quickest to calculate (that doesn't place each value in a separate group :)
- Other?
(Column N appears to show the running average of the first N values, with low/high limits 0.15 either side of it.)

values        2         3         4         5         6         7
          122.35    122.517   122.455   122.488   122.483   122.533   low limit
122.45
           122.5
122.55              122.567
                              122.605
122.67                                  122.638
                                                  122.633
122.75                                                      122.683
122.77
122.79
122.80
          122.65    122.817   122.755   122.788   122.783   122.833   high limit
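To make the trade-off concrete, here's a rough, untested sketch of the first greedy strategy over those values; the 'isolate the first value' variant would differ only in which member gets evicted when the limit breaks:

use strict;
use warnings;
use List::Util qw(sum);

my $limit  = 0.15;
my @values = (122.45, 122.55, 122.67, 122.75, 122.77, 122.79, 122.80);

# Greedy: extend the current group until adding the next value would
# push some member more than $limit from the new average, then start
# a fresh group with that value.
my @groups = ([shift @values]);
for my $v (@values) {
    my @try = (@{$groups[-1]}, $v);
    my $avg = sum(@try) / @try;
    if (grep { abs($_ - $avg) > $limit } @try) {
        push @groups, [$v];    # variant: evict $groups[-1][0] instead
    } else {
        $groups[-1] = \@try;
    }
}
print join(', ', @$_), "\n" for @groups;

Depending on exactly how the limit test is phrased, the split points will differ, which is rather the point: the "best" grouping depends on what you optimise for.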
Some monks have pointed to Algorithm::Cluster, which relies on the C Clustering Library. If you don't have a C compiler handy, you can simply use ppm to install a pre-built copy of Algorithm::Cluster. Just issue:

ppm install http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/Algorithm-Cluster.ppd
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] [d/l] |