It sounds like you may want to use K-means clustering (or a variant thereof). Algorithm::Cluster has a number of clustering routines, including K-means. I haven't used it, but it might be a good place to start.
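If you want to experiment before committing, here's a minimal, untested sketch of how a kcluster() call might look. This assumes Algorithm::Cluster's named-parameter interface, and note that K-means needs the number of groups chosen up front, which may not fit your problem:

use strict;
use warnings;
use Algorithm::Cluster qw(kcluster);

my @values = (100.20, 100.23, 100.35, 122.45, 122.55, 122.67, 122.75);

# kcluster works on a matrix, so treat each value as a 1-D point
my @data = map { [$_] } @values;
my @mask = map { [1] } @values;    # 1 == datum present

my ($clusters, $error, $found) = kcluster(
    nclusters => 2,        # K-means needs the cluster count up front
    data      => \@data,
    mask      => \@mask,
    weight    => [1.0],
    transpose => 0,
    npass     => 10,       # random restarts; the best result is kept
    method    => 'a',      # cluster centre is the arithmetic mean
    dist      => 'e',      # Euclidean distance
);

# $clusters->[$i] is the cluster index assigned to $values[$i]
print "$values[$_] => cluster $clusters->[$_]\n" for 0 .. $#values;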
HTH
Hmm, bobf, that's a very nice reference, but what he describes fits the QT-clust algorithm described just below much better than the K-means method: it's distinctly a threshold criterion he's using, and there's no fixed number of clusters.
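For what it's worth, the QT idea is simple enough to sketch directly for 1-D data. This is my own rough, untested illustration (not Algorithm::Cluster's implementation), using group diameter rather than distance-from-average as the quality threshold:

use strict;
use warnings;

# QT ('quality threshold') clustering, 1-D sketch: grow a candidate
# cluster around every point, keep the largest one whose diameter
# stays within the threshold, remove its members, and repeat.
# (Duplicate values would be swept up together; fine for a sketch.)
sub qt_cluster {
    my ($threshold, @points) = @_;
    my @clusters;
    while (@points) {
        my $best = [];
        for my $seed (@points) {
            my @cand = sort { abs($a - $seed) <=> abs($b - $seed) } @points;
            my @cluster;
            for my $p (@cand) {
                my ($min, $max) = (sort { $a <=> $b } (@cluster, $p))[0, -1];
                last if $max - $min > $threshold;
                push @cluster, $p;
            }
            $best = \@cluster if @cluster > @$best;
        }
        push @clusters, $best;
        my %taken = map { $_ => 1 } @$best;
        @points = grep { !$taken{$_} } @points;
    }
    return @clusters;
}

# diameter threshold of 0.3 ~ twice the 0.15 radius used in this thread
print join(', ', @$_), "\n" for qt_cluster(0.3, 100.20, 100.23, 122.45, 122.55);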
As bobf points out, your problem is very much like a statistical clustering task, where the object is to figure out how data is, or should be, grouped. If finding statistical clusters is your underlying task and you want something close to an optimal clustering, then Algorithm::Cluster is a good bet. It will, however, involve installing the underlying C software and investing some time learning about clustering.
On the other hand, you may just want to do what you said and be satisfied with any grouping that meets the criterion, even if it's a poor grouping. If any valid grouping is good enough, you can just walk the sorted list, maintaining an average (or a running average). Each time you look at a new point, check whether the average with that point included puts either the first point in the group, or the new point itself, more than 'X' from the average. If the new point stretches the group too far, then all points up to that one form a group, and the new point starts the next group.
my @refsOfGroups = ();
my @group;

# assumes @pointValues holds the sorted values and $x_threshold is 'X'
while (@pointValues) {
    my $point = shift @pointValues;
    my $avg   = average(@group, $point);
    if ( @group
         && (   $avg - $group[0] > $x_threshold
             || $point - $avg    > $x_threshold )
       ) {
        # the new point stretches the group too far: close off the
        # current group and start a new one
        push @refsOfGroups, [@group];
        @group = ();
    }
    push @group, $point;
} # while
push @refsOfGroups, [@group] if @group;    # include the last group

sub average {
    my $sum = 0;
    $sum += $_ for @_;
    return $sum / @_;
}
(The code is untested, but it's pretty simple.)
Note that if you have a lot of points and a lot of big groups, the computation of average() can switch to a running average, where you don't add up all the points each time; you just fold in the new value. (This also helps prevent overflow when the sums in the average get too big.) Also note that you don't have to take the absolute value of the difference from $avg in the if(), because you know $avg is at least as big as the first point in the group and no larger than the one you just included in the average.
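A minimal sketch of that running-average update (my own illustration, same assumptions as the code above):

# Running average: fold each new value into the average instead of
# re-summing the whole group every time:
#     new_avg = old_avg + (new_point - old_avg) / new_count
my ($avg, $n) = (0, 0);
for my $point (@group) {
    $n++;
    $avg += ($point - $avg) / $n;
}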
I hope this is what you need, or that it's good enough, because this is simple to do and statistical clustering isn't, even if someone else has written the routines. Good luck with your clustering.
The following code seems to do what you are after, although a module solution may be more appropriate.
use warnings;
use strict;

use constant kMaxDelta => 0.15;

my @rawData = (<DATA>);
chomp @rawData;
@rawData = sort {$a <=> $b} @rawData;

my @groups;
my @currGroup = (shift @rawData);

# undef acts as a sentinel to flush the final group
for (@rawData, undef) {
    if (defined $_
        and abs ($_ - sum (@currGroup, $_) / (@currGroup + 1)) <= kMaxDelta) {
        # Ok to add to current group
        push @currGroup, $_;
    } else {
        # time for a new group
        push @groups, [@currGroup];
        @currGroup = ($_);
    }
}

my $count = 1;
for my $group (@groups) {
    if (@$group == 1) {
        print "Outlier: $group->[0]\n\n";
    } else {
        my $avg = sum (@$group) / @$group;
        my @outliers;
        my @ok;

        abs ($avg - $_) > kMaxDelta ? push @outliers, $_ : push @ok, $_
            for @$group;

        printf "group$count: Average: %.2f\n", $avg;
        printf "$_ diff=%.2f\n", abs ($avg - $_) for (@ok);
        print "\n";
        ++$count;
        printf "Outlier: $_ since diff=%.2f>X\n\n", abs ($avg - $_)
            for @outliers;
    }
}

sub sum {
    my $sum = shift;
    $sum += $_ for @_;
    return $sum;
}
__DATA__
100.20
100.23
100.35
122.43
122.55
122.67
122.75
145.88
145.97
146.01
146.10
Prints:
group1: Average: 100.26
100.20 diff=0.06
100.23 diff=0.03
100.35 diff=0.09

group2: Average: 122.60
122.55 diff=0.05
122.67 diff=0.07
122.75 diff=0.15

Outlier: 122.43 since diff=0.17>X

group3: Average: 145.99
145.88 diff=0.11
145.97 diff=0.02
146.01 diff=0.02
146.10 diff=0.11
Note that the results you gave are inconsistent with the data you gave. I have altered 122.45 to 122.43 to give the same outlier result, but that has altered the deltas for the second group.
I've enjoyed pondering this question. There is a module on CPAN that seems appropriate. It is Set::Partition::SimilarValues. It was a little tricky to figure out how to get it to partition in a way that matched your needs, but this seems to do the trick. You may have to adjust the GroupSeparationFactor to suit your needs, but the rest is pretty much automatic. Here's a snippet:
use strict;
use warnings;
use Data::Dumper;
use Set::Partition::SimilarValues;
chomp( my( @numbers ) = <DATA> );
my $set_obj = Set::Partition::SimilarValues->new(
    GroupSeparationFactor => 1.15
);
my @sets = $set_obj->find_groups( @numbers );
print Dumper \@sets;
__DATA__
100.20
100.23
100.35
122.45
122.55
122.67
122.75
145.88
145.97
146.01
146.10
...and the output...
$VAR1 = [
          [
            '100.20',
            '100.23',
            '100.35'
          ],
          [
            '122.45',
            '122.55',
            '122.67',
            '122.75'
          ],
          [
            '145.88',
            '145.97',
            '146.01',
            '146.10'
          ]
        ];
Enjoy!
With 122.45 included, the group's average works out such that 122.45 itself breaks the 0.15 limit:

122.45 => 0.155  * greater than 0.15
122.55 => 0.055
122.67 => 0.065
122.75 => 0.145
======
490.42 / 4 = 122.605 (average)
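A quick, hypothetical snippet to double-check that arithmetic (List::Util's sum does the adding):

use strict;
use warnings;
use List::Util qw(sum);

my @vals = (122.45, 122.55, 122.67, 122.75);
my $avg  = sum(@vals) / @vals;    # 490.42 / 4 = 122.605

# print each value's distance from the average, flagging the outlier
printf "%.2f => %.3f%s\n", $_, abs($_ - $avg),
    abs($_ - $avg) > 0.15 ? ' * greater than 0.15' : ''
    for @vals;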
How optimal do your groupings need to be? And what do you consider to be optimal?
For example, in the following set of numbers, an algorithm that starts a new group whenever adding the next value would cause the current grouping to break the limit will put the first two values, 122.45 & 122.55, in a group by themselves, because adding the third value, 122.67, pushes the first value outside the lower boundary. The next five values can then form a second group.
However, isolating that first value in its own group allows the next four values to become a group, leaving the last two as a third group.
That's not well demonstrated here (coming up with a good demonstration is laborious), but it is possible to see that it will sometimes be better to drop the first value from a grouping, rather than the last one added, in order to incorporate as many values as possible into a group. (A code sketch after the table below contrasts the approaches.)
Hence the questions at the top. Possibilities for optimal might be:
- Smallest number of groups
- Biggest group sizes.
- Even group sizes.
- Quickest to calculate (that doesn't place each value in a separate group :)
- Other?
(Column N appears to show the running average of the first N values, with low/high limits 0.15 either side of it.)

values        2         3         4         5         6         7
          122.35    122.517   122.455   122.488   122.483   122.533   low limit
122.45
           122.5
122.55              122.567
                              122.605
122.67                                  122.638
                                                  122.633
122.75                                                      122.683
122.77
122.79
122.80
          122.65    122.817   122.755   122.788   122.783   122.833   high limit
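To make the trade-off concrete, here's a rough, untested sketch of the first greedy strategy over those values; the 'isolate the first value' variant would differ only in which member gets evicted when the limit breaks:

use strict;
use warnings;
use List::Util qw(sum);

my $limit  = 0.15;
my @values = (122.45, 122.55, 122.67, 122.75, 122.77, 122.79, 122.80);

# Greedy: extend the current group until adding the next value would
# push some member more than $limit from the new average, then start
# a fresh group with that value.
my @groups = ([shift @values]);
for my $v (@values) {
    my @try = (@{$groups[-1]}, $v);
    my $avg = sum(@try) / @try;
    if (grep { abs($_ - $avg) > $limit } @try) {
        push @groups, [$v];    # variant: evict $groups[-1][0] instead
    } else {
        $groups[-1] = \@try;
    }
}
print join(', ', @$_), "\n" for @groups;

Depending on exactly how the limit test is phrased, the split points will differ, which is rather the point: the "best" grouping depends on what you optimise for.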
Some monks have pointed to Algorithm::Cluster, which relies on the C Clustering Library. If you don't have a C compiler handy, you can simply use ppm to install a pre-built copy of Algorithm::Cluster. Just issue:

ppm install http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/Algorithm-Cluster.ppd
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] [d/l] |