Re: Making sense of data: Clustering OR A coding challenge

Something similar to k-means can be achieved with Statistics::Descriptive and its frequency_distribution, which counts the data into equal-width bins.

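use strict;
use warnings;
use Statistics::Descriptive;

my $stat = Statistics::Descriptive::Full->new();
# read the single comma-separated line that follows __DATA__
chomp( my $line = <DATA> );
$stat->add_data( split /\s*,\s*/, $line );
# 4 equal-width bins; each key is the upper bound of a bin
my %f = $stat->frequency_distribution(4);

my $min = 0;
for ( sort { $a <=> $b } keys %f ) {
    printf "[%d\t-%d\t] %d\n", $min, $_, $f{$_};
    $min = $_ + 1;
}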
__DATA__
0, 12, 25, 38, 50, 62, 75, 88, 100
__END__
# prints 
[0      -25     ] 3
[26     -50     ] 2
[51     -75     ] 2
[76     -100    ] 2

Re^2: Making sense of data: Clustering OR A coding challenge by belg4mit (Prior) on Apr 04, 2006 at 15:24 UTC
Interesting, although this method seems to favor outliers. With my sample dataset it creates more bins with few or no items at the high end of the spectrum, whereas nearly everything else is still lumped into the first bin.

[0      -296    ] 86
[297    -580    ] 8
[581    -864    ] 4
[865    -1148   ] 1
[1149   -1432   ] 2
[1433   -1716   ] 0
[1717   -2000   ] 1

vs. kvale's

[12     -24     ] 33
[42     -76     ] 27
[80     -128    ] 14
[150    -250    ] 9
[280    -460    ] 9
[550    -950    ] 7
[1226   -2000   ] 3

-- In Bob We Trust, All Others Bring Data.
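
To make the outlier effect concrete, here is a minimal sketch of plain equal-width binning in core Perl. It is not the module's exact algorithm (boundary handling may differ), and both the helper name equal_width_counts and the sample numbers are made up for illustration. The bin width is (max - min) / k, so a single large value stretches every bin.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Count values into $k equal-width bins spanning [min, max].
# Hypothetical helper, not part of Statistics::Descriptive.
sub equal_width_counts {
    my ( $k, @values ) = @_;
    my ( $lo, $hi ) = ( min(@values), max(@values) );
    my $width  = ( $hi - $lo ) / $k;
    my @counts = (0) x $k;
    for my $v (@values) {
        my $i = $width ? int( ( $v - $lo ) / $width ) : 0;
        $i = $k - 1 if $i >= $k;    # the maximum belongs to the last bin
        $counts[$i]++;
    }
    return @counts;
}

# Made-up skewed data: many small values plus one large outlier.
my @skewed = ( 12, 15, 18, 20, 24, 30, 42, 55, 76, 2000 );
print join( ' ', equal_width_counts( 4, @skewed ) ), "\n";
```

With this made-up input the width is 497, so nine of the ten values land in the first bin and only the outlier reaches the last one.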
Re^3: Making sense of data: Clustering OR A coding challenge by codeacrobat (Chaplain) on Apr 05, 2006 at 06:21 UTC
Then you might be interested in the `$stat->frequency_distribution(\@bins);` notation, which takes your own list of bin upper bounds, so you can add more bins at lower values.
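
A small sketch of that notation, reusing the data from the first post. The bin bounds in @bins are arbitrary values picked for illustration; each one is treated as the upper limit of a bin, and as far as I can tell values above the last bound are simply not counted.

```perl
use strict;
use warnings;
use Statistics::Descriptive;

my @data = ( 0, 12, 25, 38, 50, 62, 75, 88, 100 );
my @bins = ( 10, 25, 50, 100 );    # arbitrary upper bounds, denser at the low end

my $stat = Statistics::Descriptive::Full->new();
$stat->add_data(@data);

# keys of %f are the bounds from @bins, values are the counts
my %f = $stat->frequency_distribution( \@bins );

for my $upper ( sort { $a <=> $b } keys %f ) {
    printf "<= %-4d %d\n", $upper, $f{$upper};
}
```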
Re^4: Making sense of data: Clustering OR A coding challenge by belg4mit (Prior) on Apr 05, 2006 at 14:19 UTC
Except that the point was to find the natural bins inherent in the data :-P
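
One simple way to let the data pick its own bins, offered only as a rough sketch and not as what kvale used: sort the values and cut at the k-1 widest gaps. The helper name split_at_gaps and the sample numbers are made up. On strongly skewed data the absolute gaps will again be largest at the high end, so one would probably want to apply this to the logarithms of the values instead.

```perl
use strict;
use warnings;

# Split values into up to $k groups by cutting at the $k-1 widest gaps.
# Hypothetical helper for illustration only.
sub split_at_gaps {
    my ( $k, @values ) = @_;
    my @sorted = sort { $a <=> $b } @values;

    # index $i stands for the gap between $sorted[$i] and $sorted[$i + 1]
    my @by_gap = sort {
        ( $sorted[ $b + 1 ] - $sorted[$b] ) <=> ( $sorted[ $a + 1 ] - $sorted[$a] )
    } 0 .. $#sorted - 1;

    my $n_cuts = $k - 1 < @by_gap ? $k - 1 : scalar @by_gap;
    my %cut_after = map { $_ => 1 } @by_gap[ 0 .. $n_cuts - 1 ];

    my @groups = ( [] );
    for my $i ( 0 .. $#sorted ) {
        push @{ $groups[-1] }, $sorted[$i];
        push @groups, [] if $cut_after{$i};    # start a new group after a cut
    }
    return @groups;
}

# Made-up data with three visible clumps.
for my $g ( split_at_gaps( 3, 1, 2, 3, 10, 11, 12, 50, 52 ) ) {
    printf "[%d\t-%d\t] %d\n", $g->[0], $g->[-1], scalar @$g;
}
```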