in reply to Picking the best points

I'd partition the data into as many buckets as the number of points you want to end up with, then select the best value from each bucket according to whatever weighting is appropriate. Consider:

#!/usr/bin/perl
use strict;
use warnings;

my $numBuckets = 20;

# Read x, y, dy triplets from the __DATA__ section (the OP's data, omitted here).
# 'y' must be quoted: a bare y is Perl's transliteration operator.
my @points = map {{x => $_->[0], 'y' => $_->[1], dy => $_->[2]}}
    map {chomp; [split]} <DATA>;

my @buckets;
my $min = $points[0]{x};
my $max = $points[0]{x};

for my $point (@points) {
    $min = $point->{x} if $min > $point->{x};
    $max = $point->{x} if $max < $point->{x};
}

my $scale = ($max - $min) / $numBuckets;

for my $point (@points) {
    my $index = int(($point->{x} - $min) / $scale);
    $index = $numBuckets - 1 if $index >= $numBuckets;    # clamp: x == $max
    push @{$buckets[$index]}, $point;
}

for my $bucket (@buckets) {
    # Sort contents of bucket by weighting function (here: smallest dy first)
    next if !defined $bucket;
    @$bucket = sort {$a->{dy} <=> $b->{dy}} @$bucket;
}

for my $index (0 .. $numBuckets - 1) {
    printf "%3d: ", $index;
    printf "%.4f, %.4f, %.4f", @{$buckets[$index][0]}{qw(x y dy)}
        if defined $buckets[$index];
    print "\n";
}

using the data in the OP prints:

  0: 0.0345, 0.9916, 0.0013
  1: 0.0499, 0.9876, 0.0011
  2: 0.1340, 0.9659, 0.0012
  3: 0.1635, 0.9578, 0.0011
  4: 0.2149, 0.9412, 0.0047
  5: 0.2911, 0.9215, 0.0015
  6: 0.2974, 0.9186, 0.0010
  7: 0.3617, 0.8983, 0.0018
  8: 0.4183, 0.8819, 0.0010
  9: 0.4535, 0.8672, 0.0085
 10: 0.5317, 0.8421, 0.0010
 11: 0.5689, 0.8306, 0.0040
 12: 0.5995, 0.8179, 0.0056
 13:
 14:
 15:
 16: 0.8015, 0.7142, 0.0249
 17: 0.8540, 0.6901, 0.0060
 18: 0.9126, 0.6475, 0.0020
 19: 0.9690, 0.5879, 0.0023

which draws fewer than the requested 20 points because the distribution of x values in the original data is very lumpy: buckets 13 through 15 are empty. If you need a fixed number of points and you can't rely on getting at least one datum in each bucket, I'd top up the selection with further points from the buckets holding the greatest number of points, as sketched below.
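A minimal sketch of that top-up pass, assuming the @buckets structure built above (each bucket already sorted by dy, best point first) and a hypothetical target count $wanted:

# Take the best point from each non-empty bucket, then keep drawing the
# next-best point from whichever bucket has the most unused points left,
# until we have $wanted points or run out of data.
my $wanted = 20;    # target number of points (assumed)
my @selected = map {$_->[0]} grep {defined} @buckets;
my @cursor   = map {defined $_ ? 1 : 0} @buckets;    # next unused index per bucket

while (@selected < $wanted) {
    my ($best, $bestLeft) = (undef, 0);
    for my $i (0 .. $#buckets) {
        next if !defined $buckets[$i];
        my $left = @{$buckets[$i]} - $cursor[$i];
        ($best, $bestLeft) = ($i, $left) if $left > $bestLeft;
    }
    last if !defined $best;    # every bucket exhausted
    push @selected, $buckets[$best][$cursor[$best]++];
}

Drawing from the fullest bucket each time keeps the extra points roughly proportional to where the data actually are, while the first pass still guarantees coverage of every populated region.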

True laziness is hard work

Re^2: Picking the best points
by kennethk (Abbot) on Oct 29, 2010 at 16:50 UTC
    The actual datasets are indeed quite lumpy - the source data are thermophysical property measurements, so sets tend to have a very large number of points near 25 C, for example. I like the bucket-oriented process, though I am not opposed to keeping points that are near each other. Points that are close in space and have low reported uncertainties frequently still disagree with each other, which is why I'd like to keep a fairly large number of points. Preliminarily, I'm favoring BrowserUK's suggestion, though I might use buckets as an initial pass to guarantee good spatial coverage, depending on some empirical testing. And in any case, your result looks better than mine.