in reply to Picking the best points

I'd partition the data into as many buckets as the number of points you want to end up with, then select the best value from each bucket according to whatever weighting is appropriate. Consider:

#!/usr/bin/perl
use strict;
use warnings;

my $numBuckets = 20;

# Read x, y, dy triplets from the __DATA__ section (the OP's data, omitted here).
# 'y' must be quoted: a bare y is Perl's transliteration operator.
my @points = map {{x => $_->[0], 'y' => $_->[1], dy => $_->[2]}}
    map {chomp; [split]} <DATA>;

my @buckets;
my $min = $points[0]{x};
my $max = $points[0]{x};

for my $point (@points) {
    $min = $point->{x} if $min > $point->{x};
    $max = $point->{x} if $max < $point->{x};
}

my $scale = ($max - $min) / $numBuckets;

for my $point (@points) {
    my $index = int(($point->{x} - $min) / $scale);
    $index = $numBuckets - 1 if $index >= $numBuckets;    # clamp: x == $max
    push @{$buckets[$index]}, $point;
}

for my $bucket (@buckets) {
    # Sort contents of bucket by weighting function (here: smallest dy first)
    next if !defined $bucket;
    @$bucket = sort {$a->{dy} <=> $b->{dy}} @$bucket;
}

for my $index (0 .. $numBuckets - 1) {
    printf "%3d: ", $index;
    printf "%.4f, %.4f, %.4f", @{$buckets[$index][0]}{qw(x y dy)}
        if defined $buckets[$index];
    print "\n";
}

using the data in the OP prints:

  0: 0.0345, 0.9916, 0.0013
  1: 0.0499, 0.9876, 0.0011
  2: 0.1340, 0.9659, 0.0012
  3: 0.1635, 0.9578, 0.0011
  4: 0.2149, 0.9412, 0.0047
  5: 0.2911, 0.9215, 0.0015
  6: 0.2974, 0.9186, 0.0010
  7: 0.3617, 0.8983, 0.0018
  8: 0.4183, 0.8819, 0.0010
  9: 0.4535, 0.8672, 0.0085
 10: 0.5317, 0.8421, 0.0010
 11: 0.5689, 0.8306, 0.0040
 12: 0.5995, 0.8179, 0.0056
 13:
 14:
 15:
 16: 0.8015, 0.7142, 0.0249
 17: 0.8540, 0.6901, 0.0060
 18: 0.9126, 0.6475, 0.0020
 19: 0.9690, 0.5879, 0.0023

which draws fewer than the requested 20 points because the distribution of x values in the original data is very lumpy: buckets 13 through 15 are empty. If you need a fixed number of points and you can't rely on getting at least one datum in each bucket, I'd top up the selection with further points from the buckets holding the greatest number of points, as sketched below.
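A minimal sketch of that top-up pass, assuming the @buckets structure built above (each bucket already sorted by dy, best point first) and a hypothetical target count $wanted:

# Take the best point from each non-empty bucket, then keep drawing the
# next-best point from whichever bucket has the most unused points left,
# until we have $wanted points or run out of data.
my $wanted = 20;    # target number of points (assumed)
my @selected = map {$_->[0]} grep {defined} @buckets;
my @cursor   = map {defined $_ ? 1 : 0} @buckets;    # next unused index per bucket

while (@selected < $wanted) {
    my ($best, $bestLeft) = (undef, 0);
    for my $i (0 .. $#buckets) {
        next if !defined $buckets[$i];
        my $left = @{$buckets[$i]} - $cursor[$i];
        ($best, $bestLeft) = ($i, $left) if $left > $bestLeft;
    }
    last if !defined $best;    # every bucket exhausted
    push @selected, $buckets[$best][$cursor[$best]++];
}

Drawing from the fullest bucket each time keeps the extra points roughly proportional to where the data actually are, while the first pass still guarantees coverage of every populated region.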

True laziness is hard work

Re^2: Picking the best points
by kennethk (Abbot) on Oct 29, 2010 at 16:50 UTC
    The actual datasets are indeed quite lumpy - the source data are thermophysical property measurements, so sets tend to have a very large number of points near 25 C, for example. I like the bucket-oriented process, though I am not opposed to keeping points that are near each other. Points that are close in space and have low reported uncertainties frequently still disagree with each other, which is why I'd like to keep a fairly large number of points. Preliminarily, I'm favoring BrowserUK's suggestion, though I might use buckets as an initial pass to guarantee good spatial coverage, depending on some empirical testing. And in any case, your result looks better than mine.