kennethk has asked for the wisdom of the Perl Monks concerning the following question:

I have a number of data sets of 0, 1, 2 and 3 independent variables, where each point has a value and an associated uncertainty. Some sets contain only a handful of points, whereas others contain more than 1,000. Having already analyzed the sets and performed some regression/curve fitting, I would like to generate plots to judge the merit of the fit. Because this is all being done in a web context, I wish to include only the 50 best points, to keep the noise and network load down. These points should, first, be fairly well distributed and, where there is a choice, have a low relative uncertainty.

My first cut is fairly straightforward. I randomly generated a set of 100 points with uncertainty characteristics and a distribution roughly similar to one of the 1-d data sets:

#!/usr/bin/perl
use strict;
use warnings;

my @points;
for (1 .. 100) {
    my $x  = rand()**2;
    my $y  = function($x);
    my $dy = 0.01 + $x*0.1*rand()**2;
    $y += offset($dy);
    push @points, [map sprintf("%.4f", $_), $x, $y, $dy];
}

$, = "\n";
$" = "\t";
print map("@$_", @points), "\n";

sub function {
    my ($x) = @_;
    return 0.25 + 0.75*(1-$x)**0.25;
}

sub offset {
    my ($x) = @_;
    return $x*(rand()-0.5)*exp(rand());
}
I then attempted to filter points out based on a weighted sum of the inverse of the relative uncertainty (y/dy) and the sum of the squared distances to all other points:
#!/usr/bin/perl
use strict;
use warnings;

my @points;
push @points, [split] while <DATA>;
$_ = {x => $_->[0], 'y' => $_->[1], dy => $_->[2]} for @points;

for my $point (@points) {
    $point->{value} = $point->{y}/$point->{dy};                # Low relative uncertainty is good
    $point->{dist} += ($_->{x} - $point->{x})**2 for @points;  # Isolated points are good
}

while (@points > 50) {
    my $low_score = score($points[0]);
    my $low_index = 0;
    for my $i (0 .. $#points) {
        if (score($points[$i]) < $low_score) {
            $low_score = score($points[$i]);
            $low_index = $i;
        }
    }
    my $low_point = splice @points, $low_index, 1;
    $_->{dist} -= ($_->{x} - $low_point->{x})**2 for @points;
}

for my $point (@points) {
    print join("\t", $point->{x}, $point->{y}, $point->{dy}), "\n";
}

sub score {
    my ($point) = @_;
    return $point->{value} + 1000*$point->{dist};
}

__DATA__
0.3137 0.9230 0.0213
0.0353 0.9927 0.0028
0.5431 0.8256 0.0893
0.4871 0.8657 0.0098
0.5935 0.8637 0.0645
0.5340 0.7873 0.0932
0.1667 0.9532 0.0062
0.0241 0.9978 0.0070
0.1068 0.9785 0.0142
0.3129 0.9014 0.0286
0.3319 0.9078 0.0028
0.0006 0.9791 0.0429
0.1436 0.9697 0.0285
0.0925 0.9746 0.0053
0.1347 0.9641 0.0024
0.4304 0.8877 0.0364
0.1306 0.9987 0.0502
0.4056 0.8653 0.0634
0.0539 1.0031 0.0352
0.0468 0.9217 0.0943
0.0397 0.9802 0.0503
0.1994 0.9252 0.0486
0.0345 0.9916 0.0013
0.5689 0.8306 0.0040
0.2189 1.0476 0.0927
0.4173 0.8850 0.0149
0.8645 0.6728 0.0556
0.0586 0.9900 0.0819
0.4643 0.7884 0.0659
0.3554 0.9173 0.0410
0.4400 0.9074 0.0831
0.0382 1.0116 0.0461
0.0188 0.9934 0.0069
0.1505 0.9898 0.0540
0.3959 0.8792 0.0195
0.5443 0.8225 0.0148
0.9776 0.5572 0.0651
0.2192 0.9609 0.0370
0.4869 0.8621 0.0211
0.0499 0.9876 0.0011
0.2295 0.9325 0.0204
0.1222 0.9540 0.0418
0.2911 0.9215 0.0015
0.6095 0.8270 0.0235
0.2518 0.8968 0.0456
0.1540 0.9798 0.0529
0.0344 1.0038 0.0227
0.1534 0.9149 0.0608
0.8540 0.6901 0.0060
0.8015 0.7142 0.0249
0.0008 1.0756 0.0734
0.1346 0.9768 0.0183
0.1167 0.9697 0.0013
0.9690 0.5879 0.0023
0.3193 0.9115 0.0036
0.1110 1.0015 0.0546
0.0373 1.0053 0.0231
0.0016 0.9734 0.0946
0.1820 0.9648 0.0209
0.0091 0.9976 0.0013
0.1699 0.9791 0.0343
0.4183 0.8819 0.0010
0.2560 0.9270 0.0091
0.9714 0.5897 0.0251
0.0014 0.9733 0.0304
0.0940 0.9597 0.0524
0.0501 0.9869 0.0530
0.3157 0.9057 0.0220
0.2149 0.9412 0.0047
0.1635 0.9578 0.0011
0.0581 1.0069 0.0311
0.5995 0.8179 0.0056
0.0625 0.9535 0.0919
0.0262 1.0081 0.0191
0.0267 1.0211 0.0902
0.9126 0.6475 0.0020
0.4535 0.8672 0.0085
0.3254 0.9136 0.0065
0.0019 0.9989 0.0592
0.0639 0.9791 0.0083
0.1951 0.9082 0.0838
0.2899 0.9061 0.0741
0.5317 0.8421 0.0010
0.1340 0.9659 0.0012
0.0617 0.9754 0.0609
0.5550 0.8363 0.0087
0.1242 0.9704 0.0071
0.4895 0.8538 0.0040
0.0087 0.9990 0.0064
0.8741 0.6770 0.0111
0.2771 0.9227 0.0024
0.0887 0.9354 0.0545
0.2974 0.9186 0.0010
0.4166 0.8802 0.0093
0.0010 1.0137 0.0992
0.0298 0.9960 0.0119
0.1383 0.9629 0.0034
0.3617 0.8983 0.0018
0.0035 0.9874 0.0340
0.7896 0.7058 0.0764

And it fails terribly. With a relatively even weighting, the low uncertainty points win out and I get a strong cluster at low x. When I increase the weight on the distance term, representation at high x improves, but I end up with big holes in the middle of the graph. Any ideas on how to improve this algorithm? The ideas need to be extensible to higher dimensionality. Speed is not paramount, but is always appreciated.

Re: Picking the best points
by GrandFather (Saint) on Oct 29, 2010 at 08:03 UTC

    I'd partition the data into a number of buckets matching the number of points you want to end up with, then select the best value from each bucket according to whatever weighting is appropriate. Consider:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $numBuckets = 20;
    my @points = map {{x => $_->[0], 'y' => $_->[1], dy => $_->[2]}}
        map {chomp; [split]} <DATA>;

    my @buckets;
    my $min = $points[0]{x};
    my $max = $points[0]{x};

    for my $point (@points) {
        $min = $point->{x} if $min > $point->{x};
        $max = $point->{x} if $max < $point->{x};
    }

    my $scale = ($max - $min) / $numBuckets;

    push @{$buckets[($_->{x} - $min) / $scale]}, $_ for @points;

    for my $bucket (@buckets) {
        # Sort contents of bucket by weighting function
        next if !defined $bucket;
        @$bucket = sort {$a->{dy} <=> $b->{dy}} @$bucket;
    }

    for my $index (0 .. $numBuckets - 1) {
        printf "%3d: ", $index;
        printf "%.4f, %.4f, %.4f", @{$buckets[$index][0]}{qw(x y dy)}
            if defined $buckets[$index];
        print "\n";
    }

    Using the data in the OP, this prints:

      0: 0.0345, 0.9916, 0.0013
      1: 0.0499, 0.9876, 0.0011
      2: 0.1340, 0.9659, 0.0012
      3: 0.1635, 0.9578, 0.0011
      4: 0.2149, 0.9412, 0.0047
      5: 0.2911, 0.9215, 0.0015
      6: 0.2974, 0.9186, 0.0010
      7: 0.3617, 0.8983, 0.0018
      8: 0.4183, 0.8819, 0.0010
      9: 0.4535, 0.8672, 0.0085
     10: 0.5317, 0.8421, 0.0010
     11: 0.5689, 0.8306, 0.0040
     12: 0.5995, 0.8179, 0.0056
     13:
     14:
     15:
     16: 0.8015, 0.7142, 0.0249
     17: 0.8540, 0.6901, 0.0060
     18: 0.9126, 0.6475, 0.0020
     19: 0.9690, 0.5879, 0.0023

    which has drawn too few points because the actual distribution in the original data is very lumpy. If you need a fixed number of points and it is likely that you won't get at least one datum in each bucket, then I'd select further points from the buckets with the greatest number of points in them.
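
    As a rough sketch of that top-up step (this is not GrandFather's posted code; the $numWanted target and the reuse of the @buckets array built above are assumptions for illustration), one might do:

    # Assumes @buckets as built above, with each bucket already sorted by dy
    # (best first), and a fixed target of $numWanted points.
    my $numWanted = 50;

    # Take the best point from every non-empty bucket first ...
    my @selected = map {$_->[0]} grep {defined} @buckets;

    # ... then repeatedly take the next-best unused point from whichever
    # bucket still has the most unused points left.
    my @used = map {defined $_ ? 1 : 0} @buckets;

    while (@selected < $numWanted) {
        my ($best, $bestLeft) = (undef, 0);
        for my $i (0 .. $#buckets) {
            next if !defined $buckets[$i];
            my $left = @{$buckets[$i]} - $used[$i];
            ($best, $bestLeft) = ($i, $left) if $left > $bestLeft;
        }
        last if !$bestLeft;    # every bucket exhausted
        push @selected, $buckets[$best][$used[$best]++];
    }

    If there are more non-empty buckets than $numWanted, @selected simply ends up holding one point per bucket.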

    True laziness is hard work
      The actual data sets are indeed quite lumpy - the source data are thermophysical property measurements, so, for example, sets tend to have a very large number of points near 25 C. I like the bucket-oriented process, though I am not opposed to keeping points that are near each other. Points that are close in space and have low reported uncertainties frequently still disagree with each other, which is why I'd like to keep a fairly large number of points. For now I'm favoring BrowserUk's suggestion, though depending on some empirical testing I might use buckets as an initial pass to guarantee good spatial coverage. In any case, your result looks better than mine.
Re: Picking the best points
by BrowserUk (Patriarch) on Oct 29, 2010 at 14:38 UTC

    To reduce your data set of N points to the required number M, you need to discard the (N-M) "least valuable" points.

    Determination of value must entail two criteria:

    1. If a given point is the only point within some region of the graph, it is valuable.
    2. If a given point is "close" to another point, the less valuable of the two is the one with the greater uncertainty ($dy).

    As there is no fixed ratio of N to M, nor is the distribution "even", the specification of "close" will need to evolve throughout the discard process.

    One approach to this would be to consider the two closest points first and discard the one with the greatest uncertainty.

    Then consider the next closest (now closest) two points, and again discard the less certain of the two.

    Continue until either the target of M points has been reached, or the closest pair are too far apart to be considered close.

    The following code does this. The relevant part of it is the while( @ordered > $RETAIN ) { ... } loop.

    Most of the rest just plots two (offset) graphs of the before (red) and after (green) data to allow me to visualise the results. On those graphs, the size of the circle around each point is the uncertainty (dy) value. In the after graph, the absence of any large circles shows the effectiveness of the run.

    This is the result of using your generator to get 1000 points, which are then reduced to 50:

    c:\test>868223-gen -N=1000 | 868223-plot -RETAIN=50

    The code:
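
    The full listing is not reproduced in this excerpt. Purely as a hedged sketch of the central while( @ordered > $RETAIN ) { ... } loop described above - the input handling here is an assumption, and the plotting is left out - it might look something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $RETAIN = 50;

    # Read x, y, dy triples, one per line, from STDIN
    my @points;
    while (<STDIN>) {
        my ($x, $y, $dy) = split;
        push @points, {x => $x, 'y' => $y, dy => $dy};
    }

    # Keep the points ordered by x; in one dimension the closest pair is
    # always an adjacent pair in this ordering.
    my @ordered = sort {$a->{x} <=> $b->{x}} @points;

    while (@ordered > $RETAIN) {
        # Find the closest adjacent pair
        my $closest = 1;
        for my $i (2 .. $#ordered) {
            $closest = $i
                if $ordered[$i]{x} - $ordered[$i-1]{x}
                 < $ordered[$closest]{x} - $ordered[$closest-1]{x};
        }

        # A distance cutoff could also stop the loop early once the closest
        # pair is no longer "close"; that threshold would need tuning.

        # Discard whichever point of the pair has the greater uncertainty
        my $discard = $ordered[$closest]{dy} > $ordered[$closest-1]{dy}
            ? $closest
            : $closest - 1;
        splice @ordered, $discard, 1;
    }

    print join("\t", @{$_}{qw(x y dy)}), "\n" for @ordered;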


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Picking the best points
by Urthas (Novice) on Oct 29, 2010 at 08:02 UTC
    What version of Perl are you using? I ask because prior to 5.004, srand was not automatically called on the first call to rand. Not that I think that is the issue here, but it's worth mentioning since you rely on rand so heavily. Anyways, my gut feeling is that your distance metric is flawed. I'm no math wizard (I'm no *anything* wizard), but what happens when you use a Manhattan or Euclidean (hypotenuse) distance measure between points, rather than considering only the x-axis?
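
    As a quick sketch of that suggestion (the coords array-ref field is an assumption for illustration - the OP's points store only a single x value):

    # Hypothetical helpers for points shaped like { coords => [$x1, $x2, ...], ... }.
    # Squared Euclidean ("hypotenuse") distance between two such points:
    sub dist_sq {
        my ($p, $q) = @_;
        my $d = 0;
        $d += ($p->{coords}[$_] - $q->{coords}[$_])**2 for 0 .. $#{$p->{coords}};
        return $d;
    }

    # Manhattan distance between two such points:
    sub dist_manhattan {
        my ($p, $q) = @_;
        my $d = 0;
        $d += abs($p->{coords}[$_] - $q->{coords}[$_]) for 0 .. $#{$p->{coords}};
        return $d;
    }

    Either could replace the x-only ($_->{x} - $point->{x})**2 term in the OP's scoring loop once more than one independent variable is present.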
      The actual randomness of the points is irrelevant - this is just sample data. And in this case the data are 1-dimensional - each point has a value and an uncertainty.