srdst13 has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of data that I want to consolidate from a large set into a smaller, cleaner set. The data consists of a set of points on a line, each point having a "quality" associated with it. An interval is simply the distance between two adjacent points. I would like to start with a large number of such points (that I generate) and consolidate them as follows:

  1. Find shortest interval
  2. Choose one of the two points in the interval to remove based on the point quality
  3. After removing the lowest quality point, make a new interval by joining the points adjacent to the removed point
  4. LOOP until set number of intervals obtained

The goal is to produce the highest quality set of points at the most uniform spacing possible. My question, admittedly, is pretty vague. I can pretty easily figure out *a* way to do this, but I will have perhaps a million points, so operations like re-sorting after each iteration will probably be prohibitive. Does anyone have any clever data structures that might be beneficial here?
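For concreteness, the four steps above can be sketched directly, if naively: a linear scan per removal makes this roughly O(n²) overall, so it is only a baseline for comparison, not something for a million points. The data and names here are illustrative toy choices, not my real data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy data: [ position, quality ] pairs, kept sorted by position.
my @points = sort { $a->[0] <=> $b->[0] }
             map  { [ rand 1000, int rand 10 ] } 1 .. 50;

my $target = 10;    # stop when this many points remain

while ( @points > $target ) {

    # 1. Find the shortest interval by linear scan - O(n) per pass.
    my $short = 0;
    for my $i ( 1 .. $#points - 1 ) {
        $short = $i
            if $points[ $i + 1 ][0] - $points[ $i ][0]
             < $points[ $short + 1 ][0] - $points[ $short ][0];
    }

    # 2./3. Of the interval's two end points, drop the lower-quality
    # one; splicing it out implicitly merges its two intervals.
    my $drop = $points[ $short ][1] <= $points[ $short + 1 ][1]
             ? $short
             : $short + 1;
    splice @points, $drop, 1;
}    # 4. Loop until the target count is reached.

print "[ $_->[0], $_->[1] ]\n" for @points;
```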


Thanks,
Sean

Update

I am trying a heap for managing the intervals (so that I can quickly choose the shortest interval). I am using two basic objects, a "point" and an "interval". The interval has an associated distance along with references to each of the points that form the interval. The "point" object has a quality and a location as well as a reference to each of the one (at the ends of the line) or two intervals. So, I choose the shortest interval, choose the worse of the two points in that interval in terms of quality, and then do some reference swapping to remove the old interval and point. Finally, one of the adjacent intervals is adjusted to bridge the gap left by removing the point and distance is recalculated. Heap::Simple::XS allows one to use a method call to arrange to get the top interval (shortest) from the heap, so at the expense of a bit of efficiency, simply altering the affected interval suffices to have it end up in the correct location on the heap.
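A runnable sketch of that scheme, with a hand-rolled binary min-heap standing in for Heap::Simple::XS (so it needs no non-core modules) and lazy "dead" flags standing in for the module's in-place re-ordering. All names here are illustrative, not my actual code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-rolled binary min-heap keyed on interval length (slot 0).
my @heap;

sub heap_push {
    push @heap, $_[0];
    my $i = $#heap;
    while ( $i and $heap[ ( $i - 1 ) >> 1 ][0] > $heap[$i][0] ) {
        @heap[ $i, ( $i - 1 ) >> 1 ] = @heap[ ( $i - 1 ) >> 1, $i ];
        $i = ( $i - 1 ) >> 1;
    }
}

sub heap_pop {
    my $top  = $heap[0];
    my $tail = pop @heap;
    if (@heap) {
        $heap[0] = $tail;
        my $i = 0;
        while (1) {
            my $s = $i;
            for my $c ( 2 * $i + 1, 2 * $i + 2 ) {
                $s = $c if $c < @heap and $heap[$c][0] < $heap[$s][0];
            }
            last if $s == $i;
            @heap[ $i, $s ] = @heap[ $s, $i ];
            $i = $s;
        }
    }
    return $top;
}

# Points as hashes; intervals as [ length, left_point, right_point, dead ].
my @pts = map  { { pos => $_->[0], qual => $_->[1] } }
          sort { $a->[0] <=> $b->[0] }
          map  { [ rand 1000, int rand 10 ] } 1 .. 100;

for my $i ( 0 .. $#pts - 1 ) {
    my $iv = [ $pts[ $i + 1 ]{pos} - $pts[$i]{pos}, $pts[$i], $pts[ $i + 1 ], 0 ];
    $pts[$i]{right} = $pts[ $i + 1 ]{left} = $iv;
    heap_push($iv);
}

my $want = 10;          # target interval count
my $live = @pts - 1;    # intervals currently alive

while ( $live > $want ) {
    my $iv = heap_pop();
    next if $iv->[3];    # stale entry: already superseded, skip it

    # The worse of the shortest interval's two points is removed ...
    my ( $l, $r ) = @{$iv}[ 1, 2 ];
    my $victim = $l->{qual} <= $r->{qual} ? $l : $r;
    $victim->{gone} = 1;

    # ... its one or two intervals are flagged dead, and the gap bridged.
    my ( $li, $ri ) = ( $victim->{left}, $victim->{right} );
    $_->[3] = 1 for grep defined, $li, $ri;

    my $lp = $li ? $li->[1] : undef;    # surviving neighbour on the left
    my $rp = $ri ? $ri->[2] : undef;    # surviving neighbour on the right

    if ( $lp and $rp ) {
        my $new = [ $rp->{pos} - $lp->{pos}, $lp, $rp, 0 ];
        $lp->{right} = $rp->{left} = $new;
        heap_push($new);
    }
    else {                              # victim was an end point
        $lp->{right} = undef if $lp;
        $rp->{left}  = undef if $rp;
    }
    --$live;    # net change is always minus one interval
}

my @kept = grep { !$_->{gone} } @pts;
print scalar @kept, " points remain, $live intervals\n";
```

With a real heap that supports decrease/increase-key via a method call, the two dead intervals would instead be collapsed into one adjusted entry, as described above; the lazy-flag variant just trades a little heap growth for simplicity.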

When I get the code completed, I'll flesh out some details here, but with simple mock data, 1e6 intervals is quite doable in a matter of minutes, which is acceptable.

Thanks to all for the various suggestions.


Re: Points on a line and associated intervals
by BrowserUk (Patriarch) on May 18, 2006 at 03:43 UTC
    The goal is to produce the highest quality set of points at the most uniform spacing possible.

    Your algorithm as described will not achieve this. It will effectively (though not quite) select the 10 highest quality points, but do it in a very complicated way.

    Although you are selecting the smallest interval to discard a point from each time, when you consolidate the remaining point to a new interval, you are taking no account of the relative size of the new intervals you are creating.

    This is a (fairly crude) implementation of your algorithm, and the results it typically produces from a randomly generated set of data. I've marked the values retained from the input dataset with an asterisk:

    #! perl -slw
    use strict;
    use Data::Dumper;
    use List::Util qw[ reduce first ];

    our $N   ||= 100;
    our $MIN ||= 10;

    my @data = sort { $a->[ 0 ] <=> $b->[ 0 ] }
               map  { [ rand 1000, int rand 10 ] } 1 .. $N;
    print Dumper \@data;

    my @intervals;
    reduce {
        push @intervals, [ abs( $data[ $a ][ 0 ] - $data[ $b ][ 0 ] ), $a, $b ];
        $b;
    } 0 .. $#data;

    @intervals = sort { $a->[0] <=> $b->[0] } @intervals;

    while( grep( defined, @data ) > $MIN ) {
        my( $i, $x, $y ) = @{ shift @intervals };
        next unless defined $data[ $x ] and defined $data[ $y ];

        my $discard = $data[ $x ][ 1 ] < $data[ $y ][ 1 ] ? $x : $y;
        delete $data[ $discard ];

        if( $discard == $x ) {
            --$x while $x > 0 and not defined $data[ $x ];
            next if $x == 0 or not defined $data[ $x ];
        }
        else {
            ++$y while $y < $#data and not defined $data[ $y ];
            next if $y == $#data or not defined $data[ $y ];
        }

        ## Note: the interval length needs the positions, not the array refs.
        my $newInterval = [ abs( $data[ $x ][ 0 ] - $data[ $y ][ 0 ] ), $x, $y ];
        my $insertPoint = first { $intervals[ $_ ][0] > $newInterval->[0] } 0 .. $#intervals;
        $insertPoint = $#intervals unless defined $insertPoint;
        splice @intervals, $insertPoint, 0, $newInterval;
    }

    @data = grep defined, @data;
    print Dumper \@data;

    __END__
    $VAR1 = [
      [ '10.101318359375', 0 ], [ '15.472412109375', 8 ], * [ '21.209716796875', 9 ],
      [ '41.80908203125', 1 ], [ '45.379638671875', 0 ], * [ '72.6318359375', 7 ],
      [ '80.047607421875', 1 ], [ '89.019775390625', 0 ], * [ '94.90966796875', 8 ],
      [ '101.470947265625', 6 ], [ '133.36181640625', 5 ], [ '142.913818359375', 6 ],
      [ '151.2451171875', 6 ], [ '168.15185546875', 2 ], [ '169.403076171875', 7 ],
      [ '174.01123046875', 5 ], [ '182.037353515625', 8 ], [ '185.9130859375', 7 ],
      [ '223.57177734375', 1 ], [ '231.99462890625', 7 ], [ '236.02294921875', 5 ],
      [ '262.359619140625', 3 ], [ '263.031005859375', 7 ], [ '279.1748046875', 1 ],
      [ '280.303955078125', 0 ], [ '289.73388671875', 8 ], [ '308.41064453125', 3 ],
      [ '317.657470703125', 1 ], [ '324.5849609375', 6 ], * [ '341.033935546875', 9 ],
      [ '343.170166015625', 0 ], [ '353.271484375', 3 ], [ '355.438232421875', 6 ],
      [ '371.27685546875', 3 ], [ '379.241943359375', 3 ], [ '382.62939453125', 0 ],
      * [ '386.23046875', 9 ], [ '389.34326171875', 3 ], [ '394.47021484375', 0 ],
      [ '399.4140625', 1 ], [ '407.440185546875', 2 ], [ '425.201416015625', 6 ],
      [ '426.361083984375', 0 ], [ '438.65966796875', 3 ], [ '447.93701171875', 5 ],
      [ '455.810546875', 7 ], [ '457.45849609375', 1 ], [ '465.51513671875', 5 ],
      [ '468.8720703125', 6 ], [ '475.799560546875', 4 ], [ '491.14990234375', 5 ],
      [ '496.2158203125', 9 ], [ '503.570556640625', 7 ], [ '511.566162109375', 2 ],
      [ '543.3349609375', 9 ], [ '561.767578125', 6 ], [ '563.690185546875', 0 ],
      [ '571.8994140625', 2 ], [ '574.89013671875', 3 ], [ '595.21484375', 7 ],
      [ '600.03662109375', 3 ], [ '608.062744140625', 1 ], [ '615.081787109375', 8 ],
      [ '631.195068359375', 9 ], [ '634.33837890625', 2 ], [ '638.031005859375', 5 ],
      [ '639.251708984375', 7 ], [ '672.30224609375', 8 ], [ '672.698974609375', 5 ],
      [ '678.64990234375', 3 ], [ '690.73486328125', 7 ], [ '710.63232421875', 5 ],
      [ '718.475341796875', 4 ], [ '761.3525390625', 2 ], [ '770.69091796875', 9 ],
      [ '790.6494140625', 1 ], * [ '796.417236328125', 8 ], [ '807.647705078125', 4 ],
      [ '809.38720703125', 1 ], [ '809.9365234375', 3 ], * [ '813.873291015625', 5 ],
      [ '819.76318359375', 4 ], [ '821.6552734375', 4 ], [ '833.80126953125', 5 ],
      * [ '843.170166015625', 8 ], [ '843.505859375', 2 ], [ '861.99951171875', 4 ],
      [ '865.6005859375', 5 ], [ '866.14990234375', 6 ], [ '871.124267578125', 4 ],
      * [ '890.960693359375', 9 ], [ '911.102294921875', 1 ], [ '912.353515625', 7 ],
      [ '919.708251953125', 1 ], [ '922.91259765625', 8 ], [ '956.695556640625', 2 ],
      [ '973.388671875', 5 ], * [ '977.142333984375', 8 ], [ '993.438720703125', 8 ],
      [ '998.1689453125', 5 ]
    ];
    $VAR1 = [
      [ '21.209716796875', 9 ], [ '72.6318359375', 7 ], [ '94.90966796875', 8 ],
      [ '341.033935546875', 9 ], [ '386.23046875', 9 ], [ '796.417236328125', 8 ],
      [ '813.873291015625', 5 ], [ '843.170166015625', 8 ], [ '890.960693359375', 9 ],
      [ '977.142333984375', 8 ]
    ];

    As you can see, not quite the results you are after.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Although you are selecting the smallest interval to discard a point from each time, when you consolidate the remaining point to a new interval, you are taking no account of the relative size of the new intervals you are creating.

      I didn't explain things properly, then. At each iteration, I am taking the smallest interval of the remaining intervals where the two old intervals have been appropriately collapsed to a new interval after each point removal. Sorry if I wasn't clear.

        That was clear. The code I posted does that by insertion sorting the new interval back into the sorted list of remaining intervals. Still, the result is the selection of the 10 highest quality points via a very slow algorithm. It will always favour quality over distance, and regularity of spacing is never considered.


Re: Points on a line and associated intervals
by BrowserUk (Patriarch) on May 18, 2006 at 06:17 UTC

    An algorithm that may approximate your requirements is to divide the dataset into 10 buckets of equal range, then sort each bucket and select the 'best' point from each group.

    To approximate my reading of your criteria, I've sorted the pairs in each group by their distance from the group median divided by their weight (quality). This tends to favour quality over distance. I also tried the square of the distance divided by the weight, which tends to strongly favour proximity to the median over quality. The sort can easily be adjusted to produce the results that best suit your purpose.

    It handles 1e4 in < 1 sec; 1e5 in ~5 secs; 5e5 in ~38 secs; and 1e6 in ~97 secs; which is close enough for me to consider the implementation roughly O(n) in time. It will vary with the ratio of data to selections to some degree, though as the number of selections grows, the individual sorts get smaller, which tends to compensate. It might benefit from a more efficient sort than the ST I've used.

    #! perl -slw
    use strict;
    use Data::Dumper;
    use List::Util qw[ reduce first ];

    our $N   ||= 100;
    our $MIN ||= 10;

    my @data = sort { $a->[ 0 ] <=> $b->[ 0 ] }
               map  { [ rand 1000, 1 + int rand 9 ] } 1 .. $N;
    #print Dumper \@data;

    my $min = reduce { $a->[0] < $b->[ 0 ] ? $a : $b } @data;
    my $max = reduce { $a->[0] > $b->[ 0 ] ? $a : $b } @data;
    my $gap = ( $max->[ 0 ] - $min->[ 0 ] ) / $MIN;
    #print "$min->[0]:$max->[0]:$gap";

    my @buckets;
    push @{ $buckets[ ( $data[ $_ ][ 0 ] - $min->[ 0 ] - 1 ) / $gap ] }, $data[ $_ ]
        for 0 .. $#data;
    #print Dumper \@buckets;

    my @selected;
    for my $bucket ( 0 .. $#buckets ) {
        my $mid = ( ( $bucket + 1 ) * $gap ) - ( $gap / 2 ) + $min->[ 0 ];
        $selected[ $bucket ] = (
            map  { $_->[ 1 ] }
            sort { $a->[ 0 ] <=> $b->[ 0 ] }
            map  { [ abs( $_->[ 0 ] - $mid ) / ( $_->[ 1 ] || 1 ), $_ ] }
            @{ $buckets[ $bucket ] }
        )[ 0 ];
    }

    my $next  = 0;
    my $group = $min->[ 0 ];
    printf "\nGroup: %f - %f; median: %f\n",
        $group, $group + $gap, $group + ( $gap / 2 );
    $group += $gap;

    for my $idx ( 0 .. $#data ) {
        if( $data[ $idx ][ 0 ] > $group ) {
            printf "\nGroup: %f - %f; median: %f\n",
                $group, $group + $gap, $group + ( $gap / 2 );
            $group += $gap;
        }
        my $selected = ' ';
        if( $next < @selected and $data[ $idx ][ 0 ] == $selected[ $next ][ 0 ] ) {
            $selected = $next;
            ++$next;
        }
        printf "%1s\t[ @{ $data[ $idx ] } ]\n", $selected;
    }

    In the results, the randomly chosen values (0..1000) and weights (1..9) are displayed in equal interval groups, with the selected pair from each group marked in the first column.

Re: Points on a line and associated intervals (metrics)
by tye (Sage) on May 18, 2006 at 07:02 UTC

    It helps to come up with some metrics that indicate how good a solution is to you. What is the disadvantage of having wide intervals? Can you calculate a number representing how much that hurts your results, and numbers for how much the quality of the points affects the value of a particular solution? Even if you don't get great metrics, just attempting to may help you understand more precisely how you want to balance these goals against each other.

    I might approach this problem from "the opposite" direction. Add the points one-at-a-time, in order from highest to lowest quality. As you go, keep track of the largest interval left. When the largest interval is close enough to $length/($finalPointCount+1) (perhaps as a function of the quality of the worst point required), then switch to removing points.

    Starting at the worst point, remove it only if doing so wouldn't create an interval larger than desired (so you'd never remove the last point added). Move on to the next-worst point...

    I don't think you can absolutely predict how many points you'll have at the end of such a run. But you can specify the minimum interval size you'd get and the minimum point quality.
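    One possible shape for that add-then-prune pass. The stopping threshold and target count below are placeholders of mine, not part of the description above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $length = 1000;
my @all = sort { $b->[1] <=> $a->[1] }           # best quality first
          map  { [ rand $length, int rand 10 ] } 1 .. 1000;

my $final  = 50;                                 # rough final point count
my $maxgap = 1.5 * $length / ( $final + 1 );     # widest gap we'll accept

# Phase 1: add points best-first until no gap exceeds $maxgap.
my @kept;
for my $pt (@all) {
    my $i = 0;
    $i++ while $i < @kept and $kept[$i][0] < $pt->[0];
    splice @kept, $i, 0, $pt;                    # keep @kept position-sorted

    my $big = $kept[0][0];                       # gap to the left end
    $big = $length - $kept[-1][0] if $length - $kept[-1][0] > $big;
    for my $j ( 1 .. $#kept ) {
        my $g = $kept[$j][0] - $kept[ $j - 1 ][0];
        $big = $g if $g > $big;
    }
    last if $big <= $maxgap;                     # gaps small enough: stop
}

# Phase 2: walk worst-to-best, pruning a point only when removing it
# would not open a gap wider than $maxgap.
for my $pt ( sort { $a->[1] <=> $b->[1] } @kept ) {
    my ($i) = grep { $kept[$_] == $pt } 0 .. $#kept;
    next unless defined $i;
    my $lo = $i ? $kept[ $i - 1 ][0] : 0;
    my $hi = $i < $#kept ? $kept[ $i + 1 ][0] : $length;
    splice @kept, $i, 1 if $hi - $lo <= $maxgap;
}
print scalar @kept, " points kept; no gap wider than $maxgap\n";
```

    As noted, the final point count falls out of the run rather than being specified exactly; what is guaranteed is the maximum gap and a quality floor.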

    A few other methods come to mind but they are all motivated by possible metrics I've guessed at (such as, is it relatively okay to have a large interval if it is "next to" a really high quality point?) and I don't know which metrics make sense for your problem, so I'll wait for more info at this point.

    - tye        

Re: Points on a line and associated intervals
by TedPride (Priest) on May 18, 2006 at 03:05 UTC
    Use a heap structure, where the "largest" item is the item with the shortest interval. For however many intervals you want to remove, then, all you have to do is take the top interval from the heap, modify the interval on the side you're removing the point from (you'll have links to adjacent intervals for each interval), and reheap it to its new, higher position. The interval on the other side will just change its link to the interval on this side, since the interval is now adjacent to it.

    Each interval removal should take at worst (and probably not even close to this) O(lg n), so the total complexity is O(s lg n) where s is the set number of intervals to remove, plus of course O(n lg n) to build the original heap. Easily manageable for millions of intervals.

    Or you could calculate where each remaining point needs to be on your line, given its total length and the number of points you want, then choose the point closest to that position (or weight the points nearest to it and come up with an average) and use that instead.
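    A small sketch of that slot idea (the slot formula is one reading of it; note it ignores quality, and it can select the same point for two adjacent slots when the data is sparse):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $length = 1000;
my $k      = 10;                                 # points to keep
my @pts    = sort { $a <=> $b } map { rand $length } 1 .. 200;

my @chosen;
for my $slot ( 1 .. $k ) {
    my $ideal = $slot * $length / ( $k + 1 );    # perfectly uniform spot
    my $best  = $pts[0];
    for (@pts) {
        $best = $_ if abs( $_ - $ideal ) < abs( $best - $ideal );
    }
    push @chosen, $best;                         # nearest real point wins
}
print "$_\n" for @chosen;
```

    The "weight the points nearest to it" variant would replace the plain distance with distance scaled by quality before picking the winner for each slot.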

Re: Points on a line and associated intervals
by turo (Friar) on May 18, 2006 at 01:56 UTC

    i'm sleepy, but here it goes

    1. set an array of N elements {distance=-1; point1; point2}
    2. for each element in your list
      1. take two adjacent elements and calculate their distance
      2. compare the distance with the first element of the array:
        1. if the array element's distance is -1, store your distance and the info for your two adjacent points
        2. if your distance is lower, rotate the array to the right and put your element in the first position. Your array will look like [new_element, old 1, old 2, .., old n-1]
        3. if your distance is higher, compare with the next element of the array and repeat from step 2.2.

    When you finish, you will have (i think) an array of 10 elements (at complexity O(n)), which are the ones you are searching for (supposing, of course, that your points are consecutive ... umm ... i'm wondering if i really grasp the problem ... i'll go to bed)

    Good night

    perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
Re: Points on a line and associated intervals
by GrandFather (Saint) on May 18, 2006 at 01:56 UTC

    It would probably help a lot if you provided a sample data set (data pairs, I guess, to provide the position and quality values) and an indication of a "good" set of result points.

    On the face of it, one approach may be to bin the points, then select a point for each bin based on a "distance from bin centre" weight and the quality of each point in the bin. That would get the distribution about right, by selecting lower quality points as needed to maintain a reasonable distribution.


    DWIM is Perl's answer to Gödel