srdst13 has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of data that I want to consolidate from a large set into a smaller, cleaner set. The data consists of a set of points on a line, each point having a "quality" associated with it. An interval is simply the distance between two adjacent points. I would like to start with a large number of such points (that I generate) and consolidate them as follows:

  1. Find shortest interval
  2. Choose one of the two points in the interval to remove based on the point quality
  3. After removing the lowest quality point, make a new interval by joining the points adjacent to the removed point
  4. LOOP until set number of intervals obtained

The goal is to produce the highest quality set of points at the most uniform spacing possible. My question, admittedly, is pretty vague. I can pretty easily figure out *a* way to do this, but I will have perhaps a million points, so operations like re-sorting after each iteration will probably be prohibitive. Does anyone have any clever data structures that might be beneficial here?
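For concreteness, the four steps above can be sketched directly, if naively: a linear scan per removal makes this roughly O(n²) overall, so it is only a baseline for comparison, not something for a million points. The data and names here are illustrative toy choices, not my real data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy data: [ position, quality ] pairs, kept sorted by position.
my @points = sort { $a->[0] <=> $b->[0] }
             map  { [ rand 1000, int rand 10 ] } 1 .. 50;

my $target = 10;    # stop when this many points remain

while ( @points > $target ) {

    # 1. Find the shortest interval by linear scan - O(n) per pass.
    my $short = 0;
    for my $i ( 1 .. $#points - 1 ) {
        $short = $i
            if $points[ $i + 1 ][0] - $points[ $i ][0]
             < $points[ $short + 1 ][0] - $points[ $short ][0];
    }

    # 2./3. Of the interval's two end points, drop the lower-quality
    # one; splicing it out implicitly merges its two intervals.
    my $drop = $points[ $short ][1] <= $points[ $short + 1 ][1]
             ? $short
             : $short + 1;
    splice @points, $drop, 1;
}    # 4. Loop until the target count is reached.

print "[ $_->[0], $_->[1] ]\n" for @points;
```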


Thanks,
Sean

Update

I am trying a heap for managing the intervals (so that I can quickly choose the shortest interval). I am using two basic objects, a "point" and an "interval". The interval has an associated distance along with references to each of the points that form the interval. The "point" object has a quality and a location as well as a reference to each of the one (at the ends of the line) or two intervals. So, I choose the shortest interval, choose the worse of the two points in that interval in terms of quality, and then do some reference swapping to remove the old interval and point. Finally, one of the adjacent intervals is adjusted to bridge the gap left by removing the point and distance is recalculated. Heap::Simple::XS allows one to use a method call to arrange to get the top interval (shortest) from the heap, so at the expense of a bit of efficiency, simply altering the affected interval suffices to have it end up in the correct location on the heap.
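A runnable sketch of that scheme, with a hand-rolled binary min-heap standing in for Heap::Simple::XS (so it needs no non-core modules) and lazy "dead" flags standing in for the module's in-place re-ordering. All names here are illustrative, not my actual code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-rolled binary min-heap keyed on interval length (slot 0).
my @heap;

sub heap_push {
    push @heap, $_[0];
    my $i = $#heap;
    while ( $i and $heap[ ( $i - 1 ) >> 1 ][0] > $heap[$i][0] ) {
        @heap[ $i, ( $i - 1 ) >> 1 ] = @heap[ ( $i - 1 ) >> 1, $i ];
        $i = ( $i - 1 ) >> 1;
    }
}

sub heap_pop {
    my $top  = $heap[0];
    my $tail = pop @heap;
    if (@heap) {
        $heap[0] = $tail;
        my $i = 0;
        while (1) {
            my $s = $i;
            for my $c ( 2 * $i + 1, 2 * $i + 2 ) {
                $s = $c if $c < @heap and $heap[$c][0] < $heap[$s][0];
            }
            last if $s == $i;
            @heap[ $i, $s ] = @heap[ $s, $i ];
            $i = $s;
        }
    }
    return $top;
}

# Points as hashes; intervals as [ length, left_point, right_point, dead ].
my @pts = map  { { pos => $_->[0], qual => $_->[1] } }
          sort { $a->[0] <=> $b->[0] }
          map  { [ rand 1000, int rand 10 ] } 1 .. 100;

for my $i ( 0 .. $#pts - 1 ) {
    my $iv = [ $pts[ $i + 1 ]{pos} - $pts[$i]{pos}, $pts[$i], $pts[ $i + 1 ], 0 ];
    $pts[$i]{right} = $pts[ $i + 1 ]{left} = $iv;
    heap_push($iv);
}

my $want = 10;          # target interval count
my $live = @pts - 1;    # intervals currently alive

while ( $live > $want ) {
    my $iv = heap_pop();
    next if $iv->[3];    # stale entry: already superseded, skip it

    # The worse of the shortest interval's two points is removed ...
    my ( $l, $r ) = @{$iv}[ 1, 2 ];
    my $victim = $l->{qual} <= $r->{qual} ? $l : $r;
    $victim->{gone} = 1;

    # ... its one or two intervals are flagged dead, and the gap bridged.
    my ( $li, $ri ) = ( $victim->{left}, $victim->{right} );
    $_->[3] = 1 for grep defined, $li, $ri;

    my $lp = $li ? $li->[1] : undef;    # surviving neighbour on the left
    my $rp = $ri ? $ri->[2] : undef;    # surviving neighbour on the right

    if ( $lp and $rp ) {
        my $new = [ $rp->{pos} - $lp->{pos}, $lp, $rp, 0 ];
        $lp->{right} = $rp->{left} = $new;
        heap_push($new);
    }
    else {                              # victim was an end point
        $lp->{right} = undef if $lp;
        $rp->{left}  = undef if $rp;
    }
    --$live;    # net change is always minus one interval
}

my @kept = grep { !$_->{gone} } @pts;
print scalar @kept, " points remain, $live intervals\n";
```

With a real heap that supports decrease/increase-key via a method call, the two dead intervals would instead be collapsed into one adjusted entry, as described above; the lazy-flag variant just trades a little heap growth for simplicity.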

When I get the code completed, I'll flesh out some details here, but with simple mock data, 1e6 intervals is quite doable in a matter of minutes, which is acceptable.

Thanks to all for the various suggestions.


Re: Points on a line and associated intervals
by BrowserUk (Patriarch) on May 18, 2006 at 03:43 UTC
    The goal is to produce the highest quality set of points at the most uniform spacing possible.

    Your algorithm as described will not achieve this. It will effectively (though not quite) select the 10 highest quality points, but do it in a very complicated way.

    Although you are selecting the smallest interval to discard a point from each time, when you consolidate the remaining point to a new interval, you are taking no account of the relative size of the new intervals you are creating.

    This is a (fairly crude) implementation of your algorithm, and the results it typically produces from a randomly generated set of data. I've marked the values retained from the input dataset with an asterisk:

    #! perl -slw
    use strict;
    use Data::Dumper;
    use List::Util qw[ reduce first ];

    our $N   ||= 100;
    our $MIN ||= 10;

    my @data = sort { $a->[ 0 ] <=> $b->[ 0 ] }
               map  { [ rand 1000, int rand 10 ] } 1 .. $N;
    print Dumper \@data;

    my @intervals;
    reduce {
        push @intervals, [ abs( $data[ $a ][ 0 ] - $data[ $b ][ 0 ] ), $a, $b ];
        $b;
    } 0 .. $#data;

    @intervals = sort { $a->[0] <=> $b->[0] } @intervals;

    while( grep( defined, @data ) > $MIN ) {
        my( $i, $x, $y ) = @{ shift @intervals };
        next unless defined $data[ $x ] and defined $data[ $y ];

        my $discard = $data[ $x ][ 1 ] < $data[ $y ][ 1 ] ? $x : $y;
        delete $data[ $discard ];

        if( $discard == $x ) {
            --$x while $x > 0 and not defined $data[ $x ];
            next if $x == 0 or not defined $data[ $x ];
        }
        else {
            ++$y while $y < $#data and not defined $data[ $y ];
            next if $y == $#data or not defined $data[ $y ];
        }

        ## Note: the interval length needs the positions, not the array refs.
        my $newInterval = [ abs( $data[ $x ][ 0 ] - $data[ $y ][ 0 ] ), $x, $y ];
        my $insertPoint = first { $intervals[ $_ ][0] > $newInterval->[0] } 0 .. $#intervals;
        $insertPoint = $#intervals unless defined $insertPoint;
        splice @intervals, $insertPoint, 0, $newInterval;
    }

    @data = grep defined, @data;
    print Dumper \@data;

    __END__
    $VAR1 = [
      [ '10.101318359375', 0 ], [ '15.472412109375', 8 ], * [ '21.209716796875', 9 ],
      [ '41.80908203125', 1 ], [ '45.379638671875', 0 ], * [ '72.6318359375', 7 ],
      [ '80.047607421875', 1 ], [ '89.019775390625', 0 ], * [ '94.90966796875', 8 ],
      [ '101.470947265625', 6 ], [ '133.36181640625', 5 ], [ '142.913818359375', 6 ],
      [ '151.2451171875', 6 ], [ '168.15185546875', 2 ], [ '169.403076171875', 7 ],
      [ '174.01123046875', 5 ], [ '182.037353515625', 8 ], [ '185.9130859375', 7 ],
      [ '223.57177734375', 1 ], [ '231.99462890625', 7 ], [ '236.02294921875', 5 ],
      [ '262.359619140625', 3 ], [ '263.031005859375', 7 ], [ '279.1748046875', 1 ],
      [ '280.303955078125', 0 ], [ '289.73388671875', 8 ], [ '308.41064453125', 3 ],
      [ '317.657470703125', 1 ], [ '324.5849609375', 6 ], * [ '341.033935546875', 9 ],
      [ '343.170166015625', 0 ], [ '353.271484375', 3 ], [ '355.438232421875', 6 ],
      [ '371.27685546875', 3 ], [ '379.241943359375', 3 ], [ '382.62939453125', 0 ],
      * [ '386.23046875', 9 ], [ '389.34326171875', 3 ], [ '394.47021484375', 0 ],
      [ '399.4140625', 1 ], [ '407.440185546875', 2 ], [ '425.201416015625', 6 ],
      [ '426.361083984375', 0 ], [ '438.65966796875', 3 ], [ '447.93701171875', 5 ],
      [ '455.810546875', 7 ], [ '457.45849609375', 1 ], [ '465.51513671875', 5 ],
      [ '468.8720703125', 6 ], [ '475.799560546875', 4 ], [ '491.14990234375', 5 ],
      [ '496.2158203125', 9 ], [ '503.570556640625', 7 ], [ '511.566162109375', 2 ],
      [ '543.3349609375', 9 ], [ '561.767578125', 6 ], [ '563.690185546875', 0 ],
      [ '571.8994140625', 2 ], [ '574.89013671875', 3 ], [ '595.21484375', 7 ],
      [ '600.03662109375', 3 ], [ '608.062744140625', 1 ], [ '615.081787109375', 8 ],
      [ '631.195068359375', 9 ], [ '634.33837890625', 2 ], [ '638.031005859375', 5 ],
      [ '639.251708984375', 7 ], [ '672.30224609375', 8 ], [ '672.698974609375', 5 ],
      [ '678.64990234375', 3 ], [ '690.73486328125', 7 ], [ '710.63232421875', 5 ],
      [ '718.475341796875', 4 ], [ '761.3525390625', 2 ], [ '770.69091796875', 9 ],
      [ '790.6494140625', 1 ], * [ '796.417236328125', 8 ], [ '807.647705078125', 4 ],
      [ '809.38720703125', 1 ], [ '809.9365234375', 3 ], * [ '813.873291015625', 5 ],
      [ '819.76318359375', 4 ], [ '821.6552734375', 4 ], [ '833.80126953125', 5 ],
      * [ '843.170166015625', 8 ], [ '843.505859375', 2 ], [ '861.99951171875', 4 ],
      [ '865.6005859375', 5 ], [ '866.14990234375', 6 ], [ '871.124267578125', 4 ],
      * [ '890.960693359375', 9 ], [ '911.102294921875', 1 ], [ '912.353515625', 7 ],
      [ '919.708251953125', 1 ], [ '922.91259765625', 8 ], [ '956.695556640625', 2 ],
      [ '973.388671875', 5 ], * [ '977.142333984375', 8 ], [ '993.438720703125', 8 ],
      [ '998.1689453125', 5 ]
    ];
    $VAR1 = [
      [ '21.209716796875', 9 ], [ '72.6318359375', 7 ], [ '94.90966796875', 8 ],
      [ '341.033935546875', 9 ], [ '386.23046875', 9 ], [ '796.417236328125', 8 ],
      [ '813.873291015625', 5 ], [ '843.170166015625', 8 ], [ '890.960693359375', 9 ],
      [ '977.142333984375', 8 ]
    ];

    As you can see, not quite the results you are after.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Although you are selecting the smallest interval to discard a point from each time, when you consolidate the remaining point to a new interval, you are taking no account of the relative size of the new intervals you are creating.

      I didn't explain things properly, then. At each iteration, I am taking the smallest interval of the remaining intervals where the two old intervals have been appropriately collapsed to a new interval after each point removal. Sorry if I wasn't clear.

        That was clear. The code I posted does that by insertion sorting the new interval back into the sorted list of remaining intervals. Still, the result is the selection of the 10 highest quality points via a very slow algorithm. It will always favour quality over distance, and regularity of spacing is never considered.


Re: Points on a line and associated intervals
by BrowserUk (Patriarch) on May 18, 2006 at 06:17 UTC

    An algorithm that may approximate your requirements is to divide the dataset into 10 buckets of equal range, then sort each bucket and select the 'best' point from each group.

    To approximate my reading of your criteria, I've sorted the pairs in each group by their distance from the group median divided by their weight (quality). This tends to favour quality over distance. I also tried the square of the distance divided by the weight, which tends to strongly favour proximity to the median over quality. The sort can easily be adjusted to produce the results that best suit your purpose.

    It handles 1e4 in < 1 sec; 1e5 in ~5 secs; 5e5 in ~38 secs; and 1e6 in ~97 secs; which is close enough for me to consider the implementation roughly O(n) in time. It will vary with the ratio of data to selections to some degree, though as the number of selections grows, the individual sorts get smaller, which tends to compensate. It might benefit from a more efficient sort than the ST I've used.

    #! perl -slw
    use strict;
    use Data::Dumper;
    use List::Util qw[ reduce first ];

    our $N   ||= 100;
    our $MIN ||= 10;

    my @data = sort { $a->[ 0 ] <=> $b->[ 0 ] }
               map  { [ rand 1000, 1 + int rand 9 ] } 1 .. $N;
    #print Dumper \@data;

    my $min = reduce { $a->[0] < $b->[ 0 ] ? $a : $b } @data;
    my $max = reduce { $a->[0] > $b->[ 0 ] ? $a : $b } @data;
    my $gap = ( $max->[ 0 ] - $min->[ 0 ] ) / $MIN;
    #print "$min->[0]:$max->[0]:$gap";

    my @buckets;
    push @{ $buckets[ ( $data[ $_ ][ 0 ] - $min->[ 0 ] - 1 ) / $gap ] }, $data[ $_ ]
        for 0 .. $#data;
    #print Dumper \@buckets;

    my @selected;
    for my $bucket ( 0 .. $#buckets ) {
        my $mid = ( ( $bucket + 1 ) * $gap ) - ( $gap / 2 ) + $min->[ 0 ];
        $selected[ $bucket ] = (
            map  { $_->[ 1 ] }
            sort { $a->[ 0 ] <=> $b->[ 0 ] }
            map  { [ abs( $_->[ 0 ] - $mid ) / ( $_->[ 1 ] || 1 ), $_ ] }
            @{ $buckets[ $bucket ] }
        )[ 0 ];
    }

    my $next  = 0;
    my $group = $min->[ 0 ];
    printf "\nGroup: %f - %f; median: %f\n",
        $group, $group + $gap, $group + ( $gap / 2 );
    $group += $gap;

    for my $idx ( 0 .. $#data ) {
        if( $data[ $idx ][ 0 ] > $group ) {
            printf "\nGroup: %f - %f; median: %f\n",
                $group, $group + $gap, $group + ( $gap / 2 );
            $group += $gap;
        }
        my $selected = ' ';
        if( $next < @selected and $data[ $idx ][ 0 ] == $selected[ $next ][ 0 ] ) {
            $selected = $next;
            ++$next;
        }
        printf "%1s\t[ @{ $data[ $idx ] } ]\n", $selected;
    }

    In the results, the randomly chosen values (0..1000) and weights (1..9) are displayed in equal interval groups, with the selected pair from each group marked in the first column.

Re: Points on a line and associated intervals (metrics)
by tye (Sage) on May 18, 2006 at 07:02 UTC

    It helps to come up with some metrics that indicate how good a solution is to you. What is the disadvantage of having wide intervals? Can you calculate a number representing how much that hurts your results, and numbers for how much the quality of the points affects the value of a particular solution? Even if you don't get great metrics, just attempting to may help you understand more precisely how you want to balance these goals against each other.

    I might approach this problem from "the opposite" direction. Add the points one-at-a-time, in order from highest to lowest quality. As you go, keep track of the largest interval left. When the largest interval is close enough to $length/($finalPointCount+1) (perhaps as a function of the quality of the worst point required), then switch to removing points.

    Starting at the worst point, remove it only if doing so wouldn't create an interval larger than desired (so you'd never remove the last point added). Move on to the next-worst point...

    I don't think you can absolutely predict how many points you'll have at the end of such a run. But you can specify the minimum interval size you'd get and the minimum point quality.
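    One possible shape for that add-then-prune pass. The stopping threshold and target count below are placeholders of mine, not part of the description above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $length = 1000;
my @all = sort { $b->[1] <=> $a->[1] }           # best quality first
          map  { [ rand $length, int rand 10 ] } 1 .. 1000;

my $final  = 50;                                 # rough final point count
my $maxgap = 1.5 * $length / ( $final + 1 );     # widest gap we'll accept

# Phase 1: add points best-first until no gap exceeds $maxgap.
my @kept;
for my $pt (@all) {
    my $i = 0;
    $i++ while $i < @kept and $kept[$i][0] < $pt->[0];
    splice @kept, $i, 0, $pt;                    # keep @kept position-sorted

    my $big = $kept[0][0];                       # gap to the left end
    $big = $length - $kept[-1][0] if $length - $kept[-1][0] > $big;
    for my $j ( 1 .. $#kept ) {
        my $g = $kept[$j][0] - $kept[ $j - 1 ][0];
        $big = $g if $g > $big;
    }
    last if $big <= $maxgap;                     # gaps small enough: stop
}

# Phase 2: walk worst-to-best, pruning a point only when removing it
# would not open a gap wider than $maxgap.
for my $pt ( sort { $a->[1] <=> $b->[1] } @kept ) {
    my ($i) = grep { $kept[$_] == $pt } 0 .. $#kept;
    next unless defined $i;
    my $lo = $i ? $kept[ $i - 1 ][0] : 0;
    my $hi = $i < $#kept ? $kept[ $i + 1 ][0] : $length;
    splice @kept, $i, 1 if $hi - $lo <= $maxgap;
}
print scalar @kept, " points kept; no gap wider than $maxgap\n";
```

    As noted, the final point count falls out of the run rather than being specified exactly; what is guaranteed is the maximum gap and a quality floor.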

    A few other methods come to mind but they are all motivated by possible metrics I've guessed at (such as, is it relatively okay to have a large interval if it is "next to" a really high quality point?) and I don't know which metrics make sense for your problem, so I'll wait for more info at this point.

    - tye        

Re: Points on a line and associated intervals
by TedPride (Priest) on May 18, 2006 at 03:05 UTC
    Use a heap structure, where the "largest" item is the item with the shortest interval. For however many intervals you want to remove, then, all you have to do is take the top interval from the heap, modify the interval on the side you're removing the point from (you'll have links to adjacent intervals for each interval), and reheap it to its new, higher position. The interval on the other side will just change its link to the interval on this side, since the interval is now adjacent to it.

    Each interval removal should take at worst (and probably not even close to this) O(lg n), so the total complexity is O(s lg n) where s is the set number of intervals to remove, plus of course O(n lg n) to build the original heap. Easily manageable for millions of intervals.

    Or you could calculate where each remaining point needs to be on your line, given its total length and the number of points you want, then choose the point closest to that position (or weight the points nearest to it and come up with an average) and use that instead.
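    A small sketch of that slot idea (the slot formula is one reading of it; note it ignores quality, and it can select the same point for two adjacent slots when the data is sparse):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $length = 1000;
my $k      = 10;                                 # points to keep
my @pts    = sort { $a <=> $b } map { rand $length } 1 .. 200;

my @chosen;
for my $slot ( 1 .. $k ) {
    my $ideal = $slot * $length / ( $k + 1 );    # perfectly uniform spot
    my $best  = $pts[0];
    for (@pts) {
        $best = $_ if abs( $_ - $ideal ) < abs( $best - $ideal );
    }
    push @chosen, $best;                         # nearest real point wins
}
print "$_\n" for @chosen;
```

    The "weight the points nearest to it" variant would replace the plain distance with distance scaled by quality before picking the winner for each slot.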

Re: Points on a line and associated intervals
by turo (Friar) on May 18, 2006 at 01:56 UTC

    i'm sleepy, but here it goes

    1. set an array of N elements {distance=-1; point1; point2}
    2. for each element in your list
      1. take two adjacent elements and calculate their distance
      2. compare the distance with the first element of the array:
        1. if the array element's distance is -1, store your distance and the info for your two adjacent points
        2. if your distance is lower, rotate the array to the right and put your element in the first position. Your array will look like [new_element, old 1, old 2, .., old n-1]
        3. if your distance is higher, compare with the next element of the array and repeat from step 2.2.

    When you finish, you will have (i think) an array of 10 elements (at complexity O(n)), which are the ones you are searching for (supposing, of course, that your points are consecutive ... umm ... i'm wondering if i really grasp the problem ... i'll go to bed)

    Good night

    perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
Re: Points on a line and associated intervals
by GrandFather (Saint) on May 18, 2006 at 01:56 UTC

    It would probably help a lot if you provided a sample data set (data pairs, I guess, to provide the position and quality values) and an indication of a "good" set of result points.

    On the face of it, one approach may be to bin the points, then select a point for each bin based on a "distance from bin centre" weight and the quality of each point in the bin. That would get the distribution about right, by selecting lower quality points as needed to maintain a reasonable distribution.


    DWIM is Perl's answer to Gödel