olepi has asked for the wisdom of the Perl Monks concerning the following question:

I need to code a special form of binary search. In this one, the array contents aren't known yet. To get a value for the array means running a process that takes 1 minute to complete. The time is spent creating the values, not in the search. I want to skip as many values as possible, only running the ones that fall into the window.

The special part comes in because there is an upper limit, and a lower limit. If the value is above the upper limit, I know that all values higher than this don't need to be calculated, they can all be set to MAX. Any value below the lower limit will be set to MIN, and all values less will be set to MIN. No need to get the actual value.

Given an array of 100 elements, what is the fastest way to find the window of values to process, while skipping the ones we don't need to process? The result is a sorted array, with MIN values up to the start of the window, and MAX values after the end of the window. How to find the minimum set of values to process that are in the window?

I doubt there is any Perl module to do this.

Basic algorithm would be something like:

Start at the middle. Run the process on the middle element. If the value is below the minimum, all the array up to that value can be set to MIN, no process needs to be run for those elements. If the value is above the maximum, all the array after that can be set to MAX, no process needs to be run for those elements. If the value is between minimum and maximum, then store the value in the array.

Repeat this until you find all the values between the minimum and maximum, and try to run as few processes as possible.

If the min and max are 1 and 10, for example, and the final array looks like 0,0,0,...1,2,3,4,5,10,101,102,103... How do I process only the 1,2,3,4,5,10 entries and skip as many of the others as possible?

The runtime savings is the goal. There are 100 elements per array, and 12 arrays per set, and 5,000 sets.

any help appreciated, thanks, Bill P

Replies are listed 'Best First'.
Re: Special binary search
by Corion (Patriarch) on Jun 22, 2021 at 14:49 UTC

    I'm not sure I understood your question correctly, but if I did, maybe your search is just two binary searches, one for the minimal element next larger than MIN and one for the maximal element smaller than MAX?

    Once you have found these elements, you know the remaining elements for which you need to run your slow process.

    Maybe if you show more examples and code, I understand where the problem is. Maybe this is enough of a testbed?

    #!perl use strict; use warnings; use feature 'experimental::signatures'; use Memoize; my @values = (0,0,0,1,2,3,4,5,10,101,102,103); my $MIN = 1; my $MAX = 10; # this is the slow part: sub get_element_value( $index ) { sleep 1; return $values[$index] }; # Cache indices that we have fetched once already memoize 'get_element_value'; sub get_index_of_element_larger_than( $value, $low, $high ) { my $next_try = int(($low+$high) / 2); my $v = get_element_value( $next_try ); if( $v > $value ) { return get_index_of_element_larger_than( $low, $next_try-1 ); } else { return get_index_of_element_larger_than( $next_try+1, $high ); } }
      This looks good, thanks!

      Yes, two binary searches is what is needed. Once I find the two values bordering the window, I can then just run all the elements between them.

      Fantastic! -- Bill P

        Since you want all of the values within the window, there is no need to do two searches. Once you find one value in the window you can get sequential values, both higher and lower, until get the ends of the window.
Re: Special binary search
by davido (Cardinal) on Jun 22, 2021 at 18:36 UTC

    A few suggestions:

    • The List::BinarySearch module contains a function called binsearch_range that will return the index of the min and max you set as targets. I think that's what you're looking for. If it weren't for my next suggestion, you could probably use it out of the box.
    • You could tie your array to a class that generates its elements only upon FETCH. This would allow List::BinarySearch to just work, while avoiding spending time generating values that aren't needed. Unfortunately tie doesn't play well with binsearch_range, so we have to use the binsearch function to implement our own ranger.
    • Additionally, your tied array could cache any values previously FETCHed. This may have minimal value, though. It could save a few comparisons in bad cases. In good cases, there wouldn't be any cache hits.
    • An alternative to tie is to make each element an instance of a class that only calculates the value if it is needed. This would eliminate the need for tie. More on this toward the end of my post...

    So here's the problem. tie doesn't work nicely with List::BinarySearch's binsearch_range function. It has something to do with that module's usage of package globals $a and $b. So we have to get a little creative, and implement our own binsearch_range. Observe the following code:

    #!/usr/bin/env perl package MyArray; use Tie::Array; use strict; use warnings; our @ISA = qw(Tie::StdArray); my %inside_out; sub FETCH { my ($self, $index) = @_; if (!defined($inside_out{"$self"}->{'cache'}->[$index])) { my $count = ++$inside_out{"$self"}->{'cache_miss_count'}; warn "FETCH (miss) $count: $index\n"; sleep 1; $inside_out{"$self"}->{'cache'}->[$index] = $self->[$index]; } else { my $count = ++$inside_out{"$self"}->{'cache_hit_count'}; warn "FETCH (hit) $count: $index\n"; } return $inside_out{"$self"}->{'cache'}->[$index]; } 1; package main; use strict; use warnings; use List::BinarySearch qw(binsearch); my ($min_sample_index, $max_sample_index) = sort {$a <=> $b} map {int(rand(100))} (0,1); # Generate two random indices in low to high order. my ($min_find_value, $max_find_value); tie my @array, 'MyArray'; my $curr_value = 0; foreach (1 .. 100) { $curr_value += int(rand(10)); push @array, $curr_value; $min_find_value = $curr_value if $_ == $min_sample_index; $max_find_value = $curr_value if $_ == $max_sample_index; } print "Looking for the range $min_find_value, $max_find_value in an ar +ray with ", scalar(@array), " elements\n"; my ($min_found_ix, $max_found_ix) = binsearch_range($min_find_value, $ +max_find_value, @array); print "Found indexes of $min_found_ix, $max_found_ix\n"; # This algorithm was taken from Mastering Algorithms with Perl, OReill +y. sub binsearch_range { my $min = shift; my $max = shift; my $min_found_ix = binsearch {$a <=> $b} $min, @_; my $max_found_ix = binsearch {$a <=> $b} $max, @_; if ($max_found_ix == scalar(@array) || ($array[$max_found_ix] <=> +$max_find_value) > 0) { $max_found_ix--; } return($min_found_ix, $max_found_ix); }

    The MyArray package implements an array class that can be tied and that overrides FETCH. Its version of FETCH sleeps for a second to simulate generating the value. It also caches results. And it keeps a tally of cache hits and cache misses.

    The main code generates a list in sorted order of integers. The integers have random offsets just to simulate a non-sequential list. That list of integers lives in a tied array that has the behaviors described above. Next, the code calls my own version of binsearch_range that does the double binary search for you. And it returns the array offset to the min value and max value sought in the array. The specific algorithm used for binsearch_range is one that came from Mastering Algorithms with Perl, from OReilly. It's an oldie but goodie.

    If you run the code, sample runs look something like this:

    Looking for the range 232, 339 in an array with 100 elements FETCH (miss) 1: 49 FETCH (miss) 2: 74 FETCH (miss) 3: 62 FETCH (miss) 4: 56 FETCH (miss) 5: 53 FETCH (miss) 6: 55 FETCH (miss) 7: 87 FETCH (miss) 8: 81 FETCH (miss) 9: 84 FETCH (miss) 10: 86 FETCH (miss) 11: 85 FETCH (hit) 1: 86 Found indexes of 56, 86

    As you can see, nearly all our FETCHes were cache misses. But the range is array offset 56 through 86. Six of those misses were within the target range, so subsequent use of that target range would benefit from the caching.

    An alternative -- probably a better alternative -- would be to store in a plain old array (not a tied array) a list of objects. Those objects would be implemented such that they only calculate their value on fetch, and cache results. What follows is untested example code:

    use List::BinarySearch qw(binsearch_range); foreach my $ix (0..99) { my @params = params($ix); push @array, MyClass->new(@params); } my ($low_ix, $high_ix) = binsearch_range { $a <=> $b->fetch } $min_wan +ted, $max_wanted, @array;

    Your MyClass could be implemented something like this:

    package MyClass; sub new { my ($class, @parameters) = @_; return bless {params => \@parameters}, $class; } sub fetch { my $self = return $self->{'cache'} //= do { sleep 1; warn "cache miss. Generating value...\n"; _gen_value($self->{'params'}) #; Generate the expensive value +and store it in the cache. }; }

    This gets you away from the tricky implementation shortcomings of tie and how it interacts with List::BinarySearch::binsearch_range (patches to the module are welcome).

    In either case, your worst case scenario is log2(100) + log2(100) lookups, which is 14 lookups for a list of 100 elements, so at two minutes per lookup, your worst case is about a half hour. Not great, but to get any better you would have to find ways to cache sub-calculations or something (I'm thinking of how nicely memoization works for recursive solutions of the Fibonacci Sequence because the sub-calculations are cached).


    Dave

Re: Special binary search
by swl (Prior) on Jun 23, 2021 at 07:58 UTC

    The binary search approaches in 11134164 and 11134184 are probably best. However, if the range indices are usually near the ends of the arrays then it would also be worth looking at a simple linear search. This is particularly so if they are, on average, in the first and last ten items as then you have 10+10=20 iterations without much overhead.

    Something like this (untested) code might work:

    # set up data my $tgt_left = 1; my $tgt_right = 10; my @data = (0,0,0,0,0,1..10,100..150); my $left = 0; $left++ while $data[$left] < $tgt_left; my $right = $#data; $right-- while $data[$right] > $tgt_right; foreach my $i ($left..$right) { # do stuff with @data }

    Benchmarking would be needed to confirm or otherwise.

      Wow, thanks for all the swift help!

      I've gone with the first binary search, but changed it to look for a value inside the window. Once found, then I can just go up and down the list and populate the array until it hits one of the window limits. Then fill out the rest of the array with MIN or MAX values. So only one binary search needed.

      I had never seen Memoize before, using that.

      The class idea is great, but probably overkill in this case.