In Optimising a search of several thousand files, McDarren presented an interesting search problem.

He found his solution--narrowing the start point through his knowledge of where the crash happened and then applying a linear search--quite quickly, before I had a chance to offer my solution.

However, if you didn't have that knowledge of the data, this becomes quite an interesting problem. As the data points are not strictly ascending, you cannot use a binary search.

The data points plotted (in reverse time order) would look something like:

71 --- 70 --- 69-- --- 68 ------- --- 68 --- 67 --- 66 --- .. 0 --------------------------------------------------------------

Note: I've shown the possibility that he continued playing after the crash and accumulated an extra level from the restored value of 68.

The problem is similar to finding a local maximum or minimum, for which there is an algorithm called a Ternary search. But if he'd managed to raise his level after the crash, that would no longer work, as it requires strictly descending values on either side of the maximum. It would also need adapting to handle duplicate values.
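
For reference, here is a minimal illustrative sketch of that ternary search (not from the original node): it assumes the array rises strictly to a single peak and then falls strictly, with no duplicate values anywhere.

sub ternary_max {
    ## Illustrative only: assumes @$aref rises strictly to one peak and
    ## then falls strictly, with no duplicate values.
    my( $aref, $lo, $hi ) = @_;
    while( $hi - $lo > 2 ) {
        my $third = int( ( $hi - $lo ) / 3 );
        my( $m1, $m2 ) = ( $lo + $third, $hi - $third );
        if( $aref->[ $m1 ] < $aref->[ $m2 ] ) {
            $lo = $m1 + 1;    ## the peak cannot be at or left of $m1
        }
        else {
            $hi = $m2 - 1;    ## the peak cannot be at or right of $m2
        }
    }
    ## at most three candidates remain; return the index of the largest
    my( $best ) = sort { $aref->[ $b ] <=> $aref->[ $a ] } $lo .. $hi;
    return $best;
}

The comparison of the two interior probes is only conclusive when values are distinct; with long runs of equal values the probes can compare equal and tell you nothing about which side the peak lies, hence the adaptation it would need.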

Playing around, I came up with what I'll call a Probe Search, in the absence of any prior art I could find that looks similar.

The basic idea is that you probe the dataset at some regular interval, looking for a value greater than the $currlevel value (or 0, if you have no starting value). If you fail to find a higher value (quite likely), you halve the step size, loop back and probe again.

If you succeed in finding a higher value, you then have an upper/rightmost boundary at the probe that found it, and the preceding probe forms a lower/leftmost boundary. So you set your hi and lo limits to these, recalculate the step size, reset the limit to the higher value you found, and loop back to the start.

Eventually, you will have narrowed the boundaries and raised the limit to the point where you fail to find another higher value. You then step back linearly from the hi limit to find the leftmost file with that highest value, and you're there.

Here is a subroutine to implement the algorithm. I've substituted an array for the files for testing.

sub probe {
    use integer;
    my( $aref, $lo, $hi, $limit, $istep ) = @_;
    my $step = ( $hi - $lo ) / $istep;
    {
        ## Probe rightwards from $lo in $step-sized hops until we pass $hi
        ## or find a value above the current $limit.
        my $i = $lo;
        $i += $step while $i <= $hi and $aref->[ $i ] <= $limit;

        if( $i >= $hi ) {
            ## Nothing above $limit was found in this pass.
            if( $aref->[ $hi ] == $limit ) {
                ## The high boundary holds the limit value; step back
                ## linearly to the leftmost element of that run.
                --$hi while $aref->[ $hi - 1 ] == $limit;
                return $hi;
            }
            $step /= 2;    ## halve the step size and re-probe
        }
        else {
            ## Found a higher value at $i: it becomes the new limit and
            ## the new upper boundary, and the lower boundary moves up.
            $limit = $aref->[ $i ];
            $lo    = $hi - $step;   ## Update: Swapped the order of this line and the next
            $hi    = $i;
            $step  = int( ( $hi - $lo ) / $istep );   # $istep was wrongly '10'
        }
        redo;
    }
}
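
For the real problem each inspection would be a file open and read rather than an array index. Purely as an illustration (the directory, file naming and line format below are assumptions, not details from McDarren's node), the array could be swapped for a code ref along these lines, with probe() calling $read->( $i ) wherever it currently indexes $aref:

sub make_reader {
    ## Hypothetical adapter: every detail here (path, naming, format) is
    ## an assumption for illustration only.
    my( $dir ) = @_;
    return sub {
        my( $i ) = @_;
        my $file = sprintf '%s/save%05d.log', $dir, $i;
        open my $fh, '<', $file or die "Can't open '$file': $!";
        my( $level ) = <$fh> =~ m[^level=(\d+)];   ## assume 'level=NN' on line 1
        close $fh;
        return $level;
    };
}

Then my $read = make_reader( './saves' ); and $read->( 9029 ) would play the role of $levels[ 9029 ] in the test harness below.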

To test this out I generated an array of values to simulate the level values in the files:

my $ramping   = 57000 - 9029;
my $step      = int( $ramping / 72 );
my $leftovers = $ramping - ( $step * 72 );

my $tiedObj = tie my @levels, 'Tie::Array::CountReads';

@levels = (
    ( 68 ) x 9029,
    map( ( ( $_ ) x $step ), reverse 0 .. 71 ),
    ( 0 ) x $leftovers
);

I tied that array to a custom tie class, to allow me to count the number of read references to the array. This represents the number of files you would have to read.
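
For example (a tiny sketch using the $tiedObj and @levels set up above), every read of the tied array bumps the counter:

$tiedObj->resetCount;                            ## start counting from zero
print $levels[ 0 ], ' ', $levels[ 9029 ], "\n";  ## two FETCHes; prints "68 71"
print $tiedObj->readCount, " reads\n";           ## prints "2 reads"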

The interesting thing is that the size of the initial step for the probing has a dramatic effect on the efficiency of the search. Unfortunately, the relationship between the initial step size and the number of files that need to be read is rather chaotic. For example, the number of files that it needs to inspect can be as low as 34 with an initial step size of either 58 or 82. But get the choice wrong by just a few and it could take as many as 670 with an initial step size of 53. That's still far less than the 9030 that a linear search from the top takes, though.

Update: I just found a minor error in my implementation (marked above and below). I was setting the low limit after I adjusted the high limit, which meant that I was re-probing a much wider band than necessary. Correcting that mistake means that the algorithm is now even faster. For the original problem, it now requires visiting just 7 files with the initial step size set to 11. That's a huge improvement at the good end, and fortunately the change seems to have improved the performance across the board, although the worst case remains about the same (though it moves).

Update2: Another minor correction (# $istep was wrongly '10'). No great change to the results.

This shows the number of files that need to be visited for all the initial step sizes from 10 to 100:

C:\test>junk8
Initial step size: 10 located 'file' 9695 after probing 318 files
Initial step size: 11 located 'file' 10361 after probing 7 files
Initial step size: 12 located 'file' 9029 after probing 475 files
Initial step size: 13 located 'file' 10361 after probing 623 files
Initial step size: 14 located 'file' 9695 after probing 505 files
Initial step size: 15 located 'file' 9029 after probing 492 files
Initial step size: 16 located 'file' 10361 after probing 332 files
Initial step size: 17 located 'file' 9695 after probing 368 files
Initial step size: 18 located 'file' 9029 after probing 476 files
Initial step size: 19 located 'file' 10361 after probing 164 files
Initial step size: 20 located 'file' 9695 after probing 306 files
Initial step size: 21 located 'file' 10361 after probing 503 files
Initial step size: 22 located 'file' 9695 after probing 673 files
Initial step size: 23 located 'file' 9695 after probing 225 files
Initial step size: 24 located 'file' 9029 after probing 475 files
Initial step size: 25 located 'file' 9029 after probing 95 files
Initial step size: 26 located 'file' 10361 after probing 608 files
Initial step size: 27 located 'file' 10361 after probing 203 files
Initial step size: 28 located 'file' 9695 after probing 489 files
Initial step size: 29 located 'file' 9695 after probing 139 files
Initial step size: 30 located 'file' 9029 after probing 475 files
Initial step size: 31 located 'file' 9029 after probing 170 files
Initial step size: 32 located 'file' 10361 after probing 335 files
Initial step size: 33 located 'file' 10361 after probing 11 files
Initial step size: 34 located 'file' 9695 after probing 371 files
Initial step size: 35 located 'file' 9695 after probing 83 files
Initial step size: 36 located 'file' 9029 after probing 479 files
Initial step size: 37 located 'file' 9029 after probing 221 files
Initial step size: 38 located 'file' 10361 after probing 143 files
Initial step size: 39 located 'file' 9695 after probing 543 files
Initial step size: 40 located 'file' 9695 after probing 284 files
Initial step size: 41 located 'file' 9695 after probing 46 files
Initial step size: 42 located 'file' 9029 after probing 481 files
Initial step size: 43 located 'file' 9029 after probing 257 files
Initial step size: 44 located 'file' 9029 after probing 47 files
Initial step size: 45 located 'file' 9695 after probing 445 files
Initial step size: 46 located 'file' 9695 after probing 229 files
Initial step size: 47 located 'file' 9695 after probing 13 files
Initial step size: 48 located 'file' 9029 after probing 479 files
Initial step size: 49 located 'file' 9029 after probing 287 files
Initial step size: 50 located 'file' 9029 after probing 95 files
Initial step size: 51 located 'file' 9695 after probing 371 files
Initial step size: 52 located 'file' 9695 after probing 182 files
Initial step size: 53 located 'file' 9029 after probing 659 files
Initial step size: 54 located 'file' 9029 after probing 479 files
Initial step size: 55 located 'file' 9029 after probing 308 files
Initial step size: 56 located 'file' 9029 after probing 137 files
Initial step size: 57 located 'file' 9695 after probing 309 files
Initial step size: 58 located 'file' 9695 after probing 139 files
Initial step size: 59 located 'file' 9029 after probing 645 files
Initial step size: 60 located 'file' 9029 after probing 475 files
Initial step size: 61 located 'file' 9029 after probing 325 files
Initial step size: 62 located 'file' 9029 after probing 175 files
Initial step size: 63 located 'file' 9029 after probing 25 files
Initial step size: 64 located 'file' 9695 after probing 110 files
Initial step size: 65 located 'file' 9029 after probing 622 files
Initial step size: 66 located 'file' 9029 after probing 479 files
Initial step size: 67 located 'file' 9029 after probing 336 files
Initial step size: 68 located 'file' 9029 after probing 204 files
Initial step size: 69 located 'file' 9029 after probing 72 files
Initial step size: 70 located 'file' 9695 after probing 89 files
Initial step size: 71 located 'file' 9029 after probing 611 files
Initial step size: 72 located 'file' 9029 after probing 479 files
Initial step size: 73 located 'file' 9029 after probing 347 files
Initial step size: 74 located 'file' 9029 after probing 227 files
Initial step size: 75 located 'file' 9029 after probing 95 files
Initial step size: 76 located 'file' 9695 after probing 59 files
Initial step size: 77 located 'file' 9029 after probing 608 files
Initial step size: 78 located 'file' 9029 after probing 478 files
Initial step size: 79 located 'file' 9029 after probing 361 files
Initial step size: 80 located 'file' 9029 after probing 244 files
Initial step size: 81 located 'file' 9029 after probing 127 files
Initial step size: 82 located 'file' 9029 after probing 23 files
Initial step size: 83 located 'file' 9029 after probing 593 files
Initial step size: 84 located 'file' 9029 after probing 481 files
Initial step size: 85 located 'file' 9029 after probing 369 files
Initial step size: 86 located 'file' 9029 after probing 257 files
Initial step size: 87 located 'file' 9029 after probing 159 files
Initial step size: 88 located 'file' 9029 after probing 47 files
Initial step size: 89 located 'file' 9029 after probing 590 files
Initial step size: 90 located 'file' 9029 after probing 485 files
Initial step size: 91 located 'file' 9029 after probing 380 files
Initial step size: 92 located 'file' 9029 after probing 275 files
Initial step size: 93 located 'file' 9029 after probing 170 files
Initial step size: 94 located 'file' 9029 after probing 80 files
Initial step size: 95 located 'file' 9029 after probing 575 files
Initial step size: 96 located 'file' 9029 after probing 479 files
Initial step size: 97 located 'file' 9029 after probing 383 files
Initial step size: 98 located 'file' 9029 after probing 287 files
Initial step size: 99 located 'file' 9029 after probing 191 files
Initial step size: 100 located 'file' 9029 after probing 95 files

Horribly chaotic, but the chaos seems to even out (reaching less extreme values at either end) if you use larger values for the initial step size. Here are the results of using step sizes from 100 to 1000 in steps of 10:

Initial step size: 100 located 'file' 9029 after probing 95 files
Initial step size: 110 located 'file' 9029 after probing 317 files
Initial step size: 120 located 'file' 9029 after probing 475 files
Initial step size: 130 located 'file' 9029 after probing 194 files
Initial step size: 140 located 'file' 9029 after probing 359 files
Initial step size: 150 located 'file' 9029 after probing 95 files
Initial step size: 160 located 'file' 9029 after probing 257 files
Initial step size: 170 located 'file' 9029 after probing 47 files
Initial step size: 180 located 'file' 9029 after probing 168 files
Initial step size: 190 located 'file' 9029 after probing 275 files
Initial step size: 200 located 'file' 9029 after probing 95 files
Initial step size: 210 located 'file' 9029 after probing 223 files
Initial step size: 220 located 'file' 9029 after probing 75 files
Initial step size: 230 located 'file' 9029 after probing 151 files
Initial step size: 240 located 'file' 9029 after probing 257 files
Initial step size: 250 located 'file' 9029 after probing 95 files
Initial step size: 260 located 'file' 9029 after probing 215 files
Initial step size: 270 located 'file' 9029 after probing 91 files
Initial step size: 280 located 'file' 9029 after probing 155 files
Initial step size: 290 located 'file' 9029 after probing 234 files
Initial step size: 300 located 'file' 9029 after probing 95 files
Initial step size: 310 located 'file' 9029 after probing 175 files
Initial step size: 320 located 'file' 9029 after probing 104 files
Initial step size: 330 located 'file' 9029 after probing 144 files
Initial step size: 340 located 'file' 9029 after probing 215 files
Initial step size: 350 located 'file' 9029 after probing 103 files
Initial step size: 360 located 'file' 9029 after probing 197 files
Initial step size: 370 located 'file' 9029 after probing 120 files
Initial step size: 380 located 'file' 9029 after probing 125 files
Initial step size: 390 located 'file' 9029 after probing 89 files
Initial step size: 400 located 'file' 9029 after probing 127 files
Initial step size: 410 located 'file' 9029 after probing 75 files
Initial step size: 420 located 'file' 9029 after probing 87 files
Initial step size: 430 located 'file' 9029 after probing 152 files
Initial step size: 440 located 'file' 9029 after probing 75 files
Initial step size: 450 located 'file' 9029 after probing 119 files
Initial step size: 460 located 'file' 9029 after probing 151 files
Initial step size: 470 located 'file' 9029 after probing 125 files
Initial step size: 480 located 'file' 9029 after probing 138 files
Initial step size: 490 located 'file' 9029 after probing 101 files
Initial step size: 500 located 'file' 9029 after probing 95 files
Initial step size: 510 located 'file' 9029 after probing 159 files
Initial step size: 520 located 'file' 9029 after probing 105 files
Initial step size: 530 located 'file' 9029 after probing 155 files
Initial step size: 540 located 'file' 9029 after probing 91 files
Initial step size: 550 located 'file' 9029 after probing 127 files
Initial step size: 560 located 'file' 9029 after probing 155 files
Initial step size: 570 located 'file' 9029 after probing 175 files
Initial step size: 580 located 'file' 9029 after probing 182 files
Initial step size: 590 located 'file' 9029 after probing 190 files
Initial step size: 600 located 'file' 9029 after probing 190 files
Initial step size: 610 located 'file' 9029 after probing 187 files
Initial step size: 620 located 'file' 9029 after probing 175 files
Initial step size: 630 located 'file' 9029 after probing 166 files
Initial step size: 640 located 'file' 9029 after probing 155 files
Initial step size: 650 located 'file' 9029 after probing 127 files
Initial step size: 660 located 'file' 9029 after probing 110 files
Initial step size: 670 located 'file' 9029 after probing 177 files
Initial step size: 680 located 'file' 9029 after probing 131 files
Initial step size: 690 located 'file' 9029 after probing 188 files
Initial step size: 700 located 'file' 9029 after probing 159 files
Initial step size: 710 located 'file' 9029 after probing 128 files
Initial step size: 720 located 'file' 9029 after probing 175 files
Initial step size: 730 located 'file' 9029 after probing 139 files
Initial step size: 740 located 'file' 9029 after probing 179 files
Initial step size: 750 located 'file' 9029 after probing 171 files
Initial step size: 760 located 'file' 9029 after probing 200 files
Initial step size: 770 located 'file' 9029 after probing 200 files
Initial step size: 780 located 'file' 9029 after probing 151 files
Initial step size: 790 located 'file' 9029 after probing 173 files
Initial step size: 800 located 'file' 9029 after probing 191 files
Initial step size: 810 located 'file' 9029 after probing 134 files
Initial step size: 820 located 'file' 9029 after probing 145 files
Initial step size: 830 located 'file' 9029 after probing 152 files
Initial step size: 840 located 'file' 9029 after probing 155 files
Initial step size: 850 located 'file' 9029 after probing 155 files
Initial step size: 860 located 'file' 9029 after probing 154 files
Initial step size: 870 located 'file' 9029 after probing 149 files
Initial step size: 880 located 'file' 9029 after probing 205 files
Initial step size: 890 located 'file' 9029 after probing 205 files
Initial step size: 900 located 'file' 9029 after probing 191 files
Initial step size: 910 located 'file' 9029 after probing 173 files
Initial step size: 920 located 'file' 9029 after probing 213 files
Initial step size: 930 located 'file' 9029 after probing 213 files
Initial step size: 940 located 'file' 9029 after probing 186 files
Initial step size: 950 located 'file' 9029 after probing 215 files
Initial step size: 960 located 'file' 9029 after probing 215 files
Initial step size: 970 located 'file' 9029 after probing 179 files
Initial step size: 980 located 'file' 9029 after probing 179 files
Initial step size: 990 located 'file' 9029 after probing 197 files
Initial step size: 1000 located 'file' 9029 after probing 209 files

The reason for making this a meditation (besides that McDarren already had his answer and I'd already done the work :) is that I wonder:

  1. Is this generally useful enough to be worth developing?
  2. Does anyone know of (and have a pointer to) any prior art that is similar, or that tackles a similar problem?
  3. Can anyone suggest a way of 'guessing' a good initial step size? Or an approach for determining the same?

The test code:

#! perl -slw
use strict;

package Tie::Array::CountReads;
## A tied array that counts FETCHes, so each read stands in for a file open/read.

use Tie::Array;
our @ISA = 'Tie::Array';

my %readCounts;

sub TIEARRAY {
    my $self = bless [], $_[ 0 ];
    $readCounts{ $self } = 0;
    return $self;
}
sub FETCH     { ++$readCounts{ $_[0] }; $_[0]->[ $_[1] ] }
sub FETCHSIZE { scalar @{ $_[0] } }
sub STORESIZE { $#{ $_[0] } = $_[1] }
sub STORE     { $_[0]->[ $_[ 1 ] ] = $_[ 2 ] }

sub readCount  { $readCounts{ $_[ 0 ] } }
sub resetCount { $readCounts{ $_[0] } = 0 }

package main;

my $ramping   = 57000 - 9029;
my $step      = int( $ramping / 72 );
my $leftovers = $ramping - ( $step * 72 );

my $tiedObj = tie my @levels, 'Tie::Array::CountReads';

@levels = (
    ( 68 ) x 9029,
    map( ( ( $_ ) x $step ), reverse 0 .. 71 ),
    ( 0 ) x $leftovers
);
#print scalar @levels;

sub probe {
    use integer;
    my( $aref, $lo, $hi, $limit, $istep ) = @_;
    my $step = ( $hi - $lo ) / $istep;
    {
        my $i = $lo;
        $i += $step while $i <= $hi and $aref->[ $i ] <= $limit;

        if( $i >= $hi ) {
            if( $aref->[ $hi ] == $limit ) {
                --$hi while $aref->[ $hi - 1 ] == $limit;
                return $hi;
            }
            $step /= 2;
        }
        else {
            $limit = $aref->[ $i ];
            $lo    = $hi - $step;   ## Update: Swapped the order of this line and the next
            $hi    = $i;
            $step  = int( ( $hi - $lo ) / $istep );   # $istep was wrongly '10'
        }
        redo;
    }
}

for my $istep ( 10 .. 100 ) {
    printf "Initial step size: $istep located 'file' %d after probing %d files\n",
        probe( \@levels, 0, $#levels, 68, $istep ),
        $tiedObj->readCount;
    $tiedObj->resetCount;
}

for my $istep ( map $_ * 10, 10 .. 100 ) {
    printf "Initial step size: $istep located 'file' %d after probing %d files\n",
        probe( \@levels, 0, $#levels, 68, $istep ),
        $tiedObj->readCount;
    $tiedObj->resetCount;
}


Re: A different kind of search algorithm?
by diotalevi (Canon) on Jan 29, 2007 at 09:06 UTC

    That sounded like simulated annealing, and what I've read recently says there aren't any generally useful preset values for your search algorithm. You could consider the problem of getting optimal preset values as just another search problem. If you have no useful ideas for initial values, you can start with random values and search for good search parameters.
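
    For example, a crude sketch of that suggestion (it assumes the probe(), @levels and $tiedObj from the test code in the root node) might just sample random initial step sizes and keep the cheapest one found:

    ## Rough illustration only: treat choosing $istep as a search problem itself.
    my( $bestStep, $bestCost );
    for ( 1 .. 20 ) {                       ## 20 random restarts
        my $istep = 10 + int rand 990;      ## candidate initial step size, 10..999
        $tiedObj->resetCount;
        probe( \@levels, 0, $#levels, 68, $istep );
        my $cost = $tiedObj->readCount;     ## 'files' read for this candidate
        if( not defined $bestCost or $cost < $bestCost ) {
            ( $bestStep, $bestCost ) = ( $istep, $cost );
        }
    }
    print "Best of 20 random tries: istep=$bestStep ($bestCost reads)\n";

    Each trial pays the full read cost itself, of course, so this only pays off if many similar searches will be run over data of the same general shape.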

    [Added: I briefly wondered "Was this the A* algorithm?" but I was wrong. I could likely find it in a nearby textbook but um... it's late and I'm more motivated to go to sleep.]


Re: A different kind of search algorithm?
by zby (Vicar) on Jan 30, 2007 at 12:45 UTC
    You refer to the Optimising a search of several thousand files node - but then you say something about relaxing some constraints on the search space. Without reading the original node I found it impossible to work out what constraints you are actually using. So what, exactly, is the problem that you presented the solution for?

      Basically, given a dataset that looks something like this (the right hand side of the peak is always descending, but with many duplicates):

      71 --- 70 --- 69-- --- 68 ------- --- 68 --- 67 --- 66 --- .. .. 1 --- 0 --------------------------------------------------------------
      |                                     |                                                                                               |
      0                                     9000                                                                                        57000

      The problem is to locate the peak at the 9030 X-axis mark.

      A binary search doesn't work because it needs strictly ordered input. It would also require adapting to handle the (on average) ~700 X-axis values for each Y-axis datapoint.
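
      For contrast, the duplicate-handling adaptation a binary search would need can be sketched like this (illustrative only, and it still assumes monotone data, which this isn't): given a non-increasing array, find the leftmost index holding the target level.

      sub leftmost {
          ## Illustration only: assumes @$aref is non-increasing (sorted
          ## descending, duplicates allowed), which the real data is not.
          my( $aref, $target ) = @_;
          return -1 unless @$aref;
          my( $lo, $hi ) = ( 0, $#$aref );
          while( $lo < $hi ) {
              my $mid = int( ( $lo + $hi ) / 2 );
              if( $aref->[ $mid ] > $target ) {
                  $lo = $mid + 1;    ## still above the target level; go right
              }
              else {
                  $hi = $mid;        ## $mid could be the leftmost match
              }
          }
          return $aref->[ $lo ] == $target ? $lo : -1;   ## -1 if absent
      }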

      The ternary search I linked might just work if the left hand side of the peak was flat. Maybe. But it would again require adapting to handle the duplicates.

      A linear search from the left hand end will require 9030 files to be opened and read.

      My Probe search can find the peak visiting as few as 7 files, if you are lucky enough to choose the exact best input value for the number of probes to use at each stage. You'd have to get very, very lucky to do that by chance, and I haven't found any way to determine it by inspection or calculation.

      However, of the values I've tested, the worst case is 670 files needing to be visited, which is still far better than the linear search solution. And you are just as unlikely to hit the worst case as you are the best case. On average, any number of partitions from 10 to 100 would find the peak by opening just 270 files.

      A similar range of low/average/high seems to prevail wherever in the set the peak appears, and even when the number of files is increased to 1e5 and 1e6. Like a binary search, the bigger the dataset, the more effective it becomes.

      The problem is that there doesn't appear to be any way of determining the best partitioning value, or even just a good one. The mapping from inputs to outputs is simply chaotic, as far as my maths is able to determine.

      I've made a couple of further improvements to the implementation, and I'm on the track of another. But these tend to make minor improvements to the efficiency (the number of files opened/values visited) across the board, and the chaos remains.

      If I compare it against a binary search on a strictly sorted, unique dataset, with a fixed partitioning value (11 currently), and a wide range of targets to find, it performs pretty well. It requires less than twice as many inspections as the binary search, and is most often much closer. But if I do an exhaustive search of the partition values for any given search, I can usually find a value that will beat the binary search.

      The problem is that the optimum partitioning value varies with a) the size of the dataset; b) the value being searched for; and c) minor variations in the implementation. For the latter, a small variation may slightly improve the efficiency of the search when the optimum value is used, but in the process it will move the optimum value from, say, 7 to 110.

      If I could find a way to determine the optimum partition value without requiring an exhaustive search of a very wide input set, the algorithm might actually be generally useful, but I'm slowly reaching the conclusion that finding that optimum is itself an NP-hard problem :(


        I am a bit uneasy about posting this - but I still don't know what "dataset that looks something like this" means in more precise terms. The most probable guess is that it is a partial function that is decreasing at the start, then has a peak, and is again decreasing. But one can also imagine other constraints compatible with that picture.