in reply to Re: A different kind of search algorithm?
in thread A different kind of search algorithm?

Basically, given a dataset that looks something like this (the right-hand side of the peak is always descending, but with many duplicates):

    [ASCII step plot: Y-axis labelled 71, 70, 69, 68, 67, 66, ... down to 1 and 0, with a long run of duplicate values at each level; X-axis ticks at 0, 9000 and 57000.]
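To make the shape concrete, here is a minimal Perl sketch of a dataset with roughly that profile -- not the real data, and every level and run length in it is an illustrative assumption. It follows the description given further down this thread: reverse chronological order, so the most recent (post-restore) levels come first, then the instant jump back up to the pre-crash peak, then the long descent to 0 in plateaus of duplicates:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # NOT the OP's data: an illustrative series in reverse chronological
    # order. Indices 0..8999 are the post-restore levels (68 back down to
    # the restored level 60), then the pre-crash history: the peak of 71
    # at index 9000 (cf. the 9030 mark above), stepping down to 0.
    my @data;
    push @data, ($_) x 1000 for reverse 60 .. 68;   # newest values first
    push @data, ($_) x 700  for reverse 0 .. 71;    # peak, then descent to 0

    printf "%d values, peak %d at index %d\n", scalar @data, 71, 9_000;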

The problem is to locate the peak at the 9030 X-axis mark.

A binary search doesn't work because it needs strictly ordered input, and it would also require adapting to handle the (on average) ~700 X-axis values for each Y-axis datapoint.

The ternary search I linked might just work if the left-hand side of the peak were flat. Maybe. But it would again require adaptation to handle the duplicates.

A linear search from the left-hand end would require 9030 files to be opened and read.
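For comparison, a sketch of that baseline, reusing the illustrative @data built above. The stopping rule leans on the structure stated further down the thread: the only rise anywhere in the reversed series is the jump back up to the pre-crash peak.

    use strict;
    use warnings;

    my @data;
    push @data, ($_) x 1000 for reverse 60 .. 68;   # as in the sketch above
    push @data, ($_) x 700  for reverse 0 .. 71;

    # Scan left to right; the first value that exceeds its predecessor is
    # the jump back to the pre-crash peak. Every index up to the peak gets
    # read -- ~9000 opens on this data, matching the figure quoted above.
    my $reads = 1;
    my $peak  = 0;                # falls back to index 0 if there is no rise
    for my $i ( 1 .. $#data ) {
        $reads++;
        if ( $data[$i] > $data[ $i - 1 ] ) { $peak = $i; last }
    }
    printf "peak %d at index %d after %d reads\n", $data[$peak], $peak, $reads;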

My Probe search can find the peak by visiting as few as 7 files, if you are lucky enough to choose the exact best input value for the number of probes to use at each stage. You'd have to get very, very lucky to do that by chance, and I haven't found any way to determine it by inspection or calculation.

However, of the values I've tested, the worst case is that 670 files need to be visited, which is still far faster than the linear search. And you are just as unlikely to hit the worst case as the best case. On average, any number of partitions from 10 to 100 would find the peak by opening just 270 files.
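The node doesn't include the Probe search code itself, so what follows is only a minimal sketch of one plausible scheme in that spirit -- not the actual implementation, and every name in it is made up. It assumes the structure described below: the reversed series descends, jumps up exactly once at the crash boundary (the peak), then descends to 0, so any value read that exceeds the x = 0 value must sit in the plateau starting at the peak. The sketch throws a configurable number of evenly spaced probes at the live interval, doubles the probe count if they all miss, and once a probe hits the plateau, pins down its left edge (the peak) with an ordinary binary search:

    use strict;
    use warnings;

    my @data;
    push @data, ($_) x 1000 for reverse 60 .. 68;   # as in the sketches above
    push @data, ($_) x 700  for reverse 0 .. 71;

    my $visits = 0;
    sub read_at { my $i = shift; ++$visits; return $data[$i] }  # stands in for a file open

    sub probe_peak {
        my ($parts) = @_;
        my $v0 = read_at(0);              # the most recent (post-restore) level
        my ( $lo, $hi ) = ( 0, $#data );

        # Stage 1: spread $parts probes over ($lo, $hi] until one lands in
        # the run of values above $v0. Assumes such a run exists, i.e. the
        # pre-crash peak really does exceed the current level.
        my ( $left, $hit );
        PROBE: while (1) {
            my $prev = $lo;
            for my $p ( 1 .. $parts ) {
                my $i = $lo + int( $p * ( $hi - $lo ) / $parts );
                if ( read_at($i) > $v0 ) { ( $left, $hit ) = ( $prev, $i ); last PROBE }
                $prev = $i;
            }
            $parts *= 2;                  # every probe missed: probe more densely
        }

        # Stage 2: the predicate "value > $v0" is false at $left, true at
        # $hit, and flips exactly once in between, so a plain binary search
        # finds the first index above $v0 -- the peak.
        while ( $hit - $left > 1 ) {
            my $mid = int( ( $left + $hit ) / 2 );
            if ( read_at($mid) > $v0 ) { $hit = $mid } else { $left = $mid }
        }
        return ( $hit, read_at($hit) );
    }

    my ( $at, $peak ) = probe_peak(11);
    printf "peak %d at index %d, %d values visited\n", $peak, $at, $visits;

In this sketch the probe count plays the same role as the partition value discussed below: it is the up-front cost paid to bracket the plateau before the cheap bisection phase, and picking it well or badly is exactly where the chaos shows up.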

A similar range of low/average/high seems to prevail wherever in the set the peak appears, and even when the number of files is increased to 1e5 and 1e6. Like a binary search, the bigger the dataset, the more effective it becomes.

The problem is that there doesn't appear to be any way of determining the best partitioning value, or even just a good one. The mapping from inputs to outputs is simply chaotic, as far as my maths is able to determine.

I've made a couple of further improvements to the implementation, and I'm on the track of another. But these tend to make only minor, across-the-board improvements to the efficiency (number of files opened/values visited); the chaos remains.

If I compare it against a binary search on a strictly sorted, unique dataset, with a fixed partitioning value (11 currently) and a wide range of targets to find, it performs pretty well. It requires less than twice as many inspections as the binary search, and is most often much closer. But if I do an exhaustive search of the partition values for any given search, I can usually find a value that will beat the binary search.
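That exhaustive sweep is easy to sketch, reusing probe_peak() and the $visits counter from the sketch above (the 2 .. 200 range is an arbitrary assumption):

    # Try every partition count in a range and report the cheapest one for
    # this particular dataset -- the per-search optimum described above.
    my ( $best_parts, $best_visits );
    for my $parts ( 2 .. 200 ) {
        $visits = 0;
        probe_peak($parts);
        ( $best_parts, $best_visits ) = ( $parts, $visits )
            if !defined $best_visits || $visits < $best_visits;
    }
    printf "best partition count: %d (%d values visited)\n",
        $best_parts, $best_visits;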

The problem is that the optimum partitioning value varies with a) the size of the dataset; b) the value being searched for; and c) minor variations in the implementation. For the latter, a small variation may slightly improve the efficiency of the search when the optimum value is used, but in the process it will move the optimum value from, say, 7 to 110.

If I could find a way to determine the optimum partition value without requiring an exhaustive search of a very wide input set, the algorithm might actually be generally useful, but I'm slowly reaching the conclusion that finding that optimum is itself an NP-hard problem :(


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^3: A different kind of search algorithm?
by zby (Vicar) on Jan 30, 2007 at 16:39 UTC
    I am a bit uneasy posting this - but I still don't know what "dataset that looks something like this" means in more precise terms. The most probable guess is that this is a partial function that is decreasing at the start, then has a peak, and then is decreasing again. But one can also imagine other constraints compatible with that picture.

      Sorry. You'd have to read the original post to fully understand how the dataset was arrived at.

      I'm not going to repeat all the details from that original post, but the Y-axis represents UserLevel in an RPG; the X-axis is a timebase covering several years, plotted in reverse chronological order. The dip after the peak is due to a computer crash and the restoration of a backup several months old.

      All of which should tell you that there is no mathematical formula that will fit the dataset. It is simply a set of numbers that increase from 0 to some peak, drop instantly to a lower level (and then possibly increase slightly again). That, plus the fact that there are multiple (hundreds of) consecutive timebase values at each userlevel value, is all that can be said, or needs to be said, to describe the problem.

      Which is simply to discover where the peak value is, and what it is. And how to arrive at that information most efficiently.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        OK - when I read this, it is exactly what I was looking for. When I use the word 'function' I don't mean a formula for computing that function - but an 'assignment'. The conditions that you describe - i.e. that the function is increasing up to a point, where it drops and then increases again - are as good as any other conditions: they are precise. :)