in reply to Re: A different kind of search algorithm?
in thread A different kind of search algorithm?
Basically, given a dataset that looks something like this (the right hand side of the peak is always descending, but with many duplicates):
[ASCII plot of the dataset: Y values from 71 down through 69, 68, 68, 67, 66, ... 1, 0, each held over a run of X values; X-axis marked at 0, 9000 and 57000]
The problem is to locate the peak at the 9030 X-axis mark.
A binary search doesn't work because it requires strictly ordered input. It would also need adapting to handle the (on average) ~700 X-axis values for each Y-axis datapoint.
The ternary search I linked might just work if the left hand side of the peak was flat. Maybe. But it would again require adaptation to handle the duplicates.
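For reference, the textbook version over a strictly unimodal array held in memory looks something like this (an illustrative sketch only, not code from the thread); the equal-valued plateaus are exactly what can make the narrowing step throw away the side that holds the peak:

```perl
## Illustrative sketch only: a textbook ternary search for the peak of a
## strictly unimodal array held in memory. Equal-valued plateaus can make
## the lo/hi narrowing step discard the side that holds the peak.
sub ternary_peak {
    my @data = @_;
    my( $lo, $hi ) = ( 0, $#data );
    while( $hi - $lo > 2 ) {
        my $third = int( ( $hi - $lo ) / 3 );
        my( $m1, $m2 ) = ( $lo + $third, $hi - $third );
        if( $data[ $m1 ] < $data[ $m2 ] ) { $lo = $m1 + 1 }
        else                              { $hi = $m2 - 1 }
    }
    ## Scan the few remaining candidates
    my $best = $lo;
    for my $i ( $lo + 1 .. $hi ) {
        $best = $i if $data[ $i ] > $data[ $best ];
    }
    return $best;
}
```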
A linear search from the left hand end will require 9030 files to be opened and read.
My Probe search can find the peak visiting as few as 7 files, if you are lucky enough to choose the exact best input value for the number of probes to use at each stage. You'd have to get very, very lucky to do that by chance, and I haven't found any way to determine it by inspection or calculation.
However, of the values I've tested, the worst case is 670 files needing to be visited, which is still far faster than the linear search. And you are just as unlikely to hit the worst case as you are the best case. On average, any number of partitions from 10 to 100 would find the peak by opening just 270 files.
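To give a rough idea of the shape of the thing, here is a bare-bones sketch of that kind of probe search. It is not my implementation (which has the refinements mentioned below); it assumes the data rises strictly to a single peak before the descending, duplicated tail, that ties go to the earliest probe, and read_value( $i ) is just a hypothetical stand-in for opening and reading file $i:

```perl
use List::Util qw( max min );

## read_value( $i ) is a hypothetical stand-in for "open file $i, read its value".
## Values are assumed non-negative (0 .. 71 in the example above); $nParts >= 3.
sub probe_peak {
    my( $lo, $hi, $nParts, $rOpened ) = @_;

    ## Base case: few enough candidates left to scan them all
    if( $hi - $lo <= $nParts ) {
        my( $bestI, $bestV ) = ( $lo, -1 );
        for my $i ( $lo .. $hi ) {
            ++$$rOpened;
            my $v = read_value( $i );
            ( $bestI, $bestV ) = ( $i, $v ) if $v > $bestV;
        }
        return $bestI;
    }

    ## Probe the boundaries of $nParts (roughly) equal partitions
    my( $bestI, $bestV ) = ( $lo, -1 );
    for my $p ( 0 .. $nParts ) {
        my $i = $lo + int( $p * ( $hi - $lo ) / $nParts );
        ++$$rOpened;
        my $v = read_value( $i );
        ( $bestI, $bestV ) = ( $i, $v ) if $v > $bestV;
    }

    ## The peak can only lie within one partition either side of the best probe,
    ## so recurse into just that slice. (Re-probed boundaries are not cached here.)
    my $seg = int( ( $hi - $lo ) / $nParts );
    return probe_peak( max( $lo, $bestI - $seg ),
                       min( $hi, $bestI + $seg ),
                       $nParts, $rOpened );
}
```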
A similar range of lo/ave/high seems to prevail wherever in the set the peak appears, and even when the number of files is increased to 1e5 & 1e6. Like a binary search, the bigger the dataset, the more effective it becomes.
The problem is that there doesn't appear to be any way of determining the best partitioning value, or even just a good one. The relationship between inputs and outputs is simply chaotic, as far as my maths is able to determine.
I've made a couple of further improvements to the implementation, and I'm on the track of another. But these tend to make only minor improvements to the efficiency (number of files opened/values visited) across the board; the chaos remains.
If I compare it against a binary search on a strictly sorted, unique dataset, with a fixed partitioning value (11 currently), and a wide range of targets to find, it performs pretty well. It requires less than twice as many inspections as the binary search, and is most often much closer. But if I do an exhaustive search of the partition values for any given search, I can usually find a value that will beat the binary.
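To make "an exhaustive search of the partition values" concrete, the sweep I mean looks roughly like this (reusing the hypothetical probe_peak()/read_value() sketch above; the dataset size and the range of partition counts tried are purely illustrative):

```perl
## Hypothetical harness: for one dataset/target, sweep candidate partition
## counts and record how many file opens each needs. $N (the dataset size)
## and the 3 .. 200 range are illustrative only.
my $N = 57_000;
my %opens;
for my $nParts ( 3 .. 200 ) {
    my $opened = 0;
    probe_peak( 0, $N - 1, $nParts, \$opened );
    $opens{ $nParts } = $opened;
}
my( $best ) = sort { $opens{ $a } <=> $opens{ $b } } keys %opens;
printf "Best partition count: %d (%d files opened)\n", $best, $opens{ $best };
```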
The problem is that the optimum partitioning value varies with a) the size of the dataset; b) the value being searched for; and c) minor variations in the implementation. For the latter, a small variation may slightly improve the efficiency of the search when the optimum value is used, but in the process it will move the optimum value from, say, 7 to 110.
If I could find a way to determine the optimum partition value without requiring an exhaustive search of a very wide input set, the algorithm might actually be generally useful, but I'm slowly reaching the conclusion that finding that optimum is, itself, an NP-hard problem :(
Replies are listed 'Best First'.
Re^3: A different kind of search algorithm?
by zby (Vicar) on Jan 30, 2007 at 16:39 UTC
by BrowserUk (Patriarch) on Jan 30, 2007 at 17:33 UTC
by zby (Vicar) on Jan 31, 2007 at 09:46 UTC