comment on

Basically, given a dataset that looks something like this (the right hand side of the peak is always descending, but with many duplicates):

71         ---
70            ---
69--             ---
68  -------         ---
68                     ---
67                        ---
66                           ---
..                        
..
1                                                      ---
0 --------------------------------------------------------------
  |         |                                                  |
  0        9000                                               57000
[download]

The problem is to locate the peek at the 9030 Y-axis mark.

A binary search doesn't work because that needs strictly ordered input. It would also require adapting to handle the (on average) ~700 X-axis values for each Y-axis datapoint.

The ternary search I linked might just work if the left hand side of the peak was flat. Maybe. But it would again require adaption to handle the duplicates.

A linear search from the left hand end will require 9030 files to be opened and read.

My Probe search can find the peak visiting as few as 7 files, if you are lucky enough to chose the exact best input value for the number of probes to use at each stage. You'd have to get very, very lucky to do that by chance and I haven't found any way to determine it by inspection or calculation.

However, of the values I've tested, the worst case is 670 files need to be visited, which is far faster than the linear search solution. And, You are equally as unlikely to hit the worst case as you are the best case. On average, any number of partitions from 10 to 100 would find the peak by opening just 270 files.

A similar range of lo/ave/high seems to prevail wherever in the set the peak appears, and even when the number of files is increased to 1e5 & 1e6. Like a binary search the, bigger the dataset, the more effective it becomes.

The problem is that there doesn't appear to be any way of determining the best partitioning value, or even just a good one. The inputs to outputs are simply chaotic as far as my maths is able to determine.

I've made a couple of further improvements to the implementation, and I'm on the track of another. But these tend to make minor improvements to the efficiency (number of file opened/values visited), across the board, but the chaos remains.

If I compare it against a binary search on a strictly sorted, unique dataset, with fixed partitioning value (11 currently), and a wide range of targets to find, it performs pretty well. It requires less than twice as many inspections as the binary search, and is most often much closer. But if I do an exhuastive search of the partition values for any given search, I can usually find a value that will beat the binary.

The problem is that the optimum partitoning value varies with a) the size of the dataset; b) with the value being searched for; c) and minor variatons in the implementation. For the latter, a small variation may slightly improve the efficiency of the search when the optimum value is used, but in the process, it will move the optimum value from say 7 to 110.

If I could find a way to determine the optimum partiton value without requiring an exhaustive search of a very wide input set, the algorithm might actually be generally useful, but I'm slowly reaching the conclusion that finding that optimum is itself, an NP-hard problem :(

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re^2: A different kind of search algorithm? by BrowserUk
in thread A different kind of search algorithm? by BrowserUk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.