"So O(1) would be considered a sensation in academic circles! IMHO binary search is mostly faster than your algorithm. If you want to argue with something like average-case analysis you should indicate it clearly; worst case is the usual approach."

It was a headline, not a formal thesis. The absence of any further proof or discussion, combined with the "crude demo code", should be a strong indication of the lack of formalism.

I have little time for theoretical correctness. Optimising for the worst case, when that worst case is probabilistically very unlikely, means pessimising the statistically very probable average case. And in the real world, that's a nonsense.

For example, as you say, at O(log N) a binary search is the theoretically fastest algorithm for searching ordered data. But...for practical implementations there is a well-known optimisation that reverts to a linear search once the partition size drops below some threshold: the cost of a simple increment-and-test is less than that of the add, divide, index, test-less-than, test-more-than, increment-or-decrement sequence of each binary chop.
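To illustrate, here is a minimal sketch of that hybrid (not the demo code referred to above; the threshold of 64 is illustrative only and should be tuned per language and platform):

    use strict;
    use warnings;

    my $THRESHOLD = 64;    # illustrative; tune for the language/platform

    sub hybrid_bsearch {
        my( $target, $aref ) = @_;
        my( $lo, $hi ) = ( 0, $#$aref );
        # Binary chop only while the partition is large enough to pay for it
        while( $hi - $lo > $THRESHOLD ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            if( $aref->[ $mid ] < $target ) { $lo = $mid + 1 }
            else                            { $hi = $mid     }
        }
        # Below the threshold, a simple linear scan is cheaper per step
        for my $i ( $lo .. $hi ) {
            return $i if $aref->[ $i ] == $target;
        }
        return -1;    # not found
    }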

The threshold will vary depending upon the language being used and the relative costs of performing the simple math and condition testing involved in the binary chop algorithm. I seem to recall that in hand-crafted assembler it was partitions smaller than 2**7; and in C, those less than 2**6. Probably lower in Perl, but still, reality trumps theory.

And it is that reality that forms the basis of the code I demonstrated. If you can jump directly into the dataset on your first probe and land sufficiently close to your target to be within the threshold for which reverting to a linear search is optimal, then you can beat the theoretical O(log N) by some considerable margin.
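In outline, that first-probe idea looks something like this (a sketch only, assuming sorted integer keys that are roughly evenly distributed; it is not the original demo code):

    # Probe where the key "ought" to be, then walk linearly to the target
    sub dictionary_search {
        my( $target, $aref ) = @_;
        my( $min, $max ) = ( $aref->[ 0 ], $aref->[ -1 ] );
        return -1 if $target < $min or $target > $max;

        # First probe: interpolate the likely index from the key's value
        my $i = int( $#$aref * ( $target - $min ) / ( $max - $min ) );

        # If the data is evenly distributed, this loop runs O(1) times
        if( $aref->[ $i ] <= $target ) {
            ++$i while $aref->[ $i ] < $target;
        }
        else {
            --$i while $aref->[ $i ] > $target;
        }
        return $aref->[ $i ] == $target ? $i : -1;
    }

    # E.g. sparse but evenly spaced keys: the first probe lands on target
    my @keys = map { $_ * 1000 } 0 .. 38_000;
    print dictionary_search( 17_000, \@keys ), "\n";    # prints 17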

At the extreme best-case end, you have a totally sequential dataset, the first probe will always connect, and you have O(1). Even if the keys are very sparse--say 1 existing to every 100 or 1000 missing--you can still achieve O(1) (or very, very close to it) if the values/gaps are evenly distributed. I.e. (1, 100, 200, ...) or (1, 1000, 2000, ...) etc.

Even if the data is very clumpy, so long as the clumps are roughly equally sized and reasonably evenly distributed across the range, the algorithm will still often outperform a binary search. You can see this quite intuitively when you use a dictionary, where the presence of sections of words with infrequent first characters (j, k, q, v, x, z) doesn't affect the speed of use.

It is only once you get to the other extreme of a dataset like (1..38e6, 78e6) that you approach the worst-case scenario. And the probability of 38e6 values being distributed as 38e6-1 consecutive + 1 outlier 38e6 away from one end is something like 2/38e6! (2 over factorial 38e6). A value so vanishingly small that representing it would take a floating point format with an exponent field far wider than the 11 bits of an IEEE double.
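A back-of-envelope check of just how small, using Stirling's approximation for log10( n! ) (this calculation is mine, not from the thread):

    use strict;
    use warnings;

    # log10( n! ) ~ n * log10( n / e ) + 0.5 * log10( 2 * pi * n )
    my $n   = 38e6;
    my $pi  = 4 * atan2( 1, 1 );
    my $l10 = sub { log( $_[ 0 ] ) / log( 10 ) };

    my $log10_fact = $n * $l10->( $n / exp 1 ) + 0.5 * $l10->( 2 * $pi * $n );
    printf "2/38e6! ~ 10**-%.3g\n", $log10_fact - $l10->( 2 );
    # ~ 10**-2.72e8: hundreds of millions of orders of magnitude below
    # the ~1e-308 floor of a 64-bit double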

In truth, there simply isn't enough information in the OP to decide whether the dictionary algorithm or a binary search will ultimately prove to be the quickest for the OP's problem. And that is the usual case here at the Monastery. We are rarely given sufficient information to be able to suggest a definitively "best solution". What makes this place work is that there are enough people with enough varied experiences that almost every question gets several possibilities for the OP to assess, and it is (can only be) down to them to make the best choice for their application.

In this case, if the OP's file is static and used frequently, then running an overnight job to load it into a DB would be the way to go. If however it is the output of another process that is used infrequently, a search method will make more sense. If the distribution is typically reasonably even, then a dictionary search will likely be fastest. If it is typically highly skewed, then a binary search might win. If the number of search terms were a significantly higher proportion of the total range--say 1e5 or 1e6 rather than 1e3--then building a hash of those and doing a linear pass through the file would probably win; see the sketch below.
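That last approach is straightforward (a sketch; the file name and search terms here are hypothetical stand-ins):

    use strict;
    use warnings;

    my $file  = 'data.txt';                      # hypothetical input file
    my @terms = map { int rand 78e6 } 1 .. 1e5;  # hypothetical search terms

    my %wanted;
    @wanted{ @terms } = ();                      # hash gives O(1) lookups

    # One linear pass through the file, testing each line against the hash
    open my $fh, '<', $file or die "open '$file': $!";
    while( my $line = <$fh> ) {
        chomp $line;
        print "found: $line\n" if exists $wanted{ $line };
    }
    close $fh;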

In the end, only the OP can decide and the best we here can do is offer alternatives.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
