in reply to Re^4: Rapid text searches ( O(log n) time)
in thread Rapid text searches

So O(1) would be considered a sensation in academic circles! IMHO binary search is usually faster than your algorithm. If you want to argue with something like average-case analysis, you should indicate it clearly; worst case is the usual approach.

It was a headline, not a formal thesis. The absence of any further proof or discussion, combined with the "crude demo code", should be a strong indication of the lack of formalism.

I have little time for theoretical correctness. Optimising for the worst case, when the worst case is probabilistically very unlikely, means pessimising the statistically very probable average case. And in the real world, that's nonsense.

For example, as you say, at O(logN) a binary search is the theoretically fastest algorithm for searching ordered data. But...for practical implementations there is a well-known optimisation that reverts to a linear search once the partition size drops below some threshold. The cost of a simple increment-and-test is less than that of the add, divide, index, compare-less-than, compare-greater-than, increment-or-decrement work of each binary chop.

The threshold will vary depending upon the language being used and the relative costs of performing the simple maths and condition testing involved in the binary-chop algorithm. I seem to recall that in hand-crafted assembler it was partitions smaller than 2**7, and in C, those smaller than 2**6. It is probably lower in Perl, but still, reality trumps theory.
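A hybrid along those lines might look something like this in Perl (an untested sketch; the cutoff of 64 is purely illustrative and would need tuning for the language and machine):

    use strict;
    use warnings;

    # Binary search over a sorted array of numbers, switching to a linear
    # scan once the remaining window shrinks below a cutoff.
    # Returns the index of $target, or -1 if it is absent.
    sub hybrid_bsearch {
        my ( $aref, $target, $cutoff ) = @_;
        $cutoff = 64 unless defined $cutoff;   # threshold is machine/language dependent
        my ( $lo, $hi ) = ( 0, $#$aref );

        while ( $hi - $lo > $cutoff ) {        # classic binary chop
            my $mid = int( ( $lo + $hi ) / 2 );
            if    ( $aref->[$mid] < $target ) { $lo = $mid + 1 }
            elsif ( $aref->[$mid] > $target ) { $hi = $mid - 1 }
            else                              { return $mid }
        }
        for my $i ( $lo .. $hi ) {             # cheap linear finish
            return $i if $aref->[$i] == $target;
        }
        return -1;
    }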

And it is that reality that forms the basis of the code I demonstrated. If you can jump directly into the dataset on your first probe and be sufficiently close to your target to be within the threshold for which reverting to a linear search is optimal, then you can beat the theoretical O(logN) by some considerable margin.
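As a rough illustration of that principle (not the demo code from the earlier node; this untested sketch assumes numeric keys held in a sorted array, roughly evenly spread between the first and last values):

    use strict;
    use warnings;

    # One interpolated probe followed by a short linear walk outward.
    # Hypothetical helper for illustration only.
    sub probe_then_scan {
        my ( $aref, $target ) = @_;
        my ( $min, $max ) = ( $aref->[0], $aref->[-1] );
        return -1 if $target < $min or $target > $max;
        return 0  if $min == $max;    # degenerate: all keys identical

        # Estimate the position proportionally from the key's value.
        my $guess = int( $#$aref * ( $target - $min ) / ( $max - $min ) );

        # Walk linearly in whichever direction the estimate missed by.
        if ( $aref->[$guess] <= $target ) {
            $guess++ while $guess < $#$aref and $aref->[$guess] < $target;
        }
        else {
            $guess-- while $guess > 0 and $aref->[$guess] > $target;
        }
        return $aref->[$guess] == $target ? $guess : -1;
    }

How far that walk has to go is exactly what the distribution of the data determines.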

At the extreme best-case end you have a totally sequential dataset: the first probe will always connect and you have O(1). Even if the keys are very sparse--say 1 existing for every 100 or 1000 missing--you can still achieve O(1) (or very, very close to it) if the values/gaps are evenly distributed. I.e. (1, 100, 200, ...) or (1, 1000, 2000, ...) etc.

Even if the data is very clumpy, so long as the clumps are roughly equally sized and reasonably evenly distributed across the range, the algorithm will still often outperform a binary search. You can see this quite intuitively when you use a dictionary, where the presence of sections of words with infrequent first characters (j, k, q, v, x, z) doesn't affect the speed of use.

It is only once you get to the other extreme of a dataset like (1..38e6, 78e6) that you approach the worst-case scenario. And the probability of the 38e6 values being distributed as 38e6-1 consecutive values plus 1 outlier 38e6 away from one end is something like 2/38e6! (2 over factorial 38e6): a value so vanishingly small that it would probably take a floating point format with hundreds of exponent bits (instead of the usual 11) to be able to represent it.

In truth, there simply isn't enough information in the OP to decide whether the dictionary algorithm or a binary search will ultimately prove to be the quicker for the OP's problem. And that is the usual case here at the Monastery: we are rarely given sufficient information to be able to suggest a definitively "best solution". What makes this place work is that there are enough people with enough varied experience that almost every question gets several possibilities for the OP to assess, and it is (can only be) down to them to make the best choice for their application.

In this case, if the OP's file is static and used frequently, then running an overnight job to load it into a DB would be the way to go. If however it is the output of another process and is used infrequently, a search method makes more sense. If the distribution is typically reasonably even, then a dictionary search will likely be fastest. If it is typically highly skewed, then a binary search might win. If the number of search terms were a significantly higher proportion of the total range--say 1e5 or 1e6 rather than 1e3--then building a hash of those and doing a linear pass through the file would probably win.
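That last approach might look something like this (untested; the file names and the "key is the first whitespace-separated field" layout are assumptions purely for illustration):

    use strict;
    use warnings;

    # Build a hash of the search terms, then make one sequential pass
    # over the big file, printing any record whose key is wanted.
    my %wanted;
    open my $terms_fh, '<', 'terms.txt' or die "terms.txt: $!";
    while ( my $t = <$terms_fh> ) {
        chomp $t;
        $wanted{$t} = 1;
    }
    close $terms_fh;

    open my $data_fh, '<', 'bigfile.txt' or die "bigfile.txt: $!";
    while ( my $line = <$data_fh> ) {
        my ($key) = split ' ', $line;
        print $line if defined $key and $wanted{$key};
    }
    close $data_fh;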

In the end, only the OP can decide and the best we here can do is offer alternatives.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^6: Rapid text searches ( O(log n) time)
by LanX (Saint) on Jun 09, 2009 at 01:59 UTC
    Frankly, I didn't have the motivation to reply earlier, since I clearly can't keep up with you rhetorically. (My degree in mathematics isn't enough; one in English literature seems to be necessary.)

    You said "a lack of formalism" from your side was evident, unfortunately O(1) is a very well defined formula, so why do you even introduce it?

    Did you notice that in the 20 lines of the OP's sample, 5 entries were already skipped? So a 25% "gap factor" is the OP's lower bound.

    With 1.5 GB of data this "best case" results in about 35 million entries, with roughly 9 million skipped. And then you're proposing that an algorithm which goes linear after the first probe is O(1)...?

    You're now arguing with "realistic" examples of evenly distributed gaps...! For "real-world conditions", better try something like a Gaussian distribution.

    Then you claim my worst-case scenario isn't realistic... (actually, I have to admit, it's not even near the worst case). But look: a worst case is just the extreme point of a much bigger set of very bad cases.

    Just try adding ONE single line starting with 0 to the OP's sample to get exactly one of these very bad cases. And now tell me about the improbability of such data at the beginning of a file... a case where your alleged "fixed number of operations" algorithm - that's what O(1) means - will need an eternity to succeed in a 1.5 GB file.

    That's why worst-case scenarios are investigated: with them, algorithms remain stable under any kind of unforeseen conditions. And worst cases are much easier to calculate than "average cases". (And surely more reliable than the airspeed sensors of Airbuses over the middle of the Atlantic, where the engineers had no doubt covered "all probable cases".)

    Just read Wikipedia and the references listed there: binary search is stable in both the average and the worst case. And indexing methods can be much faster, at the cost of some added space complexity.
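    For instance, one such indexing method (a rough, untested sketch; the file name and record layout are assumptions for illustration) records each key's byte offset in a single pass, so that every later lookup is just a hash probe plus a seek:

        use strict;
        use warnings;

        # One pass to record the byte offset of every key; the %offset
        # hash is the space we trade for lookup speed.
        my %offset;
        open my $fh, '<', 'bigfile.txt' or die "bigfile.txt: $!";
        while ( 1 ) {
            my $pos  = tell $fh;
            my $line = <$fh>;
            last unless defined $line;
            my ($key) = split ' ', $line;
            $offset{$key} = $pos if defined $key;
        }

        # Later, fetch the record for a given key: hash probe plus seek.
        sub fetch {
            my ($key) = @_;
            return undef unless exists $offset{$key};
            seek $fh, $offset{$key}, 0;    # 0 == SEEK_SET
            return scalar <$fh>;
        }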

    PLEASE NOTE: I'm NOT saying that interpolation search (executed recursively to the end) is a bad approach; I myself proposed to "calculate the right lines" hours before you even posted! It's an obvious and good idea in well-known cases.

    But, sorry, with all due respect, stubbornly defending the claim that it's O(1) is a mixture of hubris and nonsense, which leaves me speechless*.

    What's next? Why don't you go to a conference of physicists and claim you've discovered cold fusion of hydrogen! (It's simple: you just added oxygen and heated it up...) And when people politely object, you'll insist that their definition of "cold fusion" is too formalistic and just not practical in real life!

    Prove you're right and publish it and I'll happily apologize to you.

    Cheers Rolf

    (*) ... from now on, I promise!

Re^6: Rapid text searches ( O(log n) time)
by LanX (Saint) on Jun 07, 2009 at 20:34 UTC