in reply to Re^2: Rapid text searches ( O(log n) time)
in thread Rapid text searches

The headline claim for O(1) time is based on comparison with Perl's hashes which also do a linear search down a bucket chain to resolve hash clashes, but are generally rated as O(1).

For your "worst case scenario" to be so, it would require the file to contain 38e6-1 consecutive numbers and a single outlier 38e6 away from one end. My conclusion was that this is an unlikely scenario for real data; my assumption is a reasonably even distribution of the values through the range.

Furthermore you're only comparing the entries from the first column aren't you?

I do not understand the question.

I take the OP's "Let say I need to know the pair for 1101077781160" to mean that, given the input '1101077781160', he requires the other two values on that line, e.g. '1101077783656 bothChaff'.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^4: Rapid text searches ( O(log n) time)
by LanX (Saint) on Jun 07, 2009 at 01:35 UTC
    Sorry, IMHO O() is usually calculated for the worst case; otherwise "traveling salesman" and similar problems wouldn't be NP-hard, even though there are well-known algorithms which are "usually" polynomial!

    You may want to calculate the complexity depending on what you called $factor: with $factor==1 it's O(1); with $factor==1/2 it's O(n).

    Anyway with binary search it's O(log n), which is the theoretical optimum for this class of problems. So O(1) would be considered a sensation in academic circles! IMHO binary search is mostly faster than your algorithm.

    If you want to argue with something like average-case analysis you should indicate it clearly, since worst case is the usual approach.

    The average case is much more complicated to define and to calculate. You have no clue what the average case really looks like, without far more input from the OP.

    BTW: You can hardly compare this with collisions in hashes, since those should be the exception, not the rule.

    Cheers Rolf

      So O(1) would be considered a sensation in academic circles! IMHO binary search is mostly faster than your algorithm. If you want to argue with something like average-case analysis you should indicate it clearly, worst case is the usual approach.

      It was a headline, not a formal thesis. The absence of any further proof or discussion, combined with the "crude demo code", should be a strong indication of the lack of formalism.

      I have little time for theoretical correctness. Optimising for the worst case, when the worst case is probabilistically very unlikely, means pessimising the statistically very probable average case. And in the real world, that's nonsense.

      For example, as you say, at O(log N), binary search is theoretically the fastest algorithm for searching ordered data. But for practical implementations there is a well-known optimisation that reverts to a linear search once the partition size drops below some threshold: a simple increment-and-test costs less than the add/divide/index/compare-less/compare-greater/adjust sequence of a full binary-chop step.

      The threshold will vary depending upon the language being used and the relative costs of performing the simple math and condition testing involved in the binary-chop algorithm. I seem to recall that in hand-crafted assembler it was partitions smaller than 2**7, and in C, those smaller than 2**6. It is probably lower still in Perl, but even so, reality trumps theory.
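      A minimal Perl sketch of that hybrid scheme (the threshold of 2**6 is the illustrative C-era figure quoted above; the right value for Perl would have to be benchmarked, and the names are mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hybrid search: binary chop while the partition is large, reverting to
# a linear scan once it shrinks below a threshold.
use constant THRESHOLD => 2**6;

sub hybrid_search {
    my ($aref, $target) = @_;
    my ($lo, $hi) = (0, $#$aref);

    # Binary chop while the partition is still large
    while ($hi - $lo > THRESHOLD) {
        my $mid = int(($lo + $hi) / 2);
        if    ($aref->[$mid] < $target) { $lo = $mid + 1 }
        elsif ($aref->[$mid] > $target) { $hi = $mid - 1 }
        else                            { return $mid }
    }

    # Cheap increment-and-test over the small remaining partition
    for my $i ($lo .. $hi) {
        return $i if $aref->[$i] == $target;
    }
    return -1;    # not found
}

my @sorted = map { $_ * 10 } 0 .. 999;
print hybrid_search(\@sorted, 420), "\n";    # index 42
```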

      And it is that reality that forms the basis of the code I demonstrated. If you can jump directly into the dataset on your first probe and be sufficiently close to your target to be within the threshold for which reverting to a linear search is optimal, then you can beat the theoretical O(log N) by some considerable margin.

      At the extreme best-case end, you have a totally sequential dataset: the first probe will always connect and you have O(1). Even if the keys are very sparse--say 1 existing for every 100 or 1000 missing--you can still achieve O(1) (or very, very close to it) if the values/gaps are evenly distributed, i.e. (1, 100, 200, ...) or (1, 1000, 2000, ...) etc.

      Even if the data is very clumpy, so long as the clumps are roughly equally sized and reasonably evenly distributed across the range, the algorithm will still often outperform a binary search. You can see this quite intuitively when you use a dictionary, where the presence of sections of words with infrequent first characters (j, k, q, v, x, z) doesn't affect the speed of use.
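      A rough Perl sketch of that dictionary-style search as I understand it from this discussion (the function name is mine): one interpolated probe into a sorted array, then a linear scan toward the target.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One interpolated probe, then a linear scan toward the target.
# Near O(1) when values are evenly spread; the scan legs lengthen
# as the distribution skews, which is the crux of this thread.
sub dictionary_search {
    my ($aref, $target) = @_;
    my ($min, $max) = ($aref->[0], $aref->[-1]);
    return -1 if $target < $min or $target > $max;

    # First probe: jump straight to the interpolated position
    my $i = int($#$aref * ($target - $min) / ($max - $min));

    # Scan linearly toward the target from wherever we landed
    $i++ while $i < $#$aref and $aref->[$i] < $target;
    $i-- while $i > 0       and $aref->[$i] > $target;

    return $aref->[$i] == $target ? $i : -1;
}

my @keys = map { $_ * 100 } 0 .. 999;    # evenly spread keys
print dictionary_search(\@keys, 4200), "\n";    # lands dead on: index 42
```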

      It is only once you get to the other extreme of a dataset like (1..38e6, 78e6) that you approach the worst-case scenario. And the probability of the 38e6 values being distributed as 38e6-1 consecutive values plus 1 outlier 38e6 away from one end is something like 2/38e6! (2 over factorial 38e6)--a value so vanishingly small that representing it would take a floating-point format with a far wider exponent field than a double's 11 bits.

      In truth, there simply isn't enough information in the OP to decide whether the dictionary algorithm or a binary search will ultimately prove the quicker for the OP's problem. And that is the usual case here at the Monastery: we are rarely given sufficient information to suggest a definitively "best" solution. What makes this place work is that there are enough people with enough varied experience that almost every question gets several possibilities, and it is (can only be) down to the OP to make the best choice for their application.

      In this case, if the OP's file is static and used frequently, then running an overnight job to load it into a DB would be the way to go. If, however, it is the output of another process and is used infrequently, a search method makes more sense. If the distribution is typically reasonably even, then a dictionary search will likely be fastest; if it is typically highly skewed, then a binary search might win. And if the number of search terms were a significantly higher proportion of the total range--say 1e5 or 1e6 rather than 1e3--then building a hash of those terms and doing a linear pass through the file would probably win.
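      For that last option, a sketch of the hash-plus-linear-pass idea (the first data line is the pair quoted upthread; the other lines and the in-memory filehandle are invented for illustration, where real code would read the OP's 1.5 GB file):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a hash of the wanted keys, then make one linear pass over the
# data, filling in the pair for each key we were asked about.
my $data = <<'END';
1101077781160 1101077783656 bothChaff
1101077781161 1101077783657 bothChaff
1101077781170 1101077783670 bothChaff
END

my %wanted = map { $_ => undef } qw( 1101077781160 1101077781170 );

open my $fh, '<', \$data or die $!;    # stand-in for opening the real file
while (<$fh>) {
    my ($key, $val, $label) = split;
    $wanted{$key} = [ $val, $label ] if exists $wanted{$key};
}
close $fh;

print "$_ => @{ $wanted{$_} }\n" for sort keys %wanted;
```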

      In the end, only the OP can decide and the best we here can do is offer alternatives.


        Frankly, I didn't have the motivation to reply earlier, since I certainly cannot cope with you. (My degree in mathematics isn't enough; one in English literature seems necessary.)

        You said a "lack of formalism" on your side was evident; unfortunately, O(1) is a very well-defined notation, so why did you even introduce it?

        Did you notice that in 20 lines of the OP's sample, 5 entries were already skipped? So a 25% "gap factor" is the OP's lower bound.

        With 1.5 GB of data this "best case" results in about 35 million entries with roughly 9 million skipped. And then you're proposing that an algorithm which goes linear after the first step is O(1)...?

        You're now arguing with "realistic" examples of equally distributed gaps...! For "real world conditions", better try something like a Gaussian distribution.

        Then you claim my worst case scenario isn't realistic ... (actually I have to admit it's not even near the worst case). But look, a worst case is just the extremal point of a much bigger set of very bad cases.

        Just try adding ONE single line starting with 0 to the OP's sample to get exactly one of these very bad cases. And now tell me about the improbability of such data at the beginning of a file... a case where your alleged "fixed number of operations" algorithm--that's what O(1) means--will need an eternity to succeed in a 1.5 GB file.
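        To put a number on that, here is a small Perl sketch (the names and the scaled-down data are mine) counting how far a single interpolated probe lands from the target once one outlier sits at the start:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Counts the linear steps needed after one interpolated probe, to show
# how a single outlier at the start wrecks the position estimate.
sub probe_distance {
    my ($aref, $target) = @_;
    my ($min, $max) = ($aref->[0], $aref->[-1]);
    my $probe = int($#$aref * ($target - $min) / ($max - $min));

    # Actual index of the target, found by exhaustive scan
    my ($actual) = grep { $aref->[$_] == $target } 0 .. $#$aref;
    return abs($actual - $probe);
}

my @even = map { $_ * 2 } 0 .. 9_999;            # evenly spread keys
my @skew = (0, map { $_ + 10_000 } 1 .. 9_999);  # one outlier at 0

print probe_distance(\@even, 4000), "\n";     # 0    -- probe lands dead on
print probe_distance(\@skew, 15_000), "\n";   # 2499 -- thousands of steps
```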

        That's why worst-case scenarios are investigated: with them, algorithms are stable under any kind of unforeseen conditions. And worst cases are much easier to calculate than "average cases". (And surely more reliable than the speed sensors of Airbuses over the middle of the Atlantic, where the engineers no doubt covered "all probable cases".)

        Just read Wikipedia and the references listed there: binary search is stable in both the average and the worst case. And indexing methods can be much faster, at the cost of some added space complexity.

        PLEASE NOTE: I'm NOT saying that interpolation search (executed recursively to the end) is a bad approach; I myself proposed "calculate the right lines" hours before you even posted! It's an obvious and good idea in well-known cases.

        But, sorry, with all respect, stubbornly defending that it's O(1) is a mixture of hubris and nonsense, which leaves me speechless *.

        What's next? Why don't you go to a conference of physicists and claim you discovered cold fusion of hydrogen? (It's simple: you just added oxygen and heated it up...) And when people politely object, you'll insist that their definition of "cold fusion" is too formalistic and just not practical in real life!

        Prove you're right, publish it, and I'll happily apologize to you.

        Cheers Rolf

        (*) ... from now on, I promise!

Re^4: Rapid text searches ( O(log n) time)
by JavaFan (Canon) on Jun 10, 2009 at 09:21 UTC
    The headline claim for O(1) time is based on comparison with Perl's hashes which also do a linear search down a bucket chain to resolve hash clashes, but are generally rated as O(1).
    That is because it's expected that the chains are short, and independent of the number of keys in the hash.

    But your claim basically means that any sorted data structure without duplicates could, after linear preprocessing to line up the elements, be searched in O(1) time. Your algorithm only achieves O(1) search time if the data is spread fairly uniformly between its extremes--and that's an exceptional case.