in reply to Re: Rapid text searches ( O(1) space and time)
in thread Rapid text searches

It's been a long time since I learned big O notation, but IMHO your code can only be O(1) if there are no gaps within the data.

Just consider the worst case of IDs within (1, (1/2 * m) .. m), i.e. n = 1/2 m + 2, and you're looking for the second entry x = m/2, right after the maximum possible gap. You'll start by approximating 3/4 m and continue to seek linearly down to m/2. That's roughly 1/2 n steps, which means O(n).

If you improve your code to recursively interpolate the next position within the narrowed range, instead of seeking linearly, you'll get O(log(n)) as with binary search.
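E.g. a rough sketch in Perl of what I mean. The sub name and @ids (a sorted array of the numeric keys) are just made up for illustration, this is not your file handling:

  # Interpolation search over a sorted arrayref of numeric IDs.
  # Estimates the position from the value, then narrows the range,
  # instead of stepping linearly from the first estimate.
  sub interp_search {
      my ($ids, $target) = @_;
      my ($lo, $hi) = (0, $#$ids);
      while ($lo <= $hi && $target >= $ids->[$lo] && $target <= $ids->[$hi]) {
          my $pos = $lo + int( ($target - $ids->[$lo]) * ($hi - $lo)
                               / (($ids->[$hi] - $ids->[$lo]) || 1) );
          return $pos if $ids->[$pos] == $target;
          if   ($ids->[$pos] < $target) { $lo = $pos + 1 }
          else                          { $hi = $pos - 1 }
      }
      return -1;    # not found
  }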

Furthermore, you're only comparing the entries from the first column, aren't you?

Cheers Rolf

Re^3: Rapid text searches ( O(log n) time)
by BrowserUk (Patriarch) on Jun 06, 2009 at 22:08 UTC

    The headline claim for O(1) time is based on comparison with Perl's hashes, which also do a linear search down a bucket chain to resolve hash clashes but are generally rated as O(1).

    For your "worst case scenario to be so, it would require the file to contain 38e6-1 consequetive numbers and a single outlier 38e6 away from one end. My conclusion was that is an unlikely scenario for real data, My assumption is for a reasonably even distribution of the values through the range.

    Furthermore, you're only comparing the entries from the first column, aren't you?

    I do not understand the question?

    I take the OP's "Let say I need to know the pair for 1101077781160" to mean that, given the input '1101077781160', he requires the other two values on that line. E.g. '1101077783656 bothChaff'.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Sorry, IMHO O() is usually calculated for the worst case; otherwise "traveling salesman" and similar problems wouldn't be NP-hard, even though there are well-known algorithms which are "usually" polynomial!

      You may want to calculate the complexity depending on what you called $factor: with $factor==1 it's O(1), with $factor==1/2 it's O(n).

      Anyway, with binary search it's O(log n), which is the theoretical optimum for this class of problems. So O(1) would be considered a sensation in academic circles! IMHO binary search is usually faster than your algorithm.

      If you want to argue from something like average-case analysis, you should indicate it clearly, since worst case is the usual approach.

      The average case is much more complicated to define and to calculate. You have no clue what the average case really looks like, without far more input from the OP.

      BTW: You can hardly compare this to collisions in hashes, since they should be the exception, not the rule.

      Cheers Rolf

        So O(1) would be considered a sensation in academic circles! IMHO binary search is usually faster than your algorithm. If you want to argue from something like average-case analysis, you should indicate it clearly; worst case is the usual approach.

        It was a headline, not a formal thesis. The absence of any further proof or discussion, combined with the "crude demo code", should be a strong indication of the lack of formalism.

        I have little time for theoretical correctness. Optimising for the worst case, when the worst case is probabilistically very unlikely, means pessimising the statistically very probable average case. And in the real world, that's a nonsense.

        For example, as you say, at O(logN), a binary search is the theoretically fastest algorithm for searching ordered data. But...for practical implementations there is a well-known optimisation that reverts to a linear search once the partition size drops below some threshold. The cost of a simple increment-and-test is less than that of the add-divide-index-test-less-than-test-more-than-incr-or-decr-etc. of each chop.

        The threshold will vary depending upon the language being used and the relative costs of performing the simple math and condition testing involved in the binary chop algorithm. I seem to recall that in hand-crafted assembler it was partitions smaller than 2**7, and in C, those less than 2**6. It is probably lower in Perl, but still, reality trumps theory.
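        Purely to illustrate the shape of that optimisation (this is not my demo code; the sub name, @ids as a sorted arrayref of the keys, and the threshold of 64 are all guesses for illustration, not measurements):

          # Binary chop over a sorted arrayref, reverting to a linear scan
          # once the partition drops below a (guessed) threshold.
          my $THRESHOLD = 64;

          sub chop_search {
              my ($ids, $target) = @_;
              my ($lo, $hi) = (0, $#$ids);
              while ($hi - $lo > $THRESHOLD) {
                  my $mid = ($lo + $hi) >> 1;
                  if ($ids->[$mid] < $target) { $lo = $mid + 1 }
                  else                        { $hi = $mid     }
              }
              # The remaining partition is small; a simple scan is cheaper
              # than continuing to chop.
              for my $i ($lo .. $hi) {
                  return $i if $ids->[$i] == $target;
              }
              return -1;    # not found
          }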

        And it is that reality that forms the basis of the code I demonstrated. If you can jump directly into the dataset on your first probe and be sufficiently close to your target to be within the threshold for which reverting to a linear search is optimal, then you can beat the theoretical O(logN) by some considerable margin.

        At the extreme best-case end, with a totally sequential dataset, the first probe will always connect and you have O(1). Even if the keys are very sparse--say 1 existing to every 100 or 1000 missing--you can still achieve O(1) (or very, very close to it) if the values/gaps are evenly distributed. I.e. (1, 100, 200, ...) or (1, 1000, 2000, ...) etc.
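        The shape of that idea, sketched rather than lifted from my demo code (the names are made up; assume a sorted arrayref of the numeric keys):

          # One interpolated probe, then a short linear walk to the target.
          # With evenly distributed keys the probe lands on or next to it.
          sub probe_then_scan {
              my ($ids, $target) = @_;
              my ($min, $max) = ($ids->[0], $ids->[-1]);
              my $i = int( ($target - $min) / (($max - $min) || 1) * $#$ids );
              $i = 0      if $i < 0;
              $i = $#$ids if $i > $#$ids;
              if ($ids->[$i] <= $target) {
                  $i++ while $i < $#$ids && $ids->[$i] < $target;
              }
              else {
                  $i-- while $i > 0 && $ids->[$i] > $target;
              }
              return $ids->[$i] == $target ? $i : -1;
          }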

        Even if the data is very clumpy, so long as the clumps are roughly equally sized and reasonably evenly distributed across the range, the algorithm will still often outperform a binary search. You can see this quite intuitively when you use a dictionary, where the presence of sections of words with infrequent first characters (j, k, q, v, x, z) doesn't affect the speed of use.

        It is only once you get to the other extreme of a dataset like (1..38e6, 78e6) that you approach the worst case scenario. And the probability of the 38e6 values being distributed as 38e6-1 consecutive + 1 outlier 38e6 away from one end is something like 2/38e6! (2 over factorial 38e6). A value so vanishingly small that it would probably take a floating point format with far more exponent bits than the 11 in a double to be able to represent it.

        In truth, there simply isn't enough information in the OP to decide whether the dictionary algorithm or a binary search will ultimately prove to be the quickest for the OP's problem. And that is the usual case here at the Monastery. We are rarely given sufficient information to be able to suggest a definitively "best solution". What makes this place work is that there are enough people with enough varied experiences that almost every question gets several possibilities for the OP to assess, and it is (can only be) down to them to make the best choice for their application.

        In this case, if the OP's file is static and used frequently, then running an overnight job to load it into a DB would be the way to go. If, however, it is the output of another process and is used infrequently, a search method makes more sense. If the distribution is typically reasonably even, then a dictionary search will likely be fastest. If it is typically highly skewed, then a binary search might win. If the number of search terms were a significantly higher proportion of the total range--say 1e5 or 1e6 rather than 1e3--then building a hash of those and doing a linear pass through the file would probably win.
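        For completeness, a minimal sketch of that last approach (the file name 'data.txt', @ARGV as the source of the search keys, and the whitespace-separated key-first line layout are all assumptions for illustration):

          # Hash the search keys, then make a single linear pass over the file.
          use strict;
          use warnings;

          my @search_terms = @ARGV;                       # however the keys arrive
          my %wanted = map { $_ => undef } @search_terms;

          open my $fh, '<', 'data.txt' or die "open 'data.txt': $!";
          while ( my $line = <$fh> ) {
              my ($key) = split ' ', $line;
              $wanted{$key} = $line if exists $wanted{$key};
          }
          close $fh;
          # %wanted now maps each key that was found to its full line.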

        In the end, only the OP can decide and the best we here can do is offer alternatives.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      The headline claim for O(1) time is based on comparison with Perl's hashes, which also do a linear search down a bucket chain to resolve hash clashes but are generally rated as O(1).
      That is because it's expected that the chains are short, and independent of the number of keys in the hash.

      But your claim basically means that any sorted data structure that doesn't have duplicates could, after linear preprocessing to line up the elements, be searched in O(1) time. Your algorithm only achieves O(1) search time if the data is pretty uniformly spread between its extremes. But that's an exceptional case.