in reply to Re^2: performance issues with grepping large sorted arrays
in thread performance issues with grepping large sorted arrays

I absolutely agree that a binary search would be faster if you have a sorted array (though sorting the array in the first place would also take quite some time, and I was under the impression that the OP would only be sorting it in an attempt to speed things up).

As to "the worst case is still O(n) and the average case is O(n/2) which simplifies to O(n)": I think in most real-life situations you're very unlikely to encounter the worst-case scenario (in the OP's case, it would mean that every $key being looked up matches the last element of the array, or isn't found at all). Sure, ultimately it's a statistical issue, but in the typical case there will be some improvement from short-circuiting the search. That's a moot point anyway, as there are even better alternatives... (like hash tables, with constant-time lookup on average)
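To illustrate what I mean by short-circuiting, here's a minimal sketch (just illustration code; @array and $key stand in for the OP's actual data):

    # grep always scans the entire array, even after a match has been found:
    my $found_grep = grep { $_ eq $key } @array;   # always n comparisons

    # An explicit loop can bail out as soon as it hits the key:
    my $found = 0;
    for my $element (@array) {
        if ($element eq $key) {
            $found = 1;
            last;   # short-circuit: about n/2 comparisons on average
        }
    }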


Re^4: performance issues with grepping large sorted arrays
by gam3 (Curate) on Jul 24, 2007 at 14:25 UTC
    The point is: grep is O(n), the short-circuit for loop is O(n/2), and O(n) == O(n/2).
    -- gam3
    A picture is worth a thousand words, but takes 200K.
      I'd like to note that gam3 is describing the average-case behavior of the short-circuited linear search here, not the worst case.

      O(n/2) is as good as you can expect a short-circuited linear search to do on average. The absolute best case is, of course, O(1), if you happen to find what you want at the beginning of the list. The worst case is O(n). All of this simplifies to O(n).

      Binary search is much more efficient (save for very small lists, where the setup overhead dominates). It is O(log2(n)). For 800k elements, you're talking about linear search taking 800k comparisons, short-circuited linear search 400k on average (worst case 800k), and binary search about 20 comparisons. There's not much to argue here.
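      A minimal sketch of such a binary search (generic illustration code, not the OP's; it assumes a numerically sorted array -- use lt/gt instead of </> for string keys):

          sub binary_search {
              my ($aref, $key) = @_;
              my ($lo, $hi) = (0, $#$aref);
              while ($lo <= $hi) {
                  my $mid = int(($lo + $hi) / 2);                   # midpoint of remaining range
                  if    ($aref->[$mid] < $key) { $lo = $mid + 1 }   # key is in the upper half
                  elsif ($aref->[$mid] > $key) { $hi = $mid - 1 }   # key is in the lower half
                  else                         { return 1 }         # found it
              }
              return 0;                                             # not found
          }

      For 800,000 elements the loop runs at most about 20 times.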

      In this particular thread, the sort followed by 1000 searches is still going to be blown away by inverting the loop and not doing any searching at all. Such is the power of the hash lookup.
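      Roughly what that inversion looks like (sketch only; @keys and @big_array are placeholders for the OP's 1000 lookup keys and 800k-element array):

          my %wanted = map { $_ => 1 } @keys;      # index the 1000 keys once
          for my $element (@big_array) {           # a single pass over the 800k elements
              if ($wanted{$element}) {             # O(1) hash lookup, no searching
                  # ... handle the match ...
              }
          }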
      The point is: ...

      If you think that this is The point, could you elaborate on its practical relevance with respect to the statement I originally (implicitly) made? That statement was simply that skipping unnecessary iterations/comparisons is faster (in the typical case) than doing full iterations every time (as grep does). This holds even though the upper bound expressed by both algorithms' Big-O classification is the same. I never made any claim as to the method being O(n) or not. Also, no doubt a binary search on a sorted list would be faster still...

        Indeed, if there's no other reason to have the array sorted than to do a binary search instead of a short-circuited linear search, there may not be much overall difference between the n/2 (one linear search) and the n * log2(n) + log2(n) (sort plus one binary search).

        In this case, though, we're talking about m searches over n elements, so O(m*n/2) vs. O( n * log2(n) + m * log2(n) ): (1000 * 400k) comparisons for short-circuited linear search with no sort, or (800k * 20 + 1000 * 20) for sort plus binary search. That's 400,000,000 vs. 16,020,000 in the typical case, although the worst case could be 800,000,000 vs. 640,000,020,000 (if the quicksort hits its pathological O(n^2) behavior).

        So between the two, there's better typical time on the quicksort and binary search, but more time stability on the short-circuited linear search with no sort. Pathological cases would hopefully be rare.

        All of which is still, in this case, much ado about nothing. There should be 800,000 hash lookups for comparison and no searching at all. The method with 0.002 times as many comparisons as one alternative and 0.05 times as many as the other is the clear winner.