in reply to Re^11: In-place sort with order assignment (runs)
in thread In-place sort with order assignment

I'm still unconvinced that you are achieving N log U. To me that still implies that you are detecting all duplicates with a single compare each. And as some dups of a single number will end up in different extents of the input, it must take more than one compare per dup.

I'd like to comment upon your code & tests, but per normal, I don't understand them. Were you ever in the military? Cos there's an old saying about the easy way; the hard way; and the army/navy/... way.

The bottom line is that even assuming that your O(N log U) is in the ballpark, for my purposes, it doesn't allow me to economically let the sort perform the uniq'ing I need.

10^9 * log( 10^6 ) is still horrible compared to 10^6 * log( 10^6 ) + 10^9 (uniq'ing via a hash). And that's before you throw in IPC and diskIO costs.
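
For concreteness, here is a minimal Perl sketch of that hash-based approach (illustrative only; the sub name sorted_uniques is hypothetical, not from any code in this thread). The single pass over the data is the 10^9 term, and sorting just the hash keys is the 10^6 * log( 10^6 ) term:

    use strict;
    use warnings;

    # Collect the unique values in one pass, then sort only those.
    sub sorted_uniques {
        my %seen;
        $seen{$_} = 1 for @_;                    # O(N) uniq'ing pass
        return sort { $a <=> $b } keys %seen;    # O(U log U) sort of the uniques
    }

    my @sorted = sorted_uniques( 5, 3, 5, 1, 3, 3, 9 );
    print "@sorted\n";    # 1 3 5 9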


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^13: In-place sort with order assignment (runs)
by salva (Canon) on Sep 22, 2010 at 08:46 UTC
    I believe that O(0.5 * (N + U) * log N) is a good approximation of the complexity of a combined mergesort-unique algorithm.

    The logic behind it is that there are log N merge steps to perform. In the lower steps the probability of duplicates is very low, so the number of comparisons will be proportional to N. On the other hand, in the higher merge steps, the probability of duplicates is very high, and so the number of comparisons will be proportional to U.

    We can optimistically assume that the mean number of operations per step is (U+N)/2, so the total number of operations becomes proportional to (U+N)/2 * log N.

    And obviously, that can be simplified to O(N*log N).
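
    As a rough sketch of that argument (not tye's actual code; merge_uniq is a hypothetical helper, and it assumes each input run is already duplicate-free, which holds if you start from single-element runs), one merge step that drops duplicates as it goes could look like this. Each loop iteration does one comparison and emits one element, so the comparisons in a step are bounded by the size of that step's output, which shrinks towards U in the higher steps:

        use strict;
        use warnings;

        # Merge two sorted, duplicate-free runs into one sorted,
        # duplicate-free run. Each loop iteration does one comparison
        # and emits exactly one element, so comparisons <= output size.
        sub merge_uniq {
            my ($x, $y) = @_;
            my @out;
            my ($i, $j) = (0, 0);
            while ($i < @$x and $j < @$y) {
                my $cmp = $x->[$i] <=> $y->[$j];
                if    ($cmp < 0) { push @out, $x->[$i++] }
                elsif ($cmp > 0) { push @out, $y->[$j++] }
                else             { push @out, $x->[$i++]; $j++ }    # duplicate: keep one copy
            }
            push @out, @{$x}[$i .. $#$x], @{$y}[$j .. $#$y];        # one side is empty here
            return \@out;
        }

        my $merged = merge_uniq( [ 1, 3, 5 ], [ 3, 4, 5, 9 ] );
        print "@$merged\n";    # 1 3 4 5 9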

      And obviously, that [O(0.5 * (N + U) * log N)] can be simplified to O(N*log N).

      And that, (as I've noted here before), is the trouble with big-O. It is such a blunt instrument.

      The moment you try to use it to analyse a particular variation of an algorithm in detail, some bright spark will conclude that your efforts are wrong because your detail reduces to some blunt canonical form.

      But suggest that the variation is no different from (no better than) the classic algorithm, because they have the same big-O canonical reduction, and that same bright spark will tell you that you have to look in detail.

      And they'll start throwing Ds instead of Ns into the mix, but then hoist you by their petard for suggesting there might or might not be some difference between D & N.

      It is obvious that tye's mergesort-unique algorithm will be more efficient than a standard mergesort on data with a high degree of duplication. The fact that in the general case across all datasets, they both reduce to the same big-O formula just goes to show what a nonsense big-O is.


      I believe that O(0.5 * (N + U) * log N) is a good approximation of the complexity of a combined mergesort-unique algorithm.

      While O(0.5 * (N + U) * log N) isn't incorrect, given that 0 <= U <= N, O(0.5 * (N + U) * log N) and O(N log N) are the same. (That is, any function that's in O(0.5 * (N + U) * log N) is also in O(N log N), and vice versa.)

      The logic behind it is that there are log N merge steps to perform. In the lower steps the probability of duplicates is very low, so the number of comparisons will be proportional to N.

      Actually, for O(N log U) to be different from O(N log N), U must be o(N). That is, even if only 1 in 1000 elements is unique, O(N log U) is equivalent to O(N log N) (after all, log U == log(N/1000) == log(N) - log 1000). So, for a set where O(N log U) is different from O(N log N), the chances of two random elements being the same are actually pretty high.

      I think U should even be o(N^ε) for all ε > 0.
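
      A quick numeric illustration of the 1-in-1000 example above (a throwaway snippet, nothing from the thread): with U = N/1000 the ratio N*log(U) / N*log(N) is just (log N - log 1000) / log N, which creeps towards 1 as N grows, so the two bounds only separate asymptotically when U is o(N^ε) for every ε > 0, as noted above:

          # Ratio of N*log(U) to N*log(N) when 1 in 1000 elements is unique.
          for my $exp (6, 9, 12, 15) {
              my $N = 10 ** $exp;
              my $U = $N / 1000;
              printf "N = 10^%-2d  N*log(U) / N*log(N) = %.3f\n",
                  $exp, ( $N * log($U) ) / ( $N * log($N) );
          }
          # Prints 0.500, 0.667, 0.750, 0.800 -- heading slowly towards 1.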

        O(0.5 * (N + U) * log N) and O(N log N) are the same

        That's exactly what the last sentence in my previous post said!

        So, for a set where O(N log U) is different from O(N log N), the chances of two random elements to be the same is actually pretty high.

        Only for very degenerate cases where there is an element that appears with probability close to 1.