in reply to Re^7: In-place sort with order assignment (dumb)
in thread In-place sort with order assignment

I'm nearly convinced that eliminating duplicates while sorting ...leads to O(D*log(U)) performance.

Do you have an algorithm in mind? One that is compatible with a file-based merge sort?

I guess if you:

  1. Sort a buffer full; write to spill file eliminating dups as you go; repeat till input read.
  2. Parallel read from spill files, again eliminating dups as you go; until output written.

Then you'd get: O( N/S log N/S * S) + the merge.

So for 10e6 (I'm using log10 instead of log2 for easy math):

Of course, the merge step gets more expensive the more files you have. It is at least O(N/S log S) regardless of the ratio of U to N, so I don't think that you are going to approach O(N log U). Much less O(D Log U).
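A rough sketch of those two steps might look like the following (none of this code is from the thread; the sub name spill_then_merge, the buffer-size parameter, the spill-file naming, and the use of plain line-oriented string records are all illustrative assumptions):

  # A rough sketch of the two steps above: sort each buffer full, drop
  # duplicates while writing the spill file, then do a parallel merge of
  # the spill files that drops duplicates again.
  use strict;
  use warnings;

  sub spill_then_merge {
      my( $in_fh, $out_fh, $buf_size )= @_;

      # Step 1: sort a buffer full; write to spill file eliminating dups.
      my @spills;
      while( ! eof $in_fh ) {
          my @buf;
          while( @buf < $buf_size  &&  defined( my $line= <$in_fh> ) ) {
              chomp $line;
              push @buf, $line;
          }
          my $tmp= "spill." . @spills . ".tmp";
          open my $out, '>', $tmp  or die "open $tmp: $!";
          my $prev;
          for my $rec ( sort @buf ) {
              next if defined $prev  &&  $rec eq $prev;   # dup within this buffer
              print {$out} $rec, "\n";
              $prev= $rec;
          }
          close $out;
          push @spills, $tmp;
      }

      # Step 2: parallel read from the spill files, again eliminating dups.
      my @fhs= map { open my $fh, '<', $_  or die "open $_: $!"; $fh } @spills;
      my @heads= map { scalar <$_> } @fhs;
      chomp @heads;
      my $last;
      while( grep { defined } @heads ) {
          # a heap would beat this linear scan once there are many spill files
          my( $min )= sort grep { defined } @heads;
          print {$out_fh} $min, "\n"  unless defined $last  &&  $min eq $last;
          $last= $min;
          for my $i ( 0 .. $#heads ) {    # advance every file sitting on $min
              next unless defined $heads[$i]  &&  $heads[$i] eq $min;
              $heads[$i]= readline( $fhs[$i] );
              chomp $heads[$i]  if defined $heads[$i];
          }
      }
      unlink @spills;
  }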


Re^9: In-place sort with order assignment (alg)
by tye (Sage) on Sep 21, 2010 at 19:43 UTC
    Do you have an algorithm in mind?

    To quote myself above quoting myself above:

    I'd expect a modified sorting algorithm that eliminates duplicates whenever they "touch".

    Most sorting algorithms boil down to steps of "compare two records" followed by "'move' one of the records". If two records compare equal, don't move one of them, remove one of them. The details of what "remove" means depend on the underlying sort algorithm and data structures being used. Exercise your basic programming skills for each case.

    For an insertion sort, "remove" just means "don't insert".
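    (A quick sketch of that case for concreteness -- not from the original post, and the three-way <=> here just stands in for whatever compare() you would actually use:)

    sub uniqinsertionsort {
        my( @in )= @_;
        my @out;
        REC: for my $rec ( @in ) {
            my $i= 0;
            while( $i < @out ) {
                my $cmp= $out[$i] <=> $rec;
                next REC if 0 == $cmp;      # duplicates "touch": don't insert
                last if 0 < $cmp;           # found the insertion point
                $i++;
            }
            splice @out, $i, 0, $rec;       # keep @out sorted (ascending here)
        }
        return @out;                        # sorted, duplicates gone
    }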

    For quicksort, you add a third type of swap that moves a duplicate pivot to the end of the current partition, and you output a fully sorted (and possibly shortened) partition as soon as you finish with it.

    For in-memory, in-place merge sort: When comparing two items, if they are equal, just decrement the current end pointer, leaving a "don't care" item on the end of that partition. When merging, just skip duplicates and "don't care"s, which leaves a buffer of "don't care" items on the end of the larger partition. When done, the sorted, de-duplicated records are at the start of the list. The added complexity is that sorting can change the "end" of a partition so the recursive function must return a tiny bit of information back up the call stack.

    So, let's throw together a quick implementation of merge sort:

    sub mergesort {
        my( $av, $beg, $end )= @_;
        if( $beg < $end-1 ) {
            my $mid= int( ($beg+$end)/2 );
            mergesort( $av, $beg, $mid );
            mergesort( $av, $mid+1, $end );
            merge( $av, $beg, $mid, $mid+1, $end );
        } elsif( $beg == $end-1 ) {
            my $cmp= compare( $av->[$beg], $av->[$end] );
            if( -1 == $cmp ) {
                @$av[$beg,$end]= @$av[$end,$beg];
            }
        }
    }

    Now let's teach it to eliminate duplicates as it goes, highlighting what we had to add:

    sub uniqmergesort {
        my( $av, $beg, $end )= @_;
        if( $beg < $end-1 ) {
            my $mid= int( ($beg+$end)/2 );
            my $d1= uniqmergesort( $av, $beg, $mid );
            #^^^^^^
            my $d2= uniqmergesort( $av, $mid+1, $end );
            #^^^^^^
            $d1 += uniqmerge( $av, $beg, $mid-$d1, $mid+1, $end-$d2 );
            #^^^^^ ^^^^                      ^^^^              ^^^^
            return $d1+$d2;
            #^^^^^^^^^^^^^^
        }
        if( $beg == $end-1 ) {
            my $cmp= compare( $av->[$beg], $av->[$end] );
            if( -1 == $cmp ) {
                @$av[$beg,$end]= @$av[$end,$beg];
                return 0;
                #^^^^^^^^
            }
            return 1 if 0 == $cmp;
            #^^^^^^^^^^^^^^^^^^^^^^^
        }
        return 0;
        #^^^^^^^^
    }

    Just for completeness, here are the implementations of merge() and uniqmerge(), showing similarly minimal changes (not highlighted):

    sub merge {
        my( $av, $a, $b, $c, $d )= @_;
        my $beg= $a;
        my @i;
        while( $a <= $b && $c <= $d ) {
            my $cmp= compare( $av->[$a], $av->[$c] );
            push @i, $a++ if -1 != $cmp;
            push @i, $c++ if 1 != $cmp;
        }
        @$av[$beg..$d]= @$av[ @i, $a..$b, $c..$d ];
    }

    sub uniqmerge {
        my( $av, $a, $b, $c, $d )= @_;
        my $beg= $a;
        my $end= $d;
        my @i;
        while( $a <= $b && $c <= $d ) {
            my $cmp= compare( $av->[$a], $av->[$c] );
            push @i, $a++ if -1 != $cmp;
            push @i, $c++ if -1 == $cmp;
            $c++, $end-- if 0 == $cmp;
        }
        @$av[$beg..$end]= @$av[ @i, $a..$b, $c..$d ];
        return $d-$end;
    }
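    (Not from the original post: a tiny driver to exercise the routines above. The compare() here is just an assumed stand-in -- the one actually used isn't shown -- and with a plain numeric <=> these conventions appear to come out as a descending sort.)

    use strict;
    use warnings;

    sub compare { return $_[0] <=> $_[1]; }   # assumption: a plain three-way numeric compare

    my @data= map { int rand 20 } 1 .. 50;    # 50 records, plenty of duplicates
    my $dups= uniqmergesort( \@data, 0, $#data );
    my @uniq= @data[ 0 .. $#data - $dups ];   # per the description above, the unique
                                              # records end up at the front of the array
    print "removed $dups duplicates\n";
    print "@uniq\n";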

    Back to quicksort: in the middle of sorting you have a list in memory that looks like the diagram below, and each step involves doing a comparison and then one of the three different swaps shown (each swap must also increment or decrement a variable or two that point to the boundaries between the different classes of items in the current partition):

    (...done...)= items already sorted and (less dups) output
    p= the pivot or a duplicate of it
    b= "before", things known to be "less than" the pivot
    a= "after", things known to be "greater than" the pivot
    u= unsorted items (but fit within the current partition)
    n= next item to compare to our pivot
    (...to-do...)= the partitions to sort after we output this one

    (...done...) b b b b b u u u u u n p a a a a a p p p (...to-do...)
    if n is a "b", swap:   b<------->u
    if n is an "a", swap:            p-a
    if n is a "p", swap:             p a<------->p

    Eventually yielding:

    (...done...) b b b b b b b b p a a a a a a a p p p p (...to-do...)
                |---sort next---|

    Eventually your partitions get small enough that you have something like:

    (...done...) b p a p p (...to-do...)
                |write|

    And you can output "b p a" and move on to the "to-do" items.
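    (For concreteness, here is a sketch of the same "duplicates disappear as soon as they touch the pivot" idea in its simplest, non-in-place form; it is not from the original post and skips the in-place swap bookkeeping described above:)

    sub uniqquicksort {
        my( @items )= @_;
        return @items if @items < 2;
        my $pivot= $items[ rand @items ];
        my( @before, @after );
        for my $item ( @items ) {
            my $cmp= $item <=> $pivot;
            push @before, $item if $cmp < 0;    # a "b"
            push @after,  $item if $cmp > 0;    # an "a"
            # $cmp == 0: a "p"; keep only the one copy in $pivot
        }
        return ( uniqquicksort( @before ), $pivot, uniqquicksort( @after ) );
    }

    The in-place variant described above gets the same effect with the three swaps and the boundary variables instead of the temporary @before/@after lists.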

    (Of course, an external merge sort like what /usr/bin/sort does then combines these in-memory-sorted lists via merge sorting via temporary files, discarding duplicates during the merge step as already touched on.)

    - tye        

      I look forward to seeing your implementation and how it measures up.

Re^9: In-place sort with order assignment (calculations)
by tye (Sage) on Sep 21, 2010 at 20:06 UTC
    so I don't think that you are going to approach O(N log U). Much less O(D Log U).

    N==D (the number of records not excluding duplicates) so your "Much less" makes no sense.

    Your computations involving N and S seem to be wandering around thinking that merge sort is something other than O(N*log(N)) (and ignoring duplicates altogether). So, no insights to be gained from those.

    - tye        

      (and ignoring duplicates altogether).

      Of course it ignores duplicates. There might not be any. As with most such analysis, it also ignores the possibility that the input is already sorted.

      It only addresses your remark that duplicates could be removed on output to spill files, as well as during the merge operation.

      It would be interesting to see what difference your modifications to the classic algorithm would make in use.

      Your computations involving N and S seem to be wandering around thinking that merge sort is something other than O(N*log(N))

      Hm. If there was no scope for achieving better than O(N log N) with your "early de-duplications", then this discussion would be entirely moot.

      The reason for separating out the merge step is that, if you've already eliminated some duplicates during the spill, the N involved in the merge is reduced from the input N. But the unquantifiable bit is how many of the duplicates were eliminated.

      As far as I can see, the only way you could achieve O(N log U) is if you could eliminate the duplicates without ever performing a comparison to find them. Seeing is believing.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I look forward to seeing your implementation

        I already posted a working implementation.

        Here are some runs from a little wrapper for that:

        Below I summarize those timings of sorting 1e6 records, each batch having different amounts of duplication. I output no records, to eliminate "reduced I/O" as a contributing factor. This means that a mergesort() version with output would have to add more overhead to skip duplicates during output, increasing the advantage of uniqmergesort(). A file-based external merge sort (like /usr/bin/sort) would likely show an even larger gain due to reduced I/O to the intermediate files -- since I/O usually takes much more time than in-memory operations like "compare" and "move", and the I/O reduction would be by a factor closer to D/U than to the smaller, theorized log(D)/log(U) factor for the over-all speed-up of the in-memory sorting.

        uniqmergesort   mergesort
        52.6s (~1e6)    54.1s (1e6)
        44.9s (~1e5)    49.2s
        37.8s (10k)     48.1s   (third line, described below)
        30.3s (1k)      47.0s
        23.4s (100)     42.1s
        16.5s (10)      40.4s
        11.1s (1)       37.5s

        So, sorting exactly 1e6 records that are roughly 100 copies each of exactly 1e4 unique records (third line above) shows uniqmergesort() taking about 30% less time than the cases without duplicates (which is close to log(1e6)/log(1e4), as I speculated). Plain mergesort() only gets a 15% speed-up from those duplicates.

        My uniqmergesort() even runs slightly faster than my mergesort() in the case with as few duplicates as I could manage with a naive use of rand() (not on Windows, where rand() can only give 32k different values). So the small added complexity certainly doesn't add up to much in run-time.
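        (Not the wrapper referred to in this thread -- just a bare-bones sketch of how that kind of comparison could be produced, assuming the mergesort()/uniqmergesort() routines above are in scope; the record count, the rand() pools, and the compare() stub are all made up for illustration.)

        use strict;
        use warnings;
        use Time::HiRes qw( time );

        sub compare { return $_[0] <=> $_[1]; }     # assumed three-way compare

        my $n= 1_000_000;
        for my $pool ( 1e6, 1e4, 1e2 ) {            # pool size roughly bounds U
            my @a= map { int rand $pool } 1 .. $n;
            my @b= @a;                              # identical input for both sorts

            my $t0= time;
            mergesort( \@a, 0, $#a );
            my $t1= time;
            my $dups= uniqmergesort( \@b, 0, $#b );
            my $t2= time;

            printf "U<=%g: mergesort %.1fs, uniqmergesort %.1fs (%d dups removed)\n",
                $pool, $t1 - $t0, $t2 - $t1, $dups;
        }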

        Of course it ignores duplicates. There might not be any.

        You might as well say that we might not have any records to sort and so can't start out with a given N. We must be given D records having only U unique values if we want to make calculations about how different levels of duplication can speed things up. "There might not be any [duplicates]" would only apply to the case of D==U. If you don't consider U < D then of course the calculations can't show a speed improvement (or else I'd have invented a "faster than O(NlogN) sort").

        Hm. If there was no scope for achieving better than O(N log N) with your "early de-duplications", then this discussion would be entirely moot.

        Which is why your calculations that ignore duplicates are pointless, as I said. If you finished them, you'd only arrive at "merge sort is O(N*log(N))", as is well established.

        It only addresses your remark that duplicates could be removed on output to spill files, as well as during the merge operation.

        The calculations would only address that if the calculations took into account an assumption of a given amount of duplication.

        And, to be clear, your re-statement leaves out the removal of duplicates during the in-memory sorting, before anything is even written to a spill file -- since duplicates "touch" (are compared) before then, of course. That is the algorithm I noted in my second post, showed one full implementation of above, and described two other implementations of in much less detail. So if that still hasn't sunk in, then you might want to review a little.

        Here is the full wrapper including the previously posted sort implementations for anybody who wants to play with it further.

        - tye