in reply to Re^5: In-place sort with order assignment (-u)
in thread In-place sort with order assignment

Is "/usr/bin/sort -u" really so dumb as to sort all of the duplicates and then only eliminate them at the end?
Yes.

A little history. Not so long ago, memory was expensive. Computers didn't have multiple gigabytes of RAM available. At best, they had a few megabytes, shared among dozens, if not hundreds, of users and their programs. sort was written with this mindset: sorting huge amounts of data without consuming heaps of memory. To eliminate duplicates without first sorting, you need something like a hash - which means either keeping all the data in memory or using really complex structured files.

A lower bound would be O(D+U*log(U)) and I suspect that is a pretty close bound.
Can you eliminate duplicates in ο(N * log N) worst case? (That's little-oh, not big-oh.) If not, then O(D * log D) is a tight lower bound.

Re^7: In-place sort with order assignment (dumb)
by tye (Sage) on Sep 21, 2010 at 16:20 UTC
    A little history.

    Thanks for the condescension.

    To eliminate duplicates without first sorting, you need something like a hash

    A little much-more-recent history that you appear to have completely missed:

    I'd expect a modified sorting algorithm that eliminates duplicates whenever they "touch".

    Not to mention that I already touched on the problem with "something like a hash".

    Some less recent history: The earliest versions of 'sort' lacked the '-u' option. Of course, one could use "sort | uniq" to discard duplicates. So somebody implemented the "-u" option. I stand by calling it "so dumb" (or "so very lazy") to implement "-u" by only filtering the output at the very last step.

    Can you eliminate duplicates in ο(N * log N) worst case? (That's little-oh, not big-oh.) If not, then O(D * log D) is a tight lower bound.

    I already noted that sorting D exact duplicates is O(D) without a single change to the algorithm. So in the extreme case of U==1, O(D*log(D)) is obviously not a lower bound. Having spent more time considering this from different angles, I'm nearly convinced that eliminating duplicates while sorting (not the false dichotomy of "either before or after" that you propose) leads to O(D*log(U)) performance.
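    As a quick sanity check on that claim, here is a small sketch (in Python, purely for illustration; `uniq_merge_sort` and the comparison counter are my own constructions, not anything from /usr/bin/sort) of a merge sort that drops a duplicate the moment two equal records touch. With D records drawn from only U distinct keys, the comparison count lands near D*log2(U), well under the D*log2(D) of a plain sort:

```python
import random
from math import log2

COMPARES = 0

def compare(a, b):
    """Three-way comparison that counts how often it is called."""
    global COMPARES
    COMPARES += 1
    return (a > b) - (a < b)

def uniq_merge_sort(items):
    """Merge sort that eliminates a duplicate the moment two equal
    records "touch" during a merge - not before or after sorting."""
    if len(items) < 2:
        return items
    mid = len(items) // 2
    left = uniq_merge_sort(items[:mid])
    right = uniq_merge_sort(items[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        c = compare(left[i], right[j])
        if c < 0:
            out.append(left[i]); i += 1
        elif c > 0:
            out.append(right[j]); j += 1
        else:                 # equal: keep one copy, discard the other
            out.append(left[i]); i += 1; j += 1
    return out + left[i:] + right[j:]

D, U = 20_000, 50         # many records, few distinct keys
data = [random.randrange(U) for _ in range(D)]
result = uniq_merge_sort(data)
assert result == sorted(set(data))
# COMPARES ends up near D*log2(U), well below D*log2(D)
```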

    - tye        

      I'm nearly convinced that eliminating duplicates while sorting ...leads to O(D*log(U)) performance.

      Do you have an algorithm in mind? One that is compatible with a file-based merge sort?

      I guess if you:

      1. Sort a buffer full; write to spill file eliminating dups as you go; repeat till input read.
      2. Parallel read from spill files, again eliminating dups as you go; until output written.

      Then you'd get: O( (N/S) * log(N/S) * S ) + the merge.

      So for 10e6 records (I'm using log10 instead of log2 for easy math):

      • Zero spill files: 10e6 * 7 = 70e6.
      • 10 spill files: 1e6 * 6 * 10 = 60e6 + merge.
      • 100 spill files: 1e5 * 5 * 100 = 50e6 + merge.
      • 1000 spill files: 1e4 * 4 * 1000 = 40e6 + merge.
      • ...

      Of course, the merge step gets more expensive the more files you have. It is at least O(N log S) regardless of the ratio of U to N, so I don't think that you are going to approach O(N log U). Much less O(D log U).
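      For concreteness, the two-pass scheme above might be mocked up like this (a Python sketch; `external_sort_unique`, `buffer_size`, and the use of heapq.merge are my choices, standing in for the in-memory sort buffer and the S-way parallel read):

```python
import heapq
import itertools
import tempfile

def external_sort_unique(records, buffer_size=1000):
    """Two-pass 'sort -u' mock-up: sort and dedup each buffer-full into a
    spill file, then merge the spill files, dropping duplicates again."""
    spills = []
    it = iter(records)
    while True:
        buf = list(itertools.islice(it, buffer_size))
        if not buf:
            break                               # input fully read
        spill = tempfile.TemporaryFile(mode="w+")
        # step 1: sorted run with duplicates already eliminated
        spill.writelines(line + "\n" for line in sorted(set(buf)))
        spill.seek(0)
        spills.append(spill)
    # step 2: S-way merge of the sorted runs...
    merged = heapq.merge(*((line.rstrip("\n") for line in f) for f in spills))
    # ...eliminating duplicates across runs as we go
    unique = [key for key, _group in itertools.groupby(merged)]
    for f in spills:
        f.close()
    return unique
```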

        Do you have an algorithm in mind?

        To quote myself above quoting myself above:

        I'd expect a modified sorting algorithm that eliminates duplicates whenever they "touch".

        Most sorting algorithms boil down to steps of "compare two records" followed by "'move' one of the records". If two records compare equal, don't move one of them, remove one of them. The details of what "remove" means depend on the underlying sort algorithm and data structures being used. Exercise your basic programming skills for each case.

        For an insertion sort, "remove" just means "don't insert".
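        For instance, a minimal sketch of that "don't insert" rule (Python for illustration; the binary search for the insertion point is my choice):

```python
import bisect

def uniq_insertion_sort(records):
    """Insertion sort where "remove" just means "don't insert":
    a record that compares equal to one already placed is skipped."""
    sorted_out = []
    for rec in records:
        pos = bisect.bisect_left(sorted_out, rec)
        if pos == len(sorted_out) or sorted_out[pos] != rec:
            sorted_out.insert(pos, rec)   # new key: insert in order
        # else: duplicate - don't insert
    return sorted_out
```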

        For quicksort, you add a third type of swap that moves a duplicate pivot to the end of the current partition, and you output a fully sorted (and possibly shortened) partition as soon as you finish with it.

        For in-memory, in-place merge sort: When comparing two items, if they are equal, just decrement the current end pointer, leaving a "don't care" item on the end of that partition. When merging, just skip duplicates and "don't care"s, which leaves a buffer of "don't care" items on the end of the larger partition. When done, the sorted duplicates are at the start of the list. The added complexity is that sorting can change the "end" of a partition so the recursive function must return a tiny bit of information back up the call stack.

        So, let's throw together a quick implementation of merge sort:

        sub mergesort {
            my( $av, $beg, $end )= @_;
            if( $beg < $end-1 ) {
                my $mid= int( ($beg+$end)/2 );
                mergesort( $av, $beg, $mid );
                mergesort( $av, $mid+1, $end );
                merge( $av, $beg, $mid, $mid+1, $end );
            } elsif( $beg == $end-1 ) {
                my $cmp= compare( $av->[$beg], $av->[$end] );
                if( -1 == $cmp ) {
                    @$av[$beg,$end]= @$av[$end,$beg];
                }
            }
        }

        Now let's teach it to eliminate duplicates as it goes, highlighting what we had to add:

        sub uniqmergesort {
            my( $av, $beg, $end )= @_;
            if( $beg < $end-1 ) {
                my $mid= int( ($beg+$end)/2 );
                my $d1= uniqmergesort( $av, $beg, $mid );
                #^^^^^^
                my $d2= uniqmergesort( $av, $mid+1, $end );
                #^^^^^^
                $d1 += uniqmerge( $av, $beg, $mid-$d1, $mid+1, $end-$d2 );
                #^^^^^ ^^^^                      ^^^^              ^^^^
                return $d1+$d2;
                #^^^^^^^^^^^^^^
            }
            if( $beg == $end-1 ) {
                my $cmp= compare( $av->[$beg], $av->[$end] );
                if( -1 == $cmp ) {
                    @$av[$beg,$end]= @$av[$end,$beg];
                    return 0;
                    #^^^^^^^^
                }
                return 1 if 0 == $cmp;
                #^^^^^^^^^^^^^^^^^^^^^^^
            }
            return 0;
            #^^^^^^^^
        }

        Just for completeness, here are the implementations of merge() and uniqmerge(), showing similarly minimal changes (not highlighted):

        sub merge {
            my( $av, $a, $b, $c, $d )= @_;
            my $beg= $a;
            my @i;
            while( $a <= $b && $c <= $d ) {
                my $cmp= compare( $av->[$a], $av->[$c] );
                push @i, $a++ if -1 != $cmp;
                push @i, $c++ if 1 != $cmp;
            }
            @$av[$beg..$d]= @$av[ @i, $a..$b, $c..$d ];
        }

        sub uniqmerge {
            my( $av, $a, $b, $c, $d )= @_;
            my $beg= $a;
            my $end= $d;
            my @i;
            while( $a <= $b && $c <= $d ) {
                my $cmp= compare( $av->[$a], $av->[$c] );
                push @i, $a++ if -1 != $cmp;
                push @i, $c++ if -1 == $cmp;
                $c++, $end-- if 0 == $cmp;
            }
            @$av[$beg..$end]= @$av[ @i, $a..$b, $c..$d ];
            return $d-$end;
        }

        Back to quicksort. In the middle of sorting, you have a list in memory laid out as shown below, and each step involves doing a comparison and then one of the three different swaps shown. (Each swap must also increment or decrement a variable or two that point to the boundaries between the different classes of items in the current partition.)

        (...done...)= items already sorted and (less dups) output
        p= the pivot or a duplicate of it
        b= "before", things known to be "less than" the pivot
        a= "after", things known to be "greater than" the pivot
        u= unsorted items (but fit within the current partition)
        n= next item to compare to our pivot
        (...to-do...)= the partitions to sort after we output this one

        (...done...) b b b b b u u u u u n p a a a a a p p p (...to-do...)

        if n is a "b", swap:   b<------->u
        if n is an "a", swap:             p-a
        if n is a "p", swap:               p a<------->p

        Eventually yielding:

        (...done...) b b b b b b b b p a a a a a a a p p p p (...to-do...)
                                     |---sort next---|

        Eventually your partitions get small enough that you have something like:

        (...done...) b p a p p (...to-do...)
                     |write|

        And you can output "b p a" and move on to the "to-do" items.
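        A rough rendering of the same idea (Python for illustration; this sketch copies lists rather than swapping in place, recursing on the "b" and "a" partitions and emitting exactly one copy of each pivot value as its partition completes):

```python
def uniq_quicksort(items, out=None):
    """Quicksort that gathers duplicates of the pivot together and
    outputs a single copy when its partition is finished."""
    if out is None:
        out = []
    if not items:
        return out
    pivot = items[0]
    before = [x for x in items[1:] if x < pivot]   # the "b" items
    after  = [x for x in items[1:] if x > pivot]   # the "a" items
    # items equal to the pivot (the extra "p"s) fall out of both
    # partitions: each costs comparisons but is never copied forward
    uniq_quicksort(before, out)
    out.append(pivot)        # write "b ... p" as soon as it is sorted
    uniq_quicksort(after, out)
    return out
```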

        (Of course, an external merge sort like what /usr/bin/sort does then combines these in-memory-sorted lists via merge sorting via temporary files, discarding duplicates during the merge step as already touched on.)

        - tye        

        so I don't think that you are going to approach O(N log U). Much less O(D log U).

        N==D (the number of records not excluding duplicates) so your "Much less" makes no sense.

        Your computations involving N and S seem to be wandering around thinking that merge sort is something other than O(N*log(N)) (and ignoring duplicates altogether). So, no insights to be gained from those.

        - tye