in reply to Re^10: In-place sort with order assignment (calculations)
in thread In-place sort with order assignment

I look forward to seeing your implementation

I already posted a working implementation.

Here are some runs from a little wrapper for that:

> time perl dupsort.pl u 1e6 "1e9-1-int(rand(1e9))"
999498 unique items sorted.(9) 51.254s sort, 0.891s init.
real 0m52.371s  user 0m52.059s  sys 0m0.284s
> time perl dupsort.pl d 1e6 "1e9-1-int(rand(1e9))"
1000000 total items sorted.(9) 53.015s sort, 0.929s init.
real 0m54.151s  user 0m53.411s  sys 0m0.640s
> time perl dupsort.pl u 1e6 "1e9-1-int(rand(1e9))"
999459 unique items sorted.(9) 51.574s sort, 0.863s init.
real 0m52.634s  user 0m52.119s  sys 0m0.480s
> time perl dupsort.pl d 1e6 "1e9-1-int(rand(1e5))"
1000000 total items sorted.(9) 47.990s sort, 1.017s init.
real 0m49.207s  user 0m48.675s  sys 0m0.512s
> time perl dupsort.pl u 1e6 "1e9-1-int(rand(1e5))"
99998 unique items sorted.(9) 43.938s sort, 0.938s init.
real 0m44.949s  user 0m44.639s  sys 0m0.296s
> time perl dupsort.pl u 1e6 "1e9-1-int(rand(1e4))"
10000 unique items sorted.(9) 36.991s sort, 0.869s init.
real 0m37.921s  user 0m37.398s  sys 0m0.416s
> time perl dupsort.pl d 1e6 "1e9-1-int(rand(1e4))"
1000000 total items sorted.(9) 47.022s sort, 0.893s init.
real 0m48.104s  user 0m47.619s  sys 0m0.464s
> time perl dupsort.pl d 1e6 "1e9-1-int(rand(1e3))"
1000000 total items sorted.(9) 45.959s sort, 0.879s init.
real 0m47.030s  user 0m46.555s  sys 0m0.448s
> time perl dupsort.pl u 1e6 "1e9-1-int(rand(1e3))"
1000 unique items sorted.(9) 29.405s sort, 0.876s init.
real 0m30.344s  user 0m30.106s  sys 0m0.232s
> time perl dupsort.pl u 1e6 "1e9-1-int(rand(1e2))"
100 unique items sorted.(9) 22.611s sort, 0.909s init.
real 0m23.581s  user 0m23.201s  sys 0m0.244s
> time perl dupsort.pl d 1e6 "1e9-1-int(rand(1e2))"
1000000 total items sorted.(9) 41.071s sort, 0.907s init.
real 0m42.182s  user 0m41.619s  sys 0m0.500s
> time perl dupsort.pl d 1e6 "1e9-1-int(rand(1e1))"
1000000 total items sorted.(9) 39.703s sort, 0.886s init.
real 0m40.794s  user 0m39.874s  sys 0m0.524s
> time perl dupsort.pl u 1e6 "1e9-1-int(rand(1e1))"
10 unique items sorted.(9) 15.511s sort, 0.906s init.
real 0m16.479s  user 0m16.305s  sys 0m0.160s
> time perl dupsort.pl u 1e6 "1e9-1-int(rand(1e0))"
1 unique items sorted.(9) 10.219s sort, 0.861s init.
real 0m11.145s  user 0m10.961s  sys 0m0.180s
> time perl dupsort.pl d 1e6 "1e9-1-int(rand(1e0))"
1000000 total items sorted.(9) 36.356s sort, 0.857s init.
real 0m37.483s  user 0m36.978s  sys 0m0.508s

Below I summarize those timings of sorting 1e6 records, each batch having a different amount of duplication. I output no records, to eliminate "reduced I/O" as a contributing factor. This means that a mergesort() version with output would have to add more overhead to skip duplicates during output, increasing the advantage of uniqmergesort(). A file-based external merge sort (like /usr/bin/sort) would likely show an even larger gain due to reduced I/O to the intermediate files, since I/O usually takes much more time than in-memory operations like "compare" and "move", and the I/O reduction would be by a factor closer to D/U than to the smaller, theorized log(D)/log(U) (the factor for the overall speed-up of the in-memory sorting).

uniqmergesort    mergesort
 52.6s (~1e6)     54.1s (1e6)
 44.9s (~1e5)     49.2s
 37.8s (10k)      48.1s    (third line, described below)
 30.3s (1k)       47.0s
 23.4s (100)      42.1s
 16.5s (10)       40.4s
 11.1s (1)        37.5s

So, sorting exactly 1e6 records that are roughly 100 copies each of exactly 1e4 unique records (third line above) shows uniqmergesort() taking about 30% less time than the cases without duplicates -- a speed-up factor of about 1.4, which is close to the log(1e6)/log(1e4) = 1.5 that I speculated. Plain mergesort() only gets a 15% speed-up from those duplicates.
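
(For anybody who wants to redo that arithmetic rather than trust my rounding, here is a trivial sketch; the 52.6s and 37.8s figures are simply the ones from the summary table above, and the snippet is only an illustration:)

#!/usr/bin/perl
# sketch: predicted vs. measured speed-up for the D=1e6, U=1e4 case
# (the two times below are just the ones from the summary table above)
use strict;
use warnings;

my $D = 1e6;    # total records
my $U = 1e4;    # unique records
printf "predicted speed-up  log(D)/log(U)  = %.2f\n", log($D) / log($U);
printf "measured  speed-up  52.6s / 37.8s  = %.2f\n", 52.6 / 37.8;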

My uniqmergesort() even runs slightly faster than my mergesort() in the case with as few duplicates as I could manage with a naive use of rand() (though not on Windows, where rand() can only produce 32k different values). So the small added complexity certainly doesn't add up to much in run-time.

Of course it ignores duplicates. There might not be any.

You might as well say that we might not have any records to sort and so can't start out with a given N. We must be given D records having only U unique values if we want to make calculations about how different levels of duplication can speed things up. "There might not be any [duplicates]" would only apply to the case of D == U. If you don't consider U < D, then of course the calculations can't show a speed improvement (or else I'd have invented a "faster than O(N log N) sort").

Hm. If there was no scope for achieving better than O(N log N) with your "early de-duplications", then this discussion would be entirely moot.

Which is why your calculations that ignore duplicates are pointless, as I said. If you finished them, you'd only arrive at "merge sort is O(N*log(N))", as is well established.

It only addresses your remark that duplicates could be removed on output to spill files, as well as during the merge operation.

The calculations would only address that if they took into account an assumed amount of duplication.

And, to be clear, your re-statement leaves out the removal of duplicates during the in-memory sorting, before anything is even written to a spill file -- since duplicates "touch" (are compared) before then, of course. That is the algorithm I noted in my second post: I showed one full implementation of it above and described two other implementations in much less detail. So if that still hasn't sunk in, then you might want to review a little.
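
For illustration only -- this is not the code I posted, and the chunk size, the spill-file naming, and the plain sort() stand-in below are all made up -- here is a rough sketch of where that in-memory de-duplication sits in a file-based external merge sort:

#!/usr/bin/perl
# extsort_sketch.pl -- illustration only: duplicates are dropped *before*
# each spill file is written, and would be dropped again while merging the
# spill files.  sort() plus an adjacent-equals skip stands in here for the
# uniqmergesort() shown below, which removes them during the merge passes.
use strict;
use warnings;

my $chunk_size = 1_000_000;   # made-up: records held in memory at once
my @spill_files;
my @chunk;

while ( my $rec = <STDIN> ) {
    push @chunk, $rec;
    if ( @chunk >= $chunk_size ) {
        push @spill_files, spill( \@chunk );
        @chunk = ();
    }
}
push @spill_files, spill( \@chunk ) if @chunk;
# ...the spill files would then be merged N-way, again skipping any record
# that compares equal to the one most recently written out.

sub spill {
    my( $recs )= @_;
    my @uniq;
    for my $rec ( sort @$recs ) {          # stand-in for uniqmergesort()
        push @uniq, $rec if !@uniq || $uniq[-1] ne $rec;
    }
    my $file = "spill." . scalar @spill_files;   # made-up naming scheme
    open my $fh, '>', $file or die "open $file: $!";
    print {$fh} @uniq;
    close $fh or die "close $file: $!";
    return $file;
}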

Here is the full wrapper including the previously posted sort implementations for anybody who wants to play with it further.

#!/usr/bin/perl -l
use strict;
use Time::HiRes qw< time >;

# string comparator used by both sorts (note the reversed argument order)
sub compare { $_[1] cmp $_[0] }

sub mergesort {
    my( $av )= @_;
    _mergesort( $av, 0, $#$av );
    return $av;
}

# sorts in place, then truncates the array to drop the duplicates found
sub uniqmergesort {
    my( $av )= @_;
    $#$av -= _uniqmergesort( $av, 0, $#$av );
    return $av;
}

sub _mergesort {
    my( $av, $beg, $end )= @_;
    if( $beg < $end-1 ) {
        my $mid= int( ($beg+$end)/2 );
        _mergesort( $av, $beg, $mid );
        _mergesort( $av, $mid+1, $end );
        merge( $av, $beg, $mid, $mid+1, $end );
    } elsif( $beg == $end-1 ) {
        my $cmp= compare( $av->[$beg], $av->[$end] );
        if( -1 == $cmp ) {
            @$av[$beg,$end]= @$av[$end,$beg];
        }
    }
}

# returns the number of duplicates dropped from $av->[$beg..$end]
sub _uniqmergesort {
    my( $av, $beg, $end )= @_;
    if( $beg < $end-1 ) {
        my $mid= int( ($beg+$end)/2 );
        my $d1= _uniqmergesort( $av, $beg, $mid );
        my $d2= _uniqmergesort( $av, $mid+1, $end );
        $d1 += uniqmerge( $av, $beg, $mid-$d1, $mid+1, $end-$d2 );
        return $d1+$d2;
    }
    if( $beg == $end-1 ) {
        my $cmp= compare( $av->[$beg], $av->[$end] );
        if( -1 == $cmp ) {
            @$av[$beg,$end]= @$av[$end,$beg];
            return 0;
        }
        return 1 if 0 == $cmp;
    }
    return 0;
}

# standard merge of two adjacent sorted runs
sub merge {
    my( $av, $a, $b, $c, $d )= @_;
    my $beg= $a;
    my @i;
    while( $a <= $b && $c <= $d ) {
        my $cmp= compare( $av->[$a], $av->[$c] );
        push @i, $a++ if -1 != $cmp;
        push @i, $c++ if 1 != $cmp;
    }
    @$av[$beg..$d]= @$av[ @i, $a..$b, $c..$d ];
}

# merge that keeps only one copy when records compare equal; returns the
# number of duplicates dropped
sub uniqmerge {
    my( $av, $a, $b, $c, $d )= @_;
    my $beg= $a;
    my $end= $d;
    my @i;
    while( $a <= $b && $c <= $d ) {
        my $cmp= compare( $av->[$a], $av->[$c] );
        push @i, $a++ if -1 != $cmp;
        push @i, $c++ if -1 == $cmp;
        $c++, $end-- if 0 == $cmp;
    }
    @$av[$beg..$end]= @$av[ @i, $a..$b, $c..$d ];
    return $d-$end;
}

if( 3 != @ARGV || $ARGV[0] !~ /^[ud]$/ ) {
    # no benchmark spec given: just sort (and uniq-sort) the args themselves
    print join ' ', @{mergesort([@ARGV])};
    print join ' ', @{uniqmergesort([@ARGV])};
} else {
    # benchmark mode: dupsort.pl <u|d> <count> <expression generating one item>
    my $t= time();
    my( $u, $n, $e )= @ARGV;
    my $s= eval "package x; no strict; sub { $e }"
        or die "Can't compile ($e): $@\n";
    my @l;
    $#l= $n-1;
    @l= ();
    push @l, $s->() for 1..$n;
    my $m= time();
    if( 'u' eq $u ) {
        uniqmergesort( \@l );
        print 0+@l, " unique items sorted.(", length($l[0]), ")";
    } else {
        mergesort( \@l );
        print 0+@l, " total items sorted.(", length($l[0]), ")";
    }
    my $e= time();
    printf "%.3fs sort, %.3fs init.\n", $e-$m, $m-$t;
}
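
And, purely as a sanity check (this was not part of the timing runs), the subs above can be cross-checked against Perl's built-in sort with something like the following sketch; it assumes compare(), mergesort(), and uniqmergesort() have been pasted into the same file:

#!/usr/bin/perl
# check_dupsort.pl -- sketch: cross-check mergesort()/uniqmergesort() against
# the built-in sort.  Assumes the subs from the wrapper above are defined in
# this file; nothing here was part of the timing runs.
use strict;
use warnings;

my @data = map { int rand 100 } 1 .. 1000;

# despite its reversed-looking arguments, compare() produces an ascending
# string sort, so the built-in sort's default order is the right reference
my @want_all = sort @data;
my %seen;
my @want_uniq = grep { !$seen{$_}++ } @want_all;

my @got_all  = @{ mergesort(     [ @data ] ) };
my @got_uniq = @{ uniqmergesort( [ @data ] ) };

print "mergesort:     ", ( "@got_all"  eq "@want_all"  ? "ok" : "NOT ok" ), "\n";
print "uniqmergesort: ", ( "@got_uniq" eq "@want_uniq" ? "ok" : "NOT ok" ), "\n";

(The wrapper itself will also sort and print whatever you give it when the arguments don't look like a benchmark run, e.g. perl dupsort.pl c a b b a prints both the full sorted list and the de-duplicated one.)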

- tye        

Re^12: In-place sort with order assignment (runs)
by BrowserUk (Patriarch) on Sep 21, 2010 at 23:24 UTC
    And, to be clear

    For all your verbosity, you're rarely ever clear.

Re^12: In-place sort with order assignment (runs)
by BrowserUk (Patriarch) on Sep 22, 2010 at 08:11 UTC

    I'm still unconvinced that you are achieving N log U. To me that still implies that you are achieving the detection of all duplicates with a single compare each. And as some dups of a single number will end up in different extents of the input, it must take more than one per dup.

    I'd like to comment upon your code & tests, but per normal, I don't understand them. Were you ever in the military? Cos there's an old saying about the easy way; the hard way; and the army/navy/... way.

    The bottom line is that even assuming that your O(N log U) is in the ballpark, for my purposes, it doesn't allow me to economically let the sort perform the uniq'ing I need.

    10^9 * log( 10^6 ) is still horrible compared to 10^6 * log( 10^6 ) + 10^9 (uniq'ing via a hash). And that's before you throw in IPC and diskIO costs.
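
    (To put rough numbers on that, here is a quick sketch; constants, IPC, and disk I/O are all ignored, and log base 2 is used purely for scale -- it is an illustration, not a benchmark:)

        #!/usr/bin/perl
        # sketch only: relative scale of the two estimates above
        # (D = 1e9 total records, U = 1e6 unique records)
        use strict;
        use warnings;

        sub log2 { log( $_[0] ) / log( 2 ) }

        my $D = 1e9;
        my $U = 1e6;
        printf "sort does the uniq'ing: D * log(U)     ~ %.1e\n", $D * log2( $U );
        printf "hash first, then sort:  U * log(U) + D ~ %.1e\n", $U * log2( $U ) + $D;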


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I believe that O(0.5 * (N + U) * log N) is a good approximation of the complexity of a combined mergesort-unique algorithm.

      The logic behind it is that there are log N merge steps to perform. In the lower steps the probability of duplicates is very low, so the number of comparisons will be proportional to N. On the other hand, in the higher merge steps, the probability of duplicates is very high and so the number of comparisons will be proportional to U.

      We can optimistically assume that the mean number of operations per step is (U+N)/2, so the total number of operations becomes proportional to (U+N)/2 * log N.

      And obviously, that can be simplified to O(N*log N).
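
      A rough way to check that estimate empirically (this is not part of the posted code; it assumes compare() and uniqmergesort() from tye's wrapper above have been copied in, minus the wrapper's command-line driver, and simply wraps compare() with a counter):

          #!/usr/bin/perl
          # cmpcount.pl -- sketch: measured comparison counts vs. the
          # 0.5 * (N + U) * log2(N) estimate.  Assumes compare() and
          # uniqmergesort() from the wrapper above are defined in this file.
          use strict;
          use warnings;

          my $count = 0;
          {
              # wrap compare() so every call is counted
              no warnings 'redefine';
              my $orig = \&compare;
              *compare = sub { $count++; $orig->( @_ ) };
          }

          my $N = 100_000;
          for my $range ( 10, 1_000, 100_000 ) {
              my @data = map { int rand $range } 1 .. $N;
              my %seen;
              my $U = grep { !$seen{$_}++ } @data;    # actual unique count
              $count = 0;
              uniqmergesort( [ @data ] );
              my $est = 0.5 * ( $N + $U ) * log( $N ) / log( 2 );
              printf "N=%d U=%-6d measured=%-8d estimate=%.0f\n",
                  $N, $U, $count, $est;
          }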

        And obviously, that [O(0.5 * (N + U) * log N)] can be simplified to O(N*log N).

        And that (as I've noted here before) is the trouble with big-O. It is such a blunt instrument.

        The moment you try to use it to analyse a particular variation of an algorithm in detail, some bright spark will conclude that your efforts are wrong because your detail reduces to some blunt canonical form.

        But suggest that the variation is no different from (no better than) the classic algorithm because they have the same big-O canonical reduction, and that same bright spark will tell you that you have to look at the detail.

        And they'll start throwing Ds instead of Ns into the mix, but then hoist you by their petard for suggesting there might or might not be some difference between D & N.

        It is obvious that tye's mergesort-unique algorithm will be more efficient than a standard mergesort on data with a high degree of duplication. The fact that in the general case across all datasets, they both reduce to the same big-O formula just goes to show what a nonsense big-O is.


        I believe that O(0.5 * (N + U) * log N) is a good approximation of the complexity of a combined mergesort-unique algorithm.
        While O(0.5 * (N + U) * log N) isn't incorrect, given that 0 <= U <= N, O(0.5 * (N + U) * log N) and O(N log N) are the same. (That is, any function that's in O(0.5 * (N + U) * log N) is also in O(N log N) and vice versa.)
        The logic behind it is that there are log N merge steps to perform. In the lower steps the probability of duplicates is very low, so the number of comparisons will be proportional to N.
        Actually, for O(N log U) to be different from O(N log N), U must be o(N). That is, even if only 1 in 1000 elements is unique, O(N log U) is equivalent to O(N log N) (after all, log U == log(N/1000) == log(N) - log 1000). So, for a set where O(N log U) is different from O(N log N), the chances of two random elements being the same are actually pretty high.

        I think U should even be o(N^ε) for all ε > 0.
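
        A tiny numeric illustration of the 1-in-1000 example above (nothing from the thread's code, just the logs): if U = N/1000, then log U trails log N by only the constant log 1000, so N log U and N log N stay within a constant factor of each other as N grows.

            #!/usr/bin/perl
            # sketch: log(N/1000) differs from log(N) by the constant log(1000),
            # so the ratio log(N)/log(N/1000) heads toward 1 as N grows
            use strict;
            use warnings;
            for my $n ( 1e6, 1e9, 1e12 ) {
                printf "N=%.0e  log N = %5.1f  log(N/1000) = %5.1f  ratio = %.2f\n",
                    $n, log($n), log($n/1000), log($n) / log($n/1000);
            }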