I look forward to seeing your implementation
I already posted a working implementation.
Here are some runs from a little wrapper for that:
Below I summarize the timings for sorting 1e6 records, with each batch having
a different amount of duplication. I output no records, to eliminate
"reduced I/O" as a contributing factor. That means a mergesort() version
that did produce output would have to add overhead to skip duplicates during
output, increasing the advantage of uniqmergesort(). A file-based external
merge sort (like /usr/bin/sort) would likely show an even larger gain due to
reduced I/O to the intermediate files -- since I/O usually takes much more
time than in-memory operations like "compare" and "move", and the I/O
reduction would be by a factor closer to D/U than to the smaller
log(D)/log(U) factor I theorized for the over-all speed-up of the in-memory
sorting.
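For example, with D = 1e6 records and U = 1e4 unique values (the third data
line below), that puts the I/O reduction closer to a factor of D/U = 100 than
to log(1e6)/log(1e4) = 1.5.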
uniqmergesort        mergesort       (unique values in parentheses)
52.6s  (~1e6)        54.1s  (1e6)
44.9s  (~1e5)        49.2s
37.8s  (10k)         48.1s           <- third line, described below
30.3s  (1k)          47.0s
23.4s  (100)         42.1s
16.5s  (10)          40.4s
11.1s  (1)           37.5s
So, sorting exactly 1e6 records that are roughly 100 copies each of exactly
1e4 unique records (third line above) shows uniqmergesort() taking about 30%
less time than the nearly duplicate-free case (a speed-up factor of about 1.4,
close to the log(1e6)/log(1e4) = 1.5 that I speculated). Plain mergesort()
only gets about a 15% speed-up from those duplicates. My uniqmergesort() even
runs slightly faster than my mergesort() in the case with as few duplicates as
I could manage with a naive use of rand() (not on Windows, where rand() can
only give 32k different values). So the small added complexity certainly
doesn't add up to much run-time.
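To make the test setup concrete, here is a minimal, hypothetical sketch (not
the wrapper posted below) of how a driver could control the duplication level
with a naive use of rand(), and why Windows' small RAND_MAX gets in the way:

    /* Hypothetical sketch only -- not the wrapper posted below.  With d
     * records drawn as rand() % u, each of the u distinct values gets
     * roughly d/u copies.  On Windows, RAND_MAX is only 32767, so naive
     * rand() can't produce more than ~32k distinct values. */
    #include <stdlib.h>
    #include <time.h>

    int *make_records(size_t d, size_t u)
    {
        int *rec = malloc(d * sizeof *rec);
        if (rec == NULL)
            return NULL;
        srand((unsigned)time(NULL));
        for (size_t i = 0; i < d; i++)
            rec[i] = rand() % (int)u;   /* ~d/u duplicates of each value */
        return rec;   /* caller times uniqmergesort()/mergesort(), then free()s */
    }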
Of course it ignores duplicates. There might not be any.
You might as well say that we might not have any records to sort and so can't
start out with a given N. We must be given D records having only U
unique values if we want to make calculations about how different levels of
duplication can speed things up. "There might not be any [duplicates]" would only apply
to the case of D == U. If you don't consider U < D, then of course the
calculations can't show a speed improvement (or else I'd have invented a
"faster than O(N log N)" sort).
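(With heavy duplication the comparison-sort lower bound itself drops: for U
equal-sized groups of duplicates it is roughly D*log(U) comparisons rather
than D*log(D), so a log(D)/log(U) speed-up doesn't conflict with the usual
O(N log N) result.)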
Hm. If there was no scope for achieving better than O(N log N) with your
"early de-duplications", then this discussion would be entirely moot.
Which is why your calculations that ignore duplicates are pointless, as I said.
If you finished them, you'd only arrive at "merge sort is O(N log N)", as is
well established.
It only addresses your remark that duplicates could be removed on output to
spill files, as well as during the merge operation.
The calculations would only address that if they took into account an assumed
amount of duplication. And, to be clear, your re-statement leaves out removing
duplicates during the in-memory sorting, before anything is even written to a
spill file -- since duplicates "touch" (are compared) before then, of course.
That is the algorithm I noted in my second post; I showed one full
implementation of it above and described two other implementations in much
less detail. So if that still hasn't sunk in, then you might want to review a
little.
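For anyone who wants just the flavor of the "remove duplicates during the
merge" part, here is a minimal sketch in C -- a simplified illustration that
assumes plain ints and already de-duplicated runs, not the implementation
posted above:

    #include <stddef.h>

    /* Merge a[0..na) and b[0..nb) into out; when the two run heads compare
     * equal, keep one copy and drop the other.  Returns the merged length,
     * which can be less than na + nb.  (Sketch only -- not the posted
     * uniqmergesort().) */
    static size_t merge_uniq(const int *a, size_t na,
                             const int *b, size_t nb, int *out)
    {
        size_t i = 0, j = 0, n = 0;
        while (i < na && j < nb) {
            if (a[i] < b[j])
                out[n++] = a[i++];
            else if (b[j] < a[i])
                out[n++] = b[j++];
            else {                      /* duplicate across the two runs */
                out[n++] = a[i++];
                j++;                    /* skip the copy in the other run */
            }
        }
        while (i < na) out[n++] = a[i++];
        while (j < nb) out[n++] = b[j++];
        return n;
    }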
Here is the full wrapper including the previously posted sort implementations
for anybody who wants to play with it further.