in reply to Re^4: In-place sort with order assignment (-u)
in thread In-place sort with order assignment
Is "/usr/bin/sort -u" really so dumb as to sort all of the duplicates and then only eliminate them at the end? I would be a bit surprised, as well as sad, if that were so. I would also be surprised if it used a hash or similar structure, since those tend to thwart the "divide and conquer" strategies that end up being required for the huge sorts that /usr/bin/sort is designed for.
I'd expect a modified sorting algorithm that eliminates duplicates whenever they "touch". But I've long wondered why I have yet to run into any discussion of a formal algorithm like that (though I haven't searched for one). My somewhat wild guess is that the performance of such an algorithm would be closer to O(U*log(U)) than to O(D*log(D)) (U = number of unique items, D = total number of items including duplicates). A lower bound would be O(D+U*log(U)), and I suspect that is a pretty close bound.
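Here is a rough sketch of that idea in Python, not anything /usr/bin/sort actually does: a bottom-up mergesort where each merge drops one copy whenever equal items "touch", so the runs shrink toward U well before the final pass. The names (`merge_unique`, `sort_unique`) are mine.

```python
def merge_unique(a, b):
    """Merge two sorted, duplicate-free runs, keeping one copy when items 'touch'."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i]); i += 1
        elif b[j] < a[i]:
            out.append(b[j]); j += 1
        else:                       # equal: the duplicate is eliminated right here
            out.append(a[i]); i += 1; j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def sort_unique(items):
    """Bottom-up mergesort that eliminates duplicates at every merge step."""
    runs = [[x] for x in items]     # D singleton runs to start
    while len(runs) > 1:
        runs = [merge_unique(runs[k], runs[k + 1]) if k + 1 < len(runs) else runs[k]
                for k in range(0, len(runs), 2)]
    return runs[0] if runs else []
```

With heavy duplication the early merges collapse quickly, which is what would pull the cost down from O(D*log(D)) toward the O(D+U*log(U)) lower bound guessed at above.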
As an aside, even with no customization of the sorting algorithm, a huge number of duplicates would improve worst-case performance. Consider a huge list of just one item repeated over and over. Sorting it is O(N): two equal items are never "out of order", so far fewer comparisons are required to pin down where each belongs in the output.
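That claim is easy to check empirically. The sketch below (my own illustration, using Python's built-in sort rather than /usr/bin/sort) counts how many times the comparator fires; on an all-equal list, CPython's Timsort detects one long "already sorted" run and stops after about N-1 comparisons.

```python
def count_comparisons(items):
    """Sort a copy of items, counting how often the comparator fires."""
    count = 0

    class Counted:
        """Wrapper whose __lt__ (the only comparison sort uses) bumps the counter."""
        __slots__ = ("v",)
        def __init__(self, v):
            self.v = v
        def __lt__(self, other):
            nonlocal count
            count += 1
            return self.v < other.v

    sorted(items, key=Counted)
    return count
```

For example, `count_comparisons([7] * 10000)` comes in around N-1 on CPython, while a shuffled list of 10000 distinct values needs on the order of N*log(N) comparisons.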
That makes a somewhat interesting case that can be estimated via observation. Generate huge lists of equal size but with very different amounts of duplication and compare the run-times of '/usr/bin/sort' and '/usr/bin/sort -u' against them. Repeat for a wide range of list sizes. Perhaps also throw in Windows' sort.exe, which doesn't have -u and might lack some optimizations related to duplicates? (No, I don't plan to spend any time running such an experiment any time soon.)
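For anyone who does want to run it, a minimal harness might look like this. It assumes a POSIX `sort` is on the PATH; the sizes and the helper name are mine, and wall-clock timing this crude only shows gross trends, not careful measurements.

```python
import os
import random
import subprocess
import tempfile
import time

def time_external_sort(lines, unique=False):
    """Time one run of the external sort (optionally with -u) over the given lines.
    Assumes a POSIX-style `sort` executable is available on PATH."""
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as f:
        f.write("\n".join(lines) + "\n")
        path = f.name
    try:
        args = ["sort", "-u", path] if unique else ["sort", path]
        t0 = time.perf_counter()
        subprocess.run(args, stdout=subprocess.DEVNULL, check=True)
        return time.perf_counter() - t0
    finally:
        os.unlink(path)

if __name__ == "__main__":
    size = 200_000                      # same list size throughout...
    random.seed(42)
    for distinct in (10, 1_000, 100_000):   # ...with much different duplication
        lines = [str(random.randrange(distinct)) for _ in range(size)]
        t_plain = time_external_sort(lines)
        t_uniq = time_external_sort(lines, unique=True)
        print(f"{distinct:>7} distinct: sort {t_plain:.3f}s   sort -u {t_uniq:.3f}s")
```

Repeating the loop over several list sizes, as suggested above, would show whether the gap between the two grows with the amount of duplication.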
- tye