in reply to Re: sort behavior explanation required
in thread sort behavior explanation required

To illustrate how the difference between quick sort and merge sort can be huge, this is a small benchmark of the two algorithms implemented in pure Perl, sorting 5000 integers 80 times.
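The bench_sort.pl script itself is not shown in this node. As a rough sketch of the kind of harness being described (the names quick_sort and merge_sort and all details are my guesses, not the actual bench_sort.pl), it might look like this:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Naive quick sort: the first element is the pivot, which is exactly
# what makes already-sorted input pathological.
sub quick_sort {
    my @a = @_;
    return @a if @a <= 1;
    my $pivot = shift @a;
    my (@less, @more);
    for (@a) { $_ < $pivot ? push @less, $_ : push @more, $_ }
    return (quick_sort(@less), $pivot, quick_sort(@more));
}

# Top-down merge sort.
sub merge_sort {
    my @a = @_;
    return @a if @a <= 1;
    my $mid   = int(@a / 2);
    my @left  = merge_sort(@a[0 .. $mid - 1]);
    my @right = merge_sort(@a[$mid .. $#a]);
    my @merged;
    while (@left && @right) {
        push @merged, $left[0] <= $right[0] ? shift @left : shift @right;
    }
    return (@merged, @left, @right);   # at most one side is non-empty here
}

my @data = map { int rand 100_000 } 1 .. 5000;   # random input
# my @data = 1 .. 5000;   # uncomment for the sorted, pathological case

timethese(80, {
    quick => sub { my @s = quick_sort(@data) },
    merge => sub { my @s = merge_sort(@data) },
});
```

With the random @data the quick sub wins; switching to the sorted @data makes the naive pivot choice degenerate to quadratic time, which is the effect measured below.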

With an array in which the numbers have been generated with the random function:

$ perl bench_sort.pl
Benchmark: timing 80 iterations of merge, quick...
     merge:  3 wallclock secs ( 3.76 usr +  0.00 sys =  3.76 CPU) @ 21.28/s (n=80)
     quick:  2 wallclock secs ( 1.53 usr +  0.00 sys =  1.53 CPU) @ 52.32/s (n=80)
So, with random input, my own pure-Perl implementation of quick sort is more than twice as fast as my implementation of merge sort.

Now, lo and behold, see what happens if I run the same algorithms on an already sorted array:

$ perl bench_sort.pl
Benchmark: timing 80 iterations of merge, quick...
     merge:   3 wallclock secs (  3.40 usr +  0.00 sys =   3.40 CPU) @ 23.53/s (n=80)
     quick: 199 wallclock secs (198.93 usr +  0.12 sys = 199.06 CPU) @  0.40/s (n=80)
Quick sort is now 58 times slower than merge sort, and about 130 times slower than on randomly generated data.

The same happens if the data is in reverse order:

$ perl bench_sort.pl
Benchmark: timing 80 iterations of merge, quick...
     merge:   3 wallclock secs (  3.56 usr +  0.00 sys =   3.56 CPU) @ 22.50/s (n=80)
     quick: 199 wallclock secs (197.97 usr +  0.06 sys = 198.03 CPU) @  0.40/s (n=80)

Admittedly, such pathological behavior of quick sort is very rare with random data, but having to sort data that is almost sorted is a very common task in real life. A typical example is the Association of Tennis Professionals (ATP) rankings: new rankings are published once a week or so, and from one week to the next there are usually only relatively small differences. So if you take last week's ranking, update each player's points and sort again, the data is almost sorted. Quick sort would be very bad for that (actually, bubble sort would do much better in such a situation). The same would happen when recalculating the index of a database table after a few additions or deletions. It has been suggested that, for quick sort to be efficient, one needs a first phase of randomizing the array to be sorted.
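That randomization phase can be done in two ways, sketched below: pick the pivot at random inside the sort, or shuffle the whole array up front with List::Util's shuffle and then use any fixed-pivot quick sort. (quick_sort_rand is a made-up name for illustration, not from the code under discussion.)

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Quick sort with a randomly chosen pivot: no particular input order
# is reliably pathological any more.
sub quick_sort_rand {
    my @a = @_;
    return @a if @a <= 1;
    my $pivot = splice @a, int rand @a, 1;   # remove a random element
    my (@less, @more);
    for (@a) { $_ < $pivot ? push @less, $_ : push @more, $_ }
    return (quick_sort_rand(@less), $pivot, quick_sort_rand(@more));
}

my @sorted_input = 1 .. 5000;              # worst case for a naive pivot
my @result = quick_sort_rand(@sorted_input);

# The other option: shuffle first, then hand the shuffled copy to any
# fixed-pivot quick sort.
my @shuffled = shuffle @sorted_input;
```

Either way the expected cost goes back to O(n log n) on already-sorted input, at the price of giving up quick sort's good behavior being reproducible from run to run.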

Re^3: sort behavior explanation required
by AnomalousMonk (Archbishop) on Apr 20, 2014 at 18:35 UTC

    As always, the Devil lurks in the details. Can we see bench_sort.pl?

      So did you see the devil? Any error somewhere? The code I posted is not optimal; it is part of a much larger project I did two years ago, with many other sorting algorithms being tested, hence the double function calls, which are not really needed here. But, aside from that, do you see anything wrong in what I am saying?

        I'm embarrassed to say that, after asking you to post your code and after you went to the trouble to do so, I haven't had a chance to give it the long, careful consideration it deserves.

        I have, however, given it a quick look. I haven't seen the Devil yet, but he is wily. A few things strike my eye.

        • There's no test against the built-in sort to check that your sorts are correct.
        • I would have been interested to see timing comparisons of the two different built-in sort algorithms that are available, quick and merge, on the random and pathological data sets you're using. See the sort pragma.
        • Are there any other pathological data sets to throw at these two sorts?
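On the last bullet: for a naive two-way quick sort that takes the first element as pivot and sends all keys equal to the pivot to one side, an array of many duplicate keys is another classic worst case, since every partition is maximally lopsided. A small sketch with made-up names, counting comparisons to make the quadratic behavior visible:

```perl
use strict;
use warnings;
no warnings 'recursion';   # the worst case recurses $n levels deep

my $comparisons;

# Naive two-way quick sort (first element as pivot, equal keys all
# pushed to one side), instrumented with a comparison counter.
sub quick_sort_count {
    my @a = @_;
    return @a if @a <= 1;
    my $pivot = shift @a;
    my (@less, @more);
    for (@a) { $comparisons++; $_ < $pivot ? push @less, $_ : push @more, $_ }
    return (quick_sort_count(@less), $pivot, quick_sort_count(@more));
}

my $n = 2000;

$comparisons = 0;
quick_sort_count(map { int rand 1e6 } 1 .. $n);
my $random_cmp = $comparisons;

$comparisons = 0;
quick_sort_count((7) x $n);    # all keys equal: every element lands on
my $dup_cmp = $comparisons;    # the same side, so n*(n-1)/2 comparisons

printf "random: %d comparisons, all-equal: %d comparisons\n",
       $random_cmp, $dup_cmp;
```

On the all-equal input the count is exactly n*(n-1)/2, against roughly n log n for the random input, so duplicate-heavy data would be a worthwhile third data set alongside sorted and reversed.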
        I hope I will shortly be able to give this the time it merits.