in reply to Better mousetrap (getting top N values from list X)

In Perl, the fastest watermark algorithm is probably

sub top_x { my $n = shift; my @top = splice @_, 0, $n; @top = ( sort { $a <=> $b } $_, @top )[ 1 .. $n ] for @_; return @top; }

With the mergesort used in newer versions of Perl and @top being nearly sorted in all iterations but the first, sort will do its work rapidly. That will almost certainly beat any explicitly spelled out algorithm except for truly large values of $n and even longer lists (like maybe selecting the top 10,000 out of 1,000,000 elements; maybe not even that). Though I'm not sure it even beats a straight sort+slice… achieving that probably requires a list of a few thousand elements.

I had a similar wakeup call when I tried to use a heap to compete against a splice algorithm a while ago. (I can't be bothered to Super Search it right now.)

If someone cares to benchmark this, I'd very interested to see how the numbers look in practice.

It is sometimes frustrating, but clever Perl algorithms can very rarely beat builtins. If you want competitive algorithmic elegance, you'll have to drop back to XS. An XS call has a certain fixed overhead cost though so for small lists you might still lose.

Makeshifts last the longest.

Replies are listed 'Best First'.
Re^2: Better mousetrap (getting top N values from list X)
by BrowserUk (Patriarch) on Feb 01, 2005 at 21:45 UTC

    See my pad for a benchmark.


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.

      Why not post it in the thread? Here's an adapted bench with a version of my code modified slightly to be callable like the others:

      It is interesting to see that my code wins hands down when N/MAX is close to 1. Even when the ratio of N/MAX shrinks, my code loses a lot of ground but keeps beating Limbic's proposition. None of this matters much though since for large MAX, all of the solutions perform very similarly, even if the trends remain clear.

      So out of curiosity I added the following bit to the code:

      baseline => sub { my ( $n, $list ) = @_; return ( sort { $a <=> $b } @$list )[ @$list - $n .. $#$list ] +; },

      Well, I'll just say let's pack the bags and go home folks. Nothing to see here, move along. As I said: clever Perl code vs builtin: clever Perl code loses. Grossly disproportionately, in fact.

      (Spoiler for anyone who doesn't care to run the benchmarks: in all cases the baseline sort version runs hundreds to thousands of times faster than the other solutions.)

      Makeshifts last the longest.

        Your benchmark gives as input a list in sorted order. As you mentioned, mergesort is extremely good on sorted or nearly-sorted lists. You should shuffle the list first to avoid giving the builtin a slam-dunk.

        I tried it, and in a couple of cases limbic gets comparable times and even beats baseline:

        Looking for top 5 in 10000 (running for 9.5 CPU secs) Rate aristotle browseruk limbic baseline aristotle 13.7/s -- -46% -89% -92% browseruk 25.2/s 84% -- -81% -85% limbic 130/s 847% 414% -- -23% baseline 169/s 1134% 571% 30% -- Looking for top 5 in 100000 (running for 15 CPU secs) Rate aristotle browseruk baseline limbic aristotle 1.38/s -- -44% -85% -90% browseruk 2.48/s 79% -- -73% -82% baseline 9.14/s 560% 268% -- -35% limbic 14.0/s 913% 465% 53% --