Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re^3: Better mousetrap (getting top N values from list X)

by Aristotle (Chancellor)
on Feb 02, 2005 at 00:30 UTC ( [id://427106]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Better mousetrap (getting top N values from list X)
in thread Better mousetrap (getting top N values from list X)

Why not post it in the thread? Here's an adapted bench with a version of my code modified slightly to be callable like the others:

#!/usr/bin/perl use strict; use warnings; use List::Util qw( reduce ); use Benchmark qw( cmpthese ); my %code = ( limbic => sub { my ($x, $list) = @_; $x--; my @top; $#top = $x; for my $item ( @$list ) { next if defined $top[ -1 ] && $item <= $top[ -1 ]; for my $id ( 0 .. $#top ) { $top[ $id ] = $item and last if ! defined $top[ $id ]; if ( $item > $top[ $id ] ) { @top[ $id .. $#top ] = ($item, @top[ $id .. $#top +- 1]); last; } } } return @top; }, browseruk => sub { my( $n, $aref ) = @_; my @topN; push @topN, reduce{ $a > $b && (!@topN || $a < $topN[ -1 ] ) ? $a : ( !@topN || $b < $topN[ -1 ] ) ? $b : $a; } @$aref for 1 .. $n; return @topN; }, aristotle => sub { my ( $n, $list ) = @_; my @top = @$list[ 0 .. $n - 1 ]; @top = ( sort { $a <=> $b } $_, @top )[ 1 .. $n ] for @$list[ +$n .. $#$list ]; return @top; }, ); my @bench = ( [ qw/ 10 5 / ], [ qw/ 100 5 / ], [ qw/ 1000 5 / ], [ qw/ 10000 5 / ], [ qw/ 100000 5 / ], [ qw/ 100 50 / ], [ qw/ 1000 50 / ], [ qw/ 10000 50 / ], [ qw/ 100000 50 / ], [ qw/ 1000 500 / ], [ qw/ 10000 500 / ], [ qw/ 100000 500 / ], ); $|++; while( @bench ) { my ( $max, $n ) = @{ shift @bench }; my $duration = sprintf "%.2g", ( log( $max ) / log( 20 ) ) ** 2; print "\nLooking for top $n in $max (running for $duration CPU sec +s)\n"; my @values = 1 .. $max; my @top = ( sort { $a <=> $b } @values )[ @values - $n .. $#values + ]; for( keys %code ) { my @result = sort { $a <=> $b } $code{ $_ }->( $n, \@values ); die "$_ not ok: [@result] ne [@top]\n" if "@result" ne "@top"; } cmpthese -$duration => { map { my $x = $code{ $_ }; $_ => sub { my @x = $x->( $n, \@val +ues ) } } keys %code }; }

It is interesting to see that my code wins hands down when N/MAX is close to 1. Even when the ratio of N/MAX shrinks, my code loses a lot of ground but keeps beating Limbic's proposition. None of this matters much though since for large MAX, all of the solutions perform very similarly, even if the trends remain clear.

So out of curiosity I added the following bit to the code:

baseline => sub { my ( $n, $list ) = @_; return ( sort { $a <=> $b } @$list )[ @$list - $n .. $#$list ] +; },

Well, I'll just say let's pack the bags and go home folks. Nothing to see here, move along. As I said: clever Perl code vs builtin: clever Perl code loses. Grossly disproportionately, in fact.

(Spoiler for anyone who doesn't care to run the benchmarks: in all cases the baseline sort version runs hundreds to thousands of times faster than the other solutions.)

Makeshifts last the longest.

Replies are listed 'Best First'.
Re^4: Better mousetrap (getting top N values from list X)
by tall_man (Parson) on Feb 02, 2005 at 01:00 UTC
    Your benchmark gives as input a list in sorted order. As you mentioned, mergesort is extremely good on sorted or nearly-sorted lists. You should shuffle the list first to avoid giving the builtin a slam-dunk.

    I tried it, and in a couple of cases limbic gets comparable times and even beats baseline:

    Looking for top 5 in 10000 (running for 9.5 CPU secs) Rate aristotle browseruk limbic baseline aristotle 13.7/s -- -46% -89% -92% browseruk 25.2/s 84% -- -81% -85% limbic 130/s 847% 414% -- -23% baseline 169/s 1134% 571% 30% -- Looking for top 5 in 100000 (running for 15 CPU secs) Rate aristotle browseruk baseline limbic aristotle 1.38/s -- -44% -85% -90% browseruk 2.48/s 79% -- -73% -82% baseline 9.14/s 560% 268% -- -35% limbic 14.0/s 913% 465% 53% --

      Sigh. One more of those too frequent infrequent moments where I feel collossally stupid. It's not like I didn't say right there in my first reply that the builtin performs particularly well in those cases…

      I'll take another look at the trends in the bechmarks with some shuffling added.

      Makeshifts last the longest.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://427106]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (5)
As of 2024-03-28 19:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found