Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re: Better mousetrap (getting top N values from list X)

by BrowserUk (Patriarch)
on Feb 02, 2005 at 05:27 UTC ( [id://427155]=note: print w/replies, xml ) Need Help??


in reply to Better mousetrap (getting top N values from list X)

As the benchmarks show, your algorithm, is a good one. With a few tweaks to the implementation it runs faster still:

sub top_x { my( $n, $aref ) = @_; my @topN = (0)x$n--; for my $val ( @$aref ) { next if $topN[ $n ] > $val; $topN[ $_ ] < $val and splice( @topN, $_, 0, $val ), last for 0 .. $n; } return @topN[ 0 .. $n ]; }

If this is more than an intellectual exercise, and you can handle Inline::C, then the same algorithm C-ified really flies:

Update: Correct the Inline::C implementation below to avoid calloc and allow me to free the temporary C array.

void topN( int n, AV*data ) { int *topN; int len = av_len( data ); int i, j, k; Inline_Stack_Vars; Newz( 1, topN, n + 1, int ); for( i = 0; i <= len; i++ ) { int val = SvIV( *av_fetch( data, i, 0 ) ); for( j = 0; j < n; j++ ) { if( topN[ j ] > val ) continue; if( topN[ j ] < val ) { for( k = n; k > j; k-- ) topN[ k ] = topN[ k-1 ]; topN[ j ] = val; break; } } } Inline_Stack_Reset; for( i = 0; i < n; i++ ) Inline_Stack_Push( sv_2mortal( newSViv( topN[ i ] ) ) ); Safefree( topN ); Inline_Stack_Done; }

Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Replies are listed 'Best First'.
Re^2: Better mousetrap (getting top N values from list X)
by Limbic~Region (Chancellor) on Feb 02, 2005 at 13:49 UTC
    BrowserUk,
    Thanks. I didn't spend any time trying to tweak it to squeeze every last bit of power out of it so I am pleased that it performed well. I do have a confession to make. The reason I asked the question in the first place is so that I could get a free education. Let me explain:

    While I have illusions of grandeur, I realize that many people that frequent this site are both far more knowledgeable and intelligent than I am. I haven't had any formal training on algorithms and I can't seem to force myself to do any reading on my own. I have assimilated lots of little pieces of information which allow me to write fairly efficient code naturally but I couldn't figure out the big O if you held a gun to my head. Additionally, I don't have the ability to see how changing factors, such as the length of each list, effects a given algorithm without just testing it.

    The problem with just using the TIAS approach is that you can get misleading results depending on your sample size. This isn't to say that I don't believe benchmarking isn't valuable. I do benchmark, but I seem to retain more from seeing how others solve the same problem. It is as though until they expand the boundaries of the box I am thinking in, there is only so far I can stretch my imagination.

    So again - thanks - to you, and everyone else here at the Monastery that has given me a high-priced education over the last 2 1/2 years for free.

    Cheers - L~R

      FWIW, I didn't really set out to tweak your algorithm as such. It just kind of appalled me that for many combinations of total set/desired subset sizes, the (emperically) quickest algorithm available to the Perl programmer proved to be sort'n'slice!

      Even with really quite large datasets (10k & 100k), as soon as the required subset moved above around 10%, the overhead of applying even quite a small number of perl opcodes to each value in the total set--unavoidably O(N)--outweighted the costs of performing an O(NlogN) (or whatever) sort algorithm in C.

      So I set out to see how close I could get to the C/sort performance in Perl. Starting with your algorithm was the obvious choice as it outformed everything else offered. It's canny use of short-circuiting puts it head and shoulders above the other algorithms. Then it became a case of seeing how few (and the least expensive) opcodes one could use to fulfill it.

      There may be a little more that could be trimmed from my reworking of your algorithm, but it rapidly became obvious that the only way to beat the sort'n'slice method would be to move to C--and implementing your algorithm was the obvious choice again.

      Once you start comparing like with like (ie. C with C), then the benefits of your, basically O(N), algorithm shine relative to the O(NlogN) of the sort and it wins in most (though not all) cases over the sort as you would expect.

      I'm not yet sure why the sort still wins occasionally with large subsets of large total sets--it probably comes down to the cost of extending the Perl stack to accomodate the return list, combined with the need to splice new high values into the return list as they are discovered. Perhaps someone with more XS experience than I could make it work better.

      The upshot as far as I am concerned, is a comfirmation of something I've voiced here on a few occasions. Big O notation, useful as it is, doesn't tell the whole story when (some parts of) one algorithm are performed at the C level and (some parts of) the other algorithm are performed at the Perl level. And even when both are done in C, it is very difficult to incorporate the costs of the housekeeping of an algorithm (memory allocation etc.) ) into the overall O-notation costs.

      In this case, whilst it ought to be easy to beat sort'n'slice--and is for smallish subsets of smallish total sets--it proved to be a lot harder to achieve that for the general case.

      So, whilst O-notation can give very good insights into the potential comparative costs of algorithms, in the end, a good oldfashioned benchmark of the actual implementations is always required to make the final determination.

      I've no doubt that were it possible to consider all the parts of the implementation of both algorithms in great detail, it would be possible to make the O-notation reflect the reality of them, but in my experience, that takes much longer and is much harder to do than simply code them and test it.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://427155]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (5)
As of 2024-03-28 21:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found