FWIW, I didn't really set out to tweak your algorithm as such. It just kind of appalled me that for many combinations of total set/desired subset sizes, the (emperically) quickest algorithm available to the Perl programmer proved to be sort'n'slice!

Even with really quite large datasets (10k & 100k), as soon as the required subset moved above around 10%, the overhead of applying even quite a small number of perl opcodes to each value in the total set--unavoidably O(N)--outweighted the costs of performing an O(NlogN) (or whatever) sort algorithm in C.

So I set out to see how close I could get to the C/sort performance in Perl. Starting with your algorithm was the obvious choice as it outformed everything else offered. It's canny use of short-circuiting puts it head and shoulders above the other algorithms. Then it became a case of seeing how few (and the least expensive) opcodes one could use to fulfill it.

There may be a little more that could be trimmed from my reworking of your algorithm, but it rapidly became obvious that the only way to beat the sort'n'slice method would be to move to C--and implementing your algorithm was the obvious choice again.

Once you start comparing like with like (ie. C with C), then the benefits of your, basically O(N), algorithm shine relative to the O(NlogN) of the sort and it wins in most (though not all) cases over the sort as you would expect.

I'm not yet sure why the sort still wins occasionally with large subsets of large total sets--it probably comes down to the cost of extending the Perl stack to accomodate the return list, combined with the need to splice new high values into the return list as they are discovered. Perhaps someone with more XS experience than I could make it work better.

The upshot as far as I am concerned, is a comfirmation of something I've voiced here on a few occasions. Big O notation, useful as it is, doesn't tell the whole story when (some parts of) one algorithm are performed at the C level and (some parts of) the other algorithm are performed at the Perl level. And even when both are done in C, it is very difficult to incorporate the costs of the housekeeping of an algorithm (memory allocation etc.) ) into the overall O-notation costs.

In this case, whilst it ought to be easy to beat sort'n'slice--and is for smallish subsets of smallish total sets--it proved to be a lot harder to achieve that for the general case.

So, whilst O-notation can give very good insights into the potential comparative costs of algorithms, in the end, a good oldfashioned benchmark of the actual implementations is always required to make the final determination.

I've no doubt that were it possible to consider all the parts of the implementation of both algorithms in great detail, it would be possible to make the O-notation reflect the reality of them, but in my experience, that takes much longer and is much harder to do than simply code them and test it.


Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

In reply to Re^3: Better mousetrap (getting top N values from list X) by BrowserUk
in thread Better mousetrap (getting top N values from list X) by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.