Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re^5: Better mousetrap (getting top N values from list X)

by BrowserUk (Patriarch)
on Feb 03, 2005 at 22:17 UTC ( [id://427840]=note: print w/replies, xml ) Need Help??


in reply to Re^4: Better mousetrap (getting top N values from list X)
in thread Better mousetrap (getting top N values from list X)

The code would need more work before it could go into List::Util

Agreed, but I did have a go at making topN() accept a comparator function, but the cost off calling back into Perl code for each comparison exacts a heavy performance penalty--to the point where it looses all benefit. Though it is possible that my fledgling XS/internals skills bore some of that cost.

The only way I see to avoid that, would be to have four separate routines, (say) highNn(); highNs(); lowNn(); lowNs();.

I think this is the route taken by sort, albeit that the "naming convention" (for backwards compatibiity?) is sort{$a<=>$b}(); thru sort{$bcmp$a}();

Since the basic algorithm needs only a size-N array, there ought to be some way to wrap this as a package that takes an arbitrary input set of unlimited size,...

Personally, I'm not very keen on that type of pseudo-OO wrapper around what is essentially a functional algorithm. Would the wrapper code accumulate a copy of everything passed or discard values that didn't make the cut?

If the former, unnecessary duplication of storage results. If the latter, I think I would prefer not to do a function call to reject individual values that don't make the cut. I'd code the accumlator scenario something like:

my @topN = topN( $n, \@data ); @topN = topN( $n, [ @topN, grep{ chomp; $_ > $topN[ -1 ] } <> ] );

(BTW. I really wish that chomp returned the chomped value. It would make some constructs so much easier.)

or maybe

my @topN( $n, \@data ); my @batch; while( not eof ARGV ) { for( my $i=0; $_ = <> and $i < $BATCHSIZE; $i++ ) { chomp; push @batch, $_; } @topN( $n, [ @topN, @batch ] ); }

Basically, I prefer to retain programmer choice in stuff like this--but there would be nothing to stop someone adding a wrapper around it, though four (type/direction) versions would probably be needed again.


Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Replies are listed 'Best First'.
Re^6: Better mousetrap (getting top N values from list X)
by tall_man (Parson) on Feb 04, 2005 at 00:54 UTC
    The wrapper would discard everything that didn't make the cut, so that only N values would be needed internally instead of scalar(@X) values. I can see why batching would be a good approach, though it wouldn't be too bad to allow single-valued insertions.

    I would just like an interface capable of handling multi-gig input streams without choking, and capable of reporting a running top-N at intervals within a loop without having to be reset.

      Understood. The reason for my reluctance to do things one at a time is because I saw the effects of doing a C to perl callback for the comparator. Just calling the comparator (without actually using it's return value) drops my topN() routine from first, to second last place in the benchmark.

      [ 1:01:43.25] P:\test>427029 -MAX=100 -N=10 10 from 100 (1 seconds) 100|99|98|97|96|95|94|93|92|91 Rate Ari Buk LR BukLR perl C Ari 464/s -- -74% -83% -93% -98% -99% Buk 1768/s 281% -- -35% -74% -93% -97% LR 2731/s 488% 54% -- -60% -89% -95% BukLR 6876/s 1381% 289% 152% -- -72% -88% perl 24678/s 5217% 1296% 804% 259% -- -57% C 57443/s 12276% 3149% 2004% 735% 133% -- [ 1:02:07.47] P:\test>427029 -MAX=100 -N=10 10 from 100 (1 seconds) 100|99|98|97|96|95|94|93|92|91 Rate Ari C Buk LR BukLR perl Ari 477/s -- -32% -72% -79% -92% -98% C 703/s 47% -- -59% -69% -88% -97% Buk 1729/s 262% 146% -- -24% -71% -93% LR 2266/s 375% 222% 31% -- -61% -91% BukLR 5881/s 1132% 737% 240% 159% -- -77% perl 25039/s 5146% 3463% 1348% 1005% 326% --

      There may be some fat that could be trimmed as I put the callback code in a separate C function (per the examples) in order to isolate all the stack manipulations.

      That could probably be inlined with care, but I also tried calling a dummy C-function and it is very clear that it is the callback to perl, not the call to the C function that is chewing all the time. Once you get into the C code, it's better to stay and do as much work as possible in one go.

      That said, I guess you could alway do the batching within the wrapping code also, which would make sense.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
        Out of curiousity, was your comparator of the {$a <=> $b} type, or did you pass parameters to it? It seems as if we could tap into sort's efficient comparator speed if we could use $a and $b from C code. Generic comparisons might be easier that way.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://427840]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (7)
As of 2024-04-18 09:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found