Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Re^3: Better mousetrap (getting top N values from list X)

by BrowserUk (Patriarch)
on Feb 03, 2005 at 10:33 UTC ( [id://427574]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Better mousetrap (getting top N values from list X)
in thread Better mousetrap (getting top N values from list X)

Having tracked down (part of) the p5p thread demerphq mentioned, the suggestion was that you would also be able to employ the short-cicuited sort by using a slice:

@most[ 0 .. 6 ] = sort @ary; # only sort 7 entries

I'm not familiar with how the RHS hints are derived, but that would seem to address both your concerns.

That said, I think topNbs() as posted by tall_man could be added to List::Util quite easily and would probably be easier to get accepted because of the lower risk.

Then again, it doesn't really use a List, so maybe it is time for an Array::Util package.


Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Replies are listed 'Best First'.
Re^4: Better mousetrap (getting top N values from list X)
by tall_man (Parson) on Feb 03, 2005 at 18:36 UTC
    The code would need more work before it could go into List::Util, since it only works for integer values, positive only, and only for the N highest (not lowest, or any arbitrary sorting function). It's a good idea, though.

    Since the basic algorithm needs only a size-N array, there ought to be some way to wrap this as a package that takes an arbitrary input set of unlimited size, not just an array ref as we have here.

    my $acc = new NGreatest($n); $acc->insert(\@arr); while <> { $acc->insert(chomp $_); } my @top = $acc->result();
      The code would need more work before it could go into List::Util

      Agreed, but I did have a go at making topN() accept a comparator function, but the cost off calling back into Perl code for each comparison exacts a heavy performance penalty--to the point where it looses all benefit. Though it is possible that my fledgling XS/internals skills bore some of that cost.

      The only way I see to avoid that, would be to have four separate routines, (say) highNn(); highNs(); lowNn(); lowNs();.

      I think this is the route taken by sort, albeit that the "naming convention" (for backwards compatibiity?) is sort{$a<=>$b}(); thru sort{$bcmp$a}();

      Since the basic algorithm needs only a size-N array, there ought to be some way to wrap this as a package that takes an arbitrary input set of unlimited size,...

      Personally, I'm not very keen on that type of pseudo-OO wrapper around what is essentially a functional algorithm. Would the wrapper code accumulate a copy of everything passed or discard values that didn't make the cut?

      If the former, unnecessary duplication of storage results. If the latter, I think I would prefer not to do a function call to reject individual values that don't make the cut. I'd code the accumlator scenario something like:

      my @topN = topN( $n, \@data ); @topN = topN( $n, [ @topN, grep{ chomp; $_ > $topN[ -1 ] } <> ] );

      (BTW. I really wish that chomp returned the chomped value. It would make some constructs so much easier.)

      or maybe

      my @topN( $n, \@data ); my @batch; while( not eof ARGV ) { for( my $i=0; $_ = <> and $i < $BATCHSIZE; $i++ ) { chomp; push @batch, $_; } @topN( $n, [ @topN, @batch ] ); }

      Basically, I prefer to retain programmer choice in stuff like this--but there would be nothing to stop someone adding a wrapper around it, though four (type/direction) versions would probably be needed again.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
        The wrapper would discard everything that didn't make the cut, so that only N values would be needed internally instead of scalar(@X) values. I can see why batching would be a good approach, though it wouldn't be too bad to allow single-valued insertions.

        I would just like an interface capable of handling multi-gig input streams without choking, and capable of reporting a running top-N at intervals within a loop without having to be reset.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://427574]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (3)
As of 2024-04-24 20:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found