in reply to Sorting Hash By Large Number Values

Sorting so as to grab the top N elements is only efficient if you need the list to be sorted anyway.

On the other hand, if you just need top N, and you need it again from time to time, you might use a heap. Heaps are great for keeping track of what comes at the top, without wasting a lot of time keeping the bottom precisely sorted. If all you want is the top N elements just once, a heap is overkill, but I couldn't resist the opportunity to brush up on my recollection of how to use Heap::Simple.

The following snippet generates a hash of a whole bunch of elements. The keys are 'AA' through 'FF', and the values are random integers from 0 through 999. The hash is then converted to a heap (this constitutes the bulk of the overhead) where the heap is heap-ordered by values from the hash. Key/Value pairs are stored in anonymous arrays held in the heap. When you pluck an item from the heap, it's set up to yield the item with the greatest value. The next item you pluck will be the item with second greatest value, and so on.

Here's the example snippet...

use strict; use warnings; use Heap::Simple::Perl; my %hash; $hash{$_} = int( rand( 1000 ) ) for 'AA' .. 'FF'; my $find_n = 10; my $heap = heapify( \%hash ); my $top = get_top_n( $heap, $find_n ); foreach my $element ( @{$top} ) { print "$element->[0]\t=>$element->[1]\n"; } sub heapify { my( $href ) = $_[0]; my $heap = Heap::Simple::Perl->new( 'order' => '>', 'elements' => ['Array'] ); $heap->insert( [ $href->{$_}, $_ ] ) foreach keys %{$href}; return $heap; } sub get_top_n { my( $heap, $n ) = @_; my $top_n ; for( 1 .. $n ) { my( $value, $key ) = @{ $heap->extract_top() }; push @{ $top_n }, [ $key, $value ]; } return $top_n; }

Again, a heap will be useful and efficient if you are frequently needing the top N items from the list. But it's not all that efficient if all you're doing is grabbig the top N one time. For that, just go through the list one element at a time and keep track of the top ten values (no need to sort; that's just a waste of time).

You can get more information on heaps and priority queues here: Wikipedia: Heap, as well as (of course) from the POD for Heap::Simple. My favorite reference however is Mastering Algorithms with Perl

Again, I don't mean to imply that a heap is definately the answer to your need. It may be (depending on how many times you'll be pulling the top-n elements), but it may just be too much work. I post it here as a point of discussion and thank you for giving me the excuse to play around with these useful datastructures again.


Dave

Replies are listed 'Best First'.
Re^2: Sorting Hash By Large Number Values
by thospel (Hermit) on Oct 06, 2005 at 21:04 UTC
    The "Any" type is probably more natural for Heap::Simple here. The code then becomes something like:
    #!/usr/bin/perl -w use strict; use Heap::Simple; my %Stats = map { $_ => int rand 1000 } "AA".."FF"; my $NumOfPortsToGet = 10; my $heap = Heap::Simple->new (order => '<', elements => 'Any', max_count => $NumOfPortsToGet, dirty => 1, ); for my $key (keys %Stats ) { $heap->key_insert($Stats{$key}, $key); } print "$_ => $Stats{$_}\n" while defined($_ = $heap->extract_first);
    If you have the perl version of Heap::Simple, this will be slower than simple non-heap solutions upto a max_count of up to about 30 or so. After that the heap will win (measured without accounting for the time of the first new())

    Heap::Simple::Perl does partial evaluation with the arguments to new() to build a (relatively) efficient set of methods for exactly the type of heap you want to use. That makes the (first) new() slower, but actual runtime faster than you might expect.

    If you have the XS version of Heap::Simple it will beat the naive version for almost any max_count.

    And I think its easier to understand in all cases.

Re^2: Sorting Hash By Large Number Values
by anonymized user 468275 (Curate) on Aug 03, 2005 at 13:08 UTC
    Inserting each hash key into a heap doesn't seem to offer a great deal of performance gain if any, given that on the one hand each insert is becoming more and more costly and on the other perl knows it can optimise the sort to take advantage of the hash, whereas the heapify can't do that in this example. I suspect therefore that a great deal more brain-grief would be needed to beat the simplistic sort algorithm even though only the top n are needed.

    One world, one people

      By adding " 'max_count' => 10 " at heap creation time, you can limit the number of elements tracked in the heap. As new elements are added, inferior elements past the top ten will be dropped. That will reduce the amount of time needed to add elements and improve the efficiency of the heap solution. Of course then you don't have a heap full of values, but instead only the top ten, which may limit the datastructure's usefulness later in the script if the heap is needed again.


      Dave

        Good move - that modification convinces me. Although in that case, the underlying algorithm we now have seems too simple to bother with the overheads of a heap module (update: because a hash is a special sort of heap!). The following algorithm uses the bare bones to deliver the top-n into %SortaHeapy, without using any sorting (code updated):
        $NumOfPortsToGet = 10; local $loaded = 0; my %SortaHeapy = (); for my $key ( keys %Stats ) { ($key eq "UnauthOrigin") or SortaHeapy( $key ); } sub SortaHeapy{ my $key = shift; # if mini-heap not full just load it in if ( $loaded < $NumOfPortsToGet ) { $SortAHeapy{ $key } = $Stats{ $key }; $loaded++ return; } # otherwise do the degenerate heap sort/replace for my $hkey ( keys %SortaHeapy ) { if ( $SortaHeapy{ $hkey } < $Stats{ $key } ) { delete $SortaHeapy{ $hkey }; # replace SortaHeapy{ $hkey }; # but iterate the victim return } } }

        One world, one people

Re^2: Sorting Hash By Large Number Values
by Limbic~Region (Chancellor) on Aug 03, 2005 at 13:27 UTC
    davido,
    Sorting so as to grab the top N elements is only efficient if you need the list to be sorted anyway.

    If you haven't seen it, you might want to take a look at Better mousetrap (getting top N values from list X). I am not sure your statement is always true though in general it is good advice.

    Cheers - L~R

Re^2: Sorting Hash By Large Number Values
by anonymized user 468275 (Curate) on Aug 04, 2005 at 08:16 UTC
    Having looked into this further, it seems to me that Heap::Simple requires too much overhead, whereas Heap::Elem::NumRev implements the heap efficiently for a reversed numeric sorting order.

    One world, one people