NathanE has asked for the wisdom of the Perl Monks concerning the following question:

Hey all;

I need to find the top $x number of scores out of a hash table, where the scores can be large or small (1 to millions). The hash table is in the format of $Stats{$PortNumber} = $score.

Initially I thought that I could simply pull the top $x (10) scores by obtaining their keys ($PortNumber) for use later in the program. So, I used the following code:

$NumOfPortsToGet = 10;
$c = 0;
foreach $value (sort { $Stats{$a} cmp $Stats{$b} } keys %Stats) {
    if ($c < $NumOfPortsToGet && $value ne "UnauthOrigin") {
        $Ports[$c] = $value;
        $c++;
    }
}
However, after trying this, it would always come back with entries that had a score of 99. After printing out the entire list of scores, it became immediately apparent what the problem was. I have other scores like 98347, but those are ranked down with the 98 scores. This happens throughout the rankings.

I thought about zero-padding the numbers, but I do not know how large the numbers may actually get in the wild (they could get extremely large), which makes it difficult to zero-pad them efficiently (if that would even work).
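
For illustration, the kind of thing I had in mind was the sketch below; the 20-digit width is an assumed cap on the scores, which is exactly what I can't guarantee:

# Pad each score to a fixed width so string comparison matches numeric order.
# Only safe if no score ever exceeds the assumed 20-digit width.
my %Padded  = map { $_ => sprintf("%020d", $Stats{$_}) } keys %Stats;
my @ByScore = sort { $Padded{$a} cmp $Padded{$b} } keys %Padded;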

Does anyone have any idea how I could get a sort on the REAL numeric values out of this? I haven't been able to find anything yet through searches.

Thanks for everyone's help, and take care.

UPDATE: FIXED - STUPID MISTAKE (changed cmp to <=>). Thanks, JediWizard.

Replies are listed 'Best First'.
Re: Sorting Hash By Large Number Values
by JediWizard (Deacon) on Aug 02, 2005 at 23:36 UTC

    You need to use numeric comparison rather than string comparison, a.k.a. <=>, the "spaceship operator". Change cmp to <=>. See perlop.
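
    Something like this (an untested sketch; the sample scores are made up) would pull the top $NumOfPortsToGet keys by numeric score. Note that putting $b before $a in the comparison sorts descending, so the largest scores come first:

    # Numeric, descending sort on the hash values (untested sketch, sample data invented).
    my %Stats = ( 80 => 99, 443 => 98347, 22 => 7, UnauthOrigin => 12 );
    my $NumOfPortsToGet = 10;
    my @Ports = grep { $_ ne "UnauthOrigin" }
                sort { $Stats{$b} <=> $Stats{$a} } keys %Stats;
    splice @Ports, $NumOfPortsToGet if @Ports > $NumOfPortsToGet;
    print "$_ => $Stats{$_}\n" for @Ports;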

    Update: In the interest of completeness, you may have trouble if not all of the values are numeric. I believe Regexp::Common (its number patterns) can help you distinguish numbers from strings.

    Update 2: I just checked.... numbers should sort lexicographically prior to letters... that should not cause much of an issue. Sorry for the false lead.


    They say that time changes things, but you actually have to change them yourself.

    —Andy Warhol

Re: Sorting Hash By Large Number Values
by davido (Cardinal) on Aug 03, 2005 at 05:53 UTC

    Sorting so as to grab the top N elements is only efficient if you need the list to be sorted anyway.

    On the other hand, if you just need top N, and you need it again from time to time, you might use a heap. Heaps are great for keeping track of what comes at the top, without wasting a lot of time keeping the bottom precisely sorted. If all you want is the top N elements just once, a heap is overkill, but I couldn't resist the opportunity to brush up on my recollection of how to use Heap::Simple.

    The following snippet generates a hash of a whole bunch of elements. The keys are 'AA' through 'FF', and the values are random integers from 0 through 999. The hash is then converted to a heap (this constitutes the bulk of the overhead) where the heap is heap-ordered by values from the hash. Key/Value pairs are stored in anonymous arrays held in the heap. When you pluck an item from the heap, it's set up to yield the item with the greatest value. The next item you pluck will be the item with second greatest value, and so on.

    Here's the example snippet...

    use strict;
    use warnings;
    use Heap::Simple::Perl;

    my %hash;
    $hash{$_} = int( rand( 1000 ) ) for 'AA' .. 'FF';

    my $find_n = 10;
    my $heap   = heapify( \%hash );
    my $top    = get_top_n( $heap, $find_n );

    foreach my $element ( @{$top} ) {
        print "$element->[0]\t=>$element->[1]\n";
    }

    sub heapify {
        my( $href ) = $_[0];
        my $heap = Heap::Simple::Perl->new(
            'order'    => '>',
            'elements' => ['Array'],
        );
        $heap->insert( [ $href->{$_}, $_ ] ) foreach keys %{$href};
        return $heap;
    }

    sub get_top_n {
        my( $heap, $n ) = @_;
        my $top_n;
        for ( 1 .. $n ) {
            my( $value, $key ) = @{ $heap->extract_top() };
            push @{ $top_n }, [ $key, $value ];
        }
        return $top_n;
    }

    Again, a heap will be useful and efficient if you frequently need the top N items from the list. But it's not all that efficient if all you're doing is grabbing the top N one time. For that, just go through the list one element at a time and keep track of the top N values as you go (no need to sort the whole thing; that's just a waste of time).
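
    If you do only need it once, that single pass might look roughly like this (an untested sketch, with made-up sample data); it keeps a small buffer of the best N seen so far:

    use strict;
    use warnings;

    my %Stats = map { $_ => int rand 1_000_000 } 1 .. 500;   # made-up sample data
    my $n = 10;

    my @top;    # [ key, value ] pairs, best first
    while ( my ( $key, $value ) = each %Stats ) {
        # Skip anything that can't beat the worst value we're already keeping.
        next if @top == $n && $value <= $top[-1][1];
        push @top, [ $key, $value ];
        @top = sort { $b->[1] <=> $a->[1] } @top;   # tiny list, cheap to re-sort
        pop @top if @top > $n;
    }
    print "$_->[0]\t=> $_->[1]\n" for @top;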

    You can get more information on heaps and priority queues here: Wikipedia: Heap, as well as (of course) from the POD for Heap::Simple. My favorite reference, however, is Mastering Algorithms with Perl.

    Again, I don't mean to imply that a heap is definitely the answer to your need. It may be (depending on how many times you'll be pulling the top-N elements), but it may just be too much work. I post it here as a point of discussion, and thank you for giving me the excuse to play around with these useful data structures again.


    Dave

      The "Any" type is probably more natural for Heap::Simple here. The code then becomes something like:
      #!/usr/bin/perl -w
      use strict;
      use Heap::Simple;

      my %Stats = map { $_ => int rand 1000 } "AA" .. "FF";
      my $NumOfPortsToGet = 10;

      my $heap = Heap::Simple->new(
          order     => '<',
          elements  => 'Any',
          max_count => $NumOfPortsToGet,
          dirty     => 1,
      );
      for my $key (keys %Stats) {
          $heap->key_insert($Stats{$key}, $key);
      }
      print "$_ => $Stats{$_}\n" while defined($_ = $heap->extract_first);
      If you have the pure-Perl version of Heap::Simple, this will be slower than simple non-heap solutions up to a max_count of about 30 or so. After that the heap will win (measured without accounting for the time of the first new()).

      Heap::Simple::Perl does partial evaluation with the arguments to new() to build a (relatively) efficient set of methods for exactly the type of heap you want to use. That makes the (first) new() slower, but actual runtime faster than you might expect.

      If you have the XS version of Heap::Simple it will beat the naive version for almost any max_count.

      And I think it's easier to understand in all cases.

      Inserting each hash key into a heap doesn't seem to offer much performance gain, if any: on the one hand each insert becomes more and more costly, and on the other perl knows it can optimise the sort to take advantage of the hash, whereas the heapify can't do that in this example. I suspect, therefore, that a great deal more brain-grief would be needed to beat the simplistic sort algorithm, even though only the top N are needed. A rough way to test that is sketched below.
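
      If someone wanted to check, a comparison could be sketched roughly like this (untested; it reuses the Heap::Simple settings from the snippet above, and the hash size and value range are invented):

      use strict;
      use warnings;
      use Benchmark qw(cmpthese);
      use Heap::Simple;

      my %Stats = map { $_ => int rand 1_000_000 } 1 .. 10_000;   # invented data
      my $n = 10;

      cmpthese( -2, {
          # Sort every key, then slice off the top N.
          sort_all => sub {
              my @top = ( sort { $Stats{$b} <=> $Stats{$a} } keys %Stats )[ 0 .. $n - 1 ];
          },
          # Bounded min-heap: keep only the N largest values seen.
          heap => sub {
              my $heap = Heap::Simple->new(
                  order     => '<',
                  elements  => 'Any',
                  max_count => $n,
                  dirty     => 1,
              );
              $heap->key_insert( $Stats{$_}, $_ ) for keys %Stats;
              my @top;
              unshift @top, $_ while defined( $_ = $heap->extract_first );
          },
      } );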

      One world, one people

        By adding " 'max_count' => 10 " at heap creation time, you can limit the number of elements tracked in the heap. As new elements are added, inferior elements outside the top ten are dropped. That reduces the time needed to add elements and improves the efficiency of the heap solution. Of course, you then don't have a heap full of values, but only the top ten, which may limit the data structure's usefulness later in the script if the heap is needed again.


        Dave

      davido,
      Sorting so as to grab the top N elements is only efficient if you need the list to be sorted anyway.

      If you haven't seen it, you might want to take a look at Better mousetrap (getting top N values from list X). I am not sure your statement is always true, though in general it is good advice.

      Cheers - L~R

      Having looked into this further, it seems to me that Heap::Simple requires too much overhead, whereas Heap::Elem::NumRev implements the heap efficiently for a reversed numeric sorting order.

      One world, one people

Re: Sorting Hash By Large Number Values
by anonymized user 468275 (Curate) on Aug 03, 2005 at 09:40 UTC
    The 'fixed' OP will nevertheless continue running through items 11 through millions, testing against $c on every one of those iterations. How about using 'last' to get out of the loop when the job is done? I also rearranged the condition handling for brevity:
    $NumOfPortsToGet = 10;
    $c = 0;
    foreach $key (sort { $Stats{$a} <=> $Stats{$b} } keys %Stats) {
        ( $c < $NumOfPortsToGet ) or last;
        ( $key eq "UnauthOrigin" ) or $Ports[$c++] = $key;
    }

    One world, one people