NathanE has asked for the wisdom of the Perl Monks concerning the following question:

Hey all;

I need to find the top $x number of scores out of a hash table, where the scores can be large or small (1 to millions). The hash table is in the format of $Stats{$PortNumber} = $score.

Initially I thought that I could simply pull the top $x (10) scores by obtaining their keys ($PortNumber) for use later in the program. So, I used the following code:

$NumOfPortsToGet = 10;
$c = 0;
foreach $value (sort { $Stats{$a} cmp $Stats{$b} } keys %Stats) {
    if ($c < $NumOfPortsToGet && $value ne "UnauthOrigin") {
        $Ports[$c] = $value;
        $c++;
    }
}
However, after trying this, it would always come back with entries that had a score of 99. After printing out the entire list of scores, it became immediately apparent what the problem was. I have other scores like 98347, but those are ranked down with the 98 scores. This happens throughout the rankings.

I thought about zero-padding the numbers, but I do not know how large the numbers may actually get in the wild (they could get extremely large), which makes it difficult to zero-pad them efficiently (if that would even work).
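
For illustration, the kind of thing I had in mind was the sketch below; the 20-digit width is an assumed cap on the scores, which is exactly what I can't guarantee:

# Pad each score to a fixed width so string comparison matches numeric order.
# Only safe if no score ever exceeds the assumed 20-digit width.
my %Padded  = map { $_ => sprintf("%020d", $Stats{$_}) } keys %Stats;
my @ByScore = sort { $Padded{$a} cmp $Padded{$b} } keys %Padded;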

Does anyone have any idea how I could get a sort on the REAL numeric values out of this? I haven't been able to find anything yet through searches.

Thanks for everyone's help, and take care.

UPDATE: FIXED - STUPID MISTAKE (changed cmp to <=>). Thanks, JediWizard.

Replies are listed 'Best First'.
Re: Sorting Hash By Large Number Values
by JediWizard (Deacon) on Aug 02, 2005 at 23:36 UTC

    You need to use numeric comparison rather than string comparison, a.k.a. <=>, the "spaceship operator". Change cmp to <=>. See perlop.
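
    Something like this (an untested sketch; the sample scores are made up) would pull the top $NumOfPortsToGet keys by numeric score. Note that putting $b before $a in the comparison sorts descending, so the largest scores come first:

    # Numeric, descending sort on the hash values (untested sketch, sample data invented).
    my %Stats = ( 80 => 99, 443 => 98347, 22 => 7, UnauthOrigin => 12 );
    my $NumOfPortsToGet = 10;
    my @Ports = grep { $_ ne "UnauthOrigin" }
                sort { $Stats{$b} <=> $Stats{$a} } keys %Stats;
    splice @Ports, $NumOfPortsToGet if @Ports > $NumOfPortsToGet;
    print "$_ => $Stats{$_}\n" for @Ports;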

    Update: In the interest of completeness, you may have trouble if not all of the values are numeric. I believe Regexp::Common (its number patterns) can help you distinguish numbers from strings.

    Update 2: I just checked.... numbers should sort lexicographically prior to letters... that should not cause much of an issue. Sorry for the false lead.


    They say that time changes things, but you actually have to change them yourself.

    —Andy Warhol

Re: Sorting Hash By Large Number Values
by davido (Cardinal) on Aug 03, 2005 at 05:53 UTC

    Sorting so as to grab the top N elements is only efficient if you need the list to be sorted anyway.

    On the other hand, if you just need top N, and you need it again from time to time, you might use a heap. Heaps are great for keeping track of what comes at the top, without wasting a lot of time keeping the bottom precisely sorted. If all you want is the top N elements just once, a heap is overkill, but I couldn't resist the opportunity to brush up on my recollection of how to use Heap::Simple.

    The following snippet generates a hash of a whole bunch of elements. The keys are 'AA' through 'FF', and the values are random integers from 0 through 999. The hash is then converted to a heap (this constitutes the bulk of the overhead) where the heap is heap-ordered by values from the hash. Key/Value pairs are stored in anonymous arrays held in the heap. When you pluck an item from the heap, it's set up to yield the item with the greatest value. The next item you pluck will be the item with second greatest value, and so on.

    Here's the example snippet...

    use strict;
    use warnings;
    use Heap::Simple::Perl;

    my %hash;
    $hash{$_} = int( rand( 1000 ) ) for 'AA' .. 'FF';

    my $find_n = 10;
    my $heap   = heapify( \%hash );
    my $top    = get_top_n( $heap, $find_n );

    foreach my $element ( @{$top} ) {
        print "$element->[0]\t=>$element->[1]\n";
    }

    sub heapify {
        my( $href ) = $_[0];
        my $heap = Heap::Simple::Perl->new(
            'order'    => '>',
            'elements' => ['Array'],
        );
        $heap->insert( [ $href->{$_}, $_ ] ) foreach keys %{$href};
        return $heap;
    }

    sub get_top_n {
        my( $heap, $n ) = @_;
        my $top_n;
        for ( 1 .. $n ) {
            my( $value, $key ) = @{ $heap->extract_top() };
            push @{ $top_n }, [ $key, $value ];
        }
        return $top_n;
    }

    Again, a heap will be useful and efficient if you frequently need the top N items from the list. But it's not all that efficient if all you're doing is grabbing the top N one time. For that, just go through the list one element at a time and keep track of the top N values as you go (no need to sort the whole thing; that's just a waste of time).
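
    If you do only need it once, that single pass might look roughly like this (an untested sketch, with made-up sample data); it keeps a small buffer of the best N seen so far:

    use strict;
    use warnings;

    my %Stats = map { $_ => int rand 1_000_000 } 1 .. 500;   # made-up sample data
    my $n = 10;

    my @top;    # [ key, value ] pairs, best first
    while ( my ( $key, $value ) = each %Stats ) {
        # Skip anything that can't beat the worst value we're already keeping.
        next if @top == $n && $value <= $top[-1][1];
        push @top, [ $key, $value ];
        @top = sort { $b->[1] <=> $a->[1] } @top;   # tiny list, cheap to re-sort
        pop @top if @top > $n;
    }
    print "$_->[0]\t=> $_->[1]\n" for @top;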

    You can get more information on heaps and priority queues here: Wikipedia: Heap, as well as (of course) from the POD for Heap::Simple. My favorite reference, however, is Mastering Algorithms with Perl.

    Again, I don't mean to imply that a heap is definitely the answer to your need. It may be (depending on how many times you'll be pulling the top-N elements), but it may just be too much work. I post it here as a point of discussion, and thank you for giving me the excuse to play around with these useful data structures again.


    Dave

      The "Any" type is probably more natural for Heap::Simple here. The code then becomes something like:
      #!/usr/bin/perl -w
      use strict;
      use Heap::Simple;

      my %Stats = map { $_ => int rand 1000 } "AA" .. "FF";
      my $NumOfPortsToGet = 10;

      my $heap = Heap::Simple->new(
          order     => '<',
          elements  => 'Any',
          max_count => $NumOfPortsToGet,
          dirty     => 1,
      );
      for my $key (keys %Stats) {
          $heap->key_insert($Stats{$key}, $key);
      }
      print "$_ => $Stats{$_}\n" while defined($_ = $heap->extract_first);
      If you have the pure-Perl version of Heap::Simple, this will be slower than simple non-heap solutions up to a max_count of about 30 or so. After that the heap will win (measured without accounting for the time of the first new()).

      Heap::Simple::Perl does partial evaluation with the arguments to new() to build a (relatively) efficient set of methods for exactly the type of heap you want to use. That makes the (first) new() slower, but actual runtime faster than you might expect.

      If you have the XS version of Heap::Simple it will beat the naive version for almost any max_count.

      And I think it's easier to understand in all cases.

      Inserting each hash key into a heap doesn't seem to offer much performance gain, if any: on the one hand each insert becomes more and more costly, and on the other perl knows it can optimise the sort to take advantage of the hash, whereas the heapify can't do that in this example. I suspect, therefore, that a great deal more brain-grief would be needed to beat the simplistic sort algorithm, even though only the top N are needed. A rough way to test that is sketched below.
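
      If someone wanted to check, a comparison could be sketched roughly like this (untested; it reuses the Heap::Simple settings from the snippet above, and the hash size and value range are invented):

      use strict;
      use warnings;
      use Benchmark qw(cmpthese);
      use Heap::Simple;

      my %Stats = map { $_ => int rand 1_000_000 } 1 .. 10_000;   # invented data
      my $n = 10;

      cmpthese( -2, {
          # Sort every key, then slice off the top N.
          sort_all => sub {
              my @top = ( sort { $Stats{$b} <=> $Stats{$a} } keys %Stats )[ 0 .. $n - 1 ];
          },
          # Bounded min-heap: keep only the N largest values seen.
          heap => sub {
              my $heap = Heap::Simple->new(
                  order     => '<',
                  elements  => 'Any',
                  max_count => $n,
                  dirty     => 1,
              );
              $heap->key_insert( $Stats{$_}, $_ ) for keys %Stats;
              my @top;
              unshift @top, $_ while defined( $_ = $heap->extract_first );
          },
      } );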

      One world, one people

        By adding " 'max_count' => 10 " at heap creation time, you can limit the number of elements tracked in the heap. As new elements are added, inferior elements outside the top ten are dropped. That reduces the time needed to add elements and improves the efficiency of the heap solution. Of course, you then don't have a heap full of values, but only the top ten, which may limit the data structure's usefulness later in the script if the heap is needed again.


        Dave

      davido,
      Sorting so as to grab the top N elements is only efficient if you need the list to be sorted anyway.

      If you haven't seen it, you might want to take a look at Better mousetrap (getting top N values from list X). I am not sure your statement is always true, though in general it is good advice.

      Cheers - L~R

      Having looked into this further, it seems to me that Heap::Simple requires too much overhead, whereas Heap::Elem::NumRev implements the heap efficiently for a reversed numeric sorting order.

      One world, one people

Re: Sorting Hash By Large Number Values
by anonymized user 468275 (Curate) on Aug 03, 2005 at 09:40 UTC
    The 'fixed' OP will nevertheless continue running through items 11 through millions, testing against $c on every one of those iterations. How about using 'last' to get out of the loop when the job is done? I also rearranged the condition handling for brevity:
    $NumOfPortsToGet = 10;
    $c = 0;
    foreach $key (sort { $Stats{$a} <=> $Stats{$b} } keys %Stats) {
        ( $c < $NumOfPortsToGet ) or last;
        ( $key eq "UnauthOrigin" ) or $Ports[$c++] = $key;
    }

    One world, one people