in reply to *Fastest* way to print a hash sorted by value

What "fast" means in this situation will have a lot to do with how much physical memory you have. At 10+ million entries, how close to virtual memory thrashing are you? If you're close, then keys %HASH might push you over the edge.
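(For context, and not part of the original reply: the danger is keys in list context, which materialises every key at once, whereas each iterates the hash in place. A minimal illustration with stand-in data:)

    my %HASH = ( aardvark => 3, badger => 7 );   # stand-in for the 10M-entry hash

    my @all_keys = keys %HASH;      # list context: builds a list of every key

    while ( my($k, $v) = each %HASH ) {
        # each walks the hash one entry at a time; no key list is built
    }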

I'd be inclined to try an approach that doesn't add to the memory overhead. Something like

    my @largest;
    while ( my($k, $v) = each %HASH ) {
        if ( @largest < $N ) {
            push @largest, $v;
        }
        else {
            # if $v is greater than the smallest element
            # in @largest, replace that element with $v
        }
    }
I've left you with a little work to do, but that's the general idea.
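(To make that concrete, here is one way the blank could be filled in. This is an illustrative sketch, not code from the post above; the sample data and $N are made up. It keeps @largest sorted ascending, so the smallest kept value is always $largest[0].)

    use strict;
    use warnings;

    my %HASH = map { $_ => 1 + int rand 1000 } 'AA' .. 'ZZ';   # stand-in data
    my $N = 10;

    my @largest;
    while ( my($k, $v) = each %HASH ) {
        if ( @largest < $N ) {
            push @largest, $v;
            @largest = sort { $a <=> $b } @largest;   # keep ascending order
        }
        elsif ( $v > $largest[0] ) {
            $largest[0] = $v;                         # replace the smallest kept value
            @largest = sort { $a <=> $b } @largest;   # restore order
        }
    }
    print "$_\n" for reverse @largest;                # largest first

Note that this sketch re-sorts on every replacement, which is exactly the cost debated in the replies below.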

Re: Re: *Fastest* way to print a hash sorted by value
by bart (Canon) on Aug 09, 2003 at 11:40 UTC
    The "little extra work" includes keeping the resulting array sorted. Otherwise, there's no way to detect which element is currently the smallest, and which one needs replacing. Only $n items need to be printed, and if $n is small enough, I can imagine that Perl's built-in sort will do well enough. Since the data is sorted, you could try to insert each new value at the appropriate place yourself, but the overhead of doing that in pure Perl might not be worth it.

    Also, personally, I'd prefer to keep more than $n items if there's a tie for the lowest value of those to keep. For example, if you need the top 3 values and the top 5 hash values are (5, 4, 4, 4, 1), I'd actually keep 4 items: (5, 4, 4, 4), since the third-largest value is tied with the fourth. So here's some code that implements that. I also added lots of reporting code, showing exactly what's going on.
    Feel free to "optimize" it.

    my %HASH;
    foreach ('AA' .. 'ZZ') {
        $HASH{$_} = 1 + int rand 1000;
    }

    my $keep = 20;
    my @top;
    if (scalar keys %HASH <= $keep) {
        # You actually need to show them all.
        @top = sort { $b->[1] <=> $a->[1] } map [$_, $HASH{$_}], keys %HASH;
    }
    else {
        my $threshold;
        while (my($key, $value) = each %HASH) {
            if (@top < $keep) {
                printf "Adding %s => %d\n", $key => $value;
                push @top, [$key, $value];
                # $threshold tracks the smallest value kept so far
                $threshold = $value if !defined $threshold or $value < $threshold;
            }
            elsif ($value > $threshold) {
                @top = sort { $b->[1] <=> $a->[1] } @top, [$key, $value];
                printf "Inserting %s => %d\n", $key => $value;
                $threshold = $top[$keep - 1][1];
                printf "New threshold: %d\n", $threshold;
                while ($top[-1][1] < $threshold) {
                    printf "Popping %s => %d\n", @{$top[-1]};
                    pop @top;
                }
                printf "Keeping %d items\n", scalar @top;
            }
            elsif ($value == $threshold) {
                printf "Adding %s => %d\n", $key => $value;
                push @top, [$key, $value];
            }
        }
    }

    use Data::Dumper;
    $Data::Dumper::Terse = 1;
    print Dumper \@top;
      The hidden trap in this suggestion is
      if(scalar keys %HASH <= $keep) {
      The risk is that if %HASH really does hold 10 million entries, and you haven't yet blown memory and started thrashing badly, loading all of the keys into a new temporary array will push you over the edge. (Unless Perl is clever enough to optimize this case, which I don't believe it is. Does anyone know otherwise?)
      See update.

      Virtual memory is a real nuisance. A lot of schemes that look fast on paper collapse entirely once you have to start swapping memory to disk.

      Update: BrowserUk has graciously pointed out that keys in scalar context returns the number of keys without building a list of them.
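      (A two-line illustration of that difference, not part of the original note:)

          my $count = keys %HASH;    # scalar context: just the count, no key list
          my @all   = keys %HASH;    # list context: builds the complete key list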

        I can imagine that calling keys in scalar context on a tied hash could be really slow, perhaps as slow as the loop that does the actual job later on. Therefore, perhaps I should keep a running counter instead, and postpone the test until after everything is done. The reason for the test is really that I collect the first $keep items without sorting them; in that case, we need a final sort.

        Perhaps we can do a final sort in any case, and skip keeping the counter.

      Feel free to "optimize" it.

      He's gonna need some optimisations, or War & Peace :)

      The problem with this method is that you have to re-sort the keep array every time you add a new key.

      If you are keeping the top 100 elements from 10_000_000, you may end up sorting the 100-element array up to 9_999_900 times. Even with the existing 100 elements already sorted, and despite 5.8's default mergesort, which "takes advantage of pre-existing order", adding a single value to a pre-sorted array of one hundred elements still seems to take an average of around 120 calls to the comparator.

      use sort '';
      print sort::current;                               # mergesort

      @array = 1..99;

      $c = 0; @b = sort { $c++; $a <=> $b } @array, 0;   print $c;   # 113
      $c = 0; @b = sort { $c++; $a <=> $b } @array, 100; print $c;   # 99
      $c = 0; @b = sort { $c++; $a <=> $b } @array, 50;  print $c;   # 127
      $c = 0; @b = sort { $c++; $a <=> $b } @array, 49;  print $c;   # 127
      $c = 0; @b = sort { $c++; $a <=> $b } @array, 75;  print $c;   # 127
      $c = 0; @b = sort { $c++; $a <=> $b } @array, 12;  print $c;   # 122

      I think this falls into tye's second category.

      Using a binary chop to locate the insertion point takes at most 7 comparisons for a 100-element array (since 2**7 > 101 possible positions), instead of the ~120 a re-sort costs. Add a couple of pre-comparisons to skip the chop entirely when the element to be added is smaller than the smallest kept so far, or larger than the largest, and you can avoid even those comparisons much of the time.
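      (Here is a sketch of that binary chop; this is my illustration of the idea, not BrowserUk's code. @$top holds [key, value] pairs sorted in descending order of value.)

          sub insert_sorted {
              my ($top, $pair) = @_;
              my $v = $pair->[1];

              # pre-comparisons: skip the chop at the extremes
              if    ($v <= $top->[-1][1]) { push    @$top, $pair; return }
              elsif ($v >= $top->[0][1])  { unshift @$top, $pair; return }

              # binary chop: find the first index whose value is <= $v
              my ($lo, $hi) = (0, scalar @$top);
              while ($lo < $hi) {
                  my $mid = int(($lo + $hi) / 2);
                  if ($top->[$mid][1] > $v) { $lo = $mid + 1 }
                  else                      { $hi = $mid     }
              }
              splice @$top, $lo, 0, $pair;    # insert in place
          }

      In the top-100 loop you would call insert_sorted(\@top, [$key, $value]) only when $value beats the current threshold, then pop @top back down to 100 elements and update the threshold.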


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      If I understand your problem, I can solve it! Of course, the same can be said for you.

        He's gonna need some optimisations, or War & Peace :)
        He must be a quick reader then.

        I timed the above program keeping 100 items, which I think is reasonable. "AA" .. "ZZ" yields 676 items, so that leaves 576 further iterations that may call sort, which only actually happens when the new value belongs in the current top 100; eventually that should become quite rare. The program takes about 1/6th of a second on my system, a humble 500MHz Pentium. Doing the same for 1 million records will thus take about 6 minutes.

        p.s. The loop without the sort takes 50ms. Thus, the sort takes about 2.4 times as long as the loop itself, or 70% of the total running time.
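        (Worked out: the total is then about 167ms, so the sort accounts for roughly 167 - 50 = 117ms; 117/50 is about 2.3, and 117/167 is about 70%.)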

      OK, I found a way to make it faster. The compromise is to trade space for speed: instead of calling sort for every item that gets added to the list of the select few, I collect $workset extra items before doing the sort. That reduces the total number of sort calls by a factor of roughly $workset. In my example code, with $workset = 100, I got just 3 calls in total.
      use strict;
      use Time::HiRes 'time';

      my %HASH;
      foreach ('AA' .. 'ZZ') {
          $HASH{$_} = 1 + int rand 1000;
      }

      my $t0 = time;
      my @top;
      {
          my $keep    = 100;
          my $workset = 100;
          my $threshold;
          my $added = 0;
          while (my($key, $value) = each %HASH) {
              if (@top < $keep) {
                  printf "Starting with %s => %d\n", $key => $value;
                  push @top, [$key, $value];
              }
              elsif (not defined $threshold or $value > $threshold) {
                  printf "Adding %s => %d\n", $key => $value;
                  push @top, [$key, $value];
                  if (++$added >= $workset) {
                      printf "Shakedown from %d items\n", scalar @top;
                      @top = sort { $b->[1] <=> $a->[1] } @top;
                      $threshold = $top[my $i = $keep - 1][1];
                      printf "New threshold: %d\n", $threshold;
                      while ($top[++$i][1] == $threshold) { }
                      $#top = $i - 1;
                      printf "Keeping %d items\n", scalar @top;
                      $added = 0;
                  }
              }
              elsif ($value == $threshold) {
                  printf "Adding %s => %d\n", $key => $value;
                  push @top, [$key, $value];
              }
              else {
                  printf "Skipping %s => %d\n", $key => $value;
              }
          }

          printf "Final shakedown from %d items\n", scalar @top;
          @top = sort { $b->[1] <=> $a->[1] } @top;
          if (@top > $keep) {
              $threshold = $top[my $i = $keep - 1][1];
              printf "Final threshold: %d\n", $threshold;
              while ($top[++$i][1] == $threshold) { }
              $#top = $i - 1;
              printf "Keeping %d items\n", scalar @top;
          }
      }

      use Data::Dumper;
      $Data::Dumper::Terse = 1;
      print Dumper scalar @top, \@top;

      my $t1 = time;
      $, = "\t";
      print $t1 - $t0;