smellysocks has asked for the wisdom of the Perl Monks concerning the following question:

I was surprised that this wasn't listed in the HASH FAQ, or did I miss it? I have a VERY LARGE HASH (i.e. 10+ million entries) and I'd like to print the hash sorted by value (see sample code below). Before you ask: I only need to print the top N records. I was wondering if anyone has a special incantation that is faster than what I have below.
#!/usr/local/bin/perl -wT
use strict;

my %HASH = ( brown => 1, black => 2, red => 5, fub => 3 );

print " Count |\tKey\n", "-" x 60, "\n";
foreach my $key ( sort { $HASH{$b} <=> $HASH{$a} } keys %HASH ) {
    printf("%8s |\t%s\n", $HASH{$key}, $key);
} # end of foreach
Ideas/comments?

Replies are listed 'Best First'.
Re: *Fastest* way to print a hash sorted by value (heap)
by tye (Sage) on Aug 08, 2003 at 22:03 UTC

    This is what a "heap sort" is great for (finding and sorting the top N items of a very large list). I keep wanting to implement this in Perl. I don't think the Heap modules include this functionality, which I find a bit strange, but *shrug*, maybe I'm just not seeing it; the modules are a bit more complicated than I expected, so either I didn't dig very deep into them or I just don't remember having done so.

    In any case, the technique isn't horribly difficult and I'm sure google will find you several good tutorials on it.

    Update: So far, all of the other responses are waving their hands vaguely around "keep track of the top N items". That sounds simple enough, but some ways of doing this are certainly more efficient than others.

    The first naive method would be to keep the top N items unsorted in a list and just compare new items as you get them against every single item in the list until you find one that you can replace with the new one, or finish comparing and reject the new one. That is a ton of compares, perhaps even more than just sorting the full list.

    The second naive method would be to keep the top N items in sorted order so that you can do a single compare to tell whether to reject the new item. But, if you decide to keep it, then you have to insert it into the list in sorted order. That will take you log(N) compares (if you use a binary search) and then you'll have to shift about half of the list over to fit it in (or use something like a linked list, which isn't very Perlish and so is a bit slower).

    The heap sort keeps the top N items in an array in a partially sorted order. You can still tell whether a new item should be inserted or not with a single comparison. However, the algorithm tells you how to use only log(N) comparisons and log(N) swaps to insert a new item while preserving the partial ordering. Then it tells you how to pull the items out of the "heap" so that they all come out in fully sorted order.
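    Since no code is posted here, here is a rough, untested pure-Perl sketch of that idea. It assumes %HASH from the original post and a top-$N count; the helper sub names are made up, and the CPAN Heap modules may well do this more cleanly:

    my @heap;    # min-heap of [ value, key ], smallest kept value at index 0

    while ( my ($key, $value) = each %HASH ) {
        if ( @heap < $N ) {
            push @heap, [ $value, $key ];
            _sift_up( \@heap, $#heap );
        }
        elsif ( $value > $heap[0][0] ) {    # one compare to accept or reject
            $heap[0] = [ $value, $key ];
            _sift_down( \@heap, 0 );
        }
    }

    # Items come off the heap smallest-first, so reverse for top-down output.
    my @top;
    while (@heap) {
        push @top, $heap[0];
        $heap[0] = $heap[-1];
        pop @heap;
        _sift_down( \@heap, 0 ) if @heap;
    }
    printf "%8d |\t%s\n", @$_ for reverse @top;

    sub _sift_up {
        my ($h, $i) = @_;
        while ( $i > 0 ) {
            my $parent = int( ($i - 1) / 2 );
            last if $h->[$parent][0] <= $h->[$i][0];
            @{$h}[ $parent, $i ] = @{$h}[ $i, $parent ];
            $i = $parent;
        }
    }

    sub _sift_down {
        my ($h, $i) = @_;
        while (1) {
            my $child = 2 * $i + 1;
            last if $child > $#$h;
            $child++ if $child < $#$h and $h->[$child + 1][0] < $h->[$child][0];
            last if $h->[$i][0] <= $h->[$child][0];
            @{$h}[ $i, $child ] = @{$h}[ $child, $i ];
            $i = $child;
        }
    }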

                    - tye
Re: *Fastest* way to print a hash sorted by value
by dws (Chancellor) on Aug 08, 2003 at 21:47 UTC
    What "fast" means in this situation will have a lot to do with how much physical memory you have. At 10+ million entries, how close to virtual memory thrashing are you? If you're close, then keys %HASH might push you over the edge.

    I'd be inclined to try an approach that doesn't add to the memory overhead. Something like

    my @largest;
    while ( my ($k, $v) = each %HASH ) {
        if ( @largest < $N ) {
            push @largest, $v;
        }
        else {
            # if $v is greater than the smallest element
            # in @largest, replace that element with $v
        }
    }
    I've left you with a little work to do, but that's the general idea.

      The "little extra work" includes keeping the resulting array sorted. Otherwise, there's no way to detect which is currently the smallest value, and which one needs replacing. Only $n items need to be printed, and if $n is small enough, I can imagine that perl's built-in sort will do well enough. As the data is sorted, you can try insert a new value at the appropriate place yourself, but the overhead of doing this in pure perl might not be worth it.

      Also, personally, I'd prefer to keep more than $n items if there's a tie for the lowest value of those to keep. For example, if you need the top 3 values and the top 5 hash values are (1, 2, 4, 4, 5), I'd actually keep the top 4: (1, 2, 4, 4). So here's some code that implements that. I also added lots of reporting code, showing exactly what's going on.
      Feel free to "optimize" it.

      my %HASH;
      foreach ('AA' .. 'ZZ') {
          $HASH{$_} = 1 + int rand 1000;
      }

      my $keep = 20;
      my @top;

      if (scalar keys %HASH <= $keep) {
          # You actually need to show them all.
          @top = sort { $b->[1] <=> $a->[1] } map [$_, $HASH{$_}], keys %HASH;
      }
      else {
          my $threshold;
          while (my ($key, $value) = each %HASH) {
              if (@top < $keep) {
                  printf "Adding %s => %d\n", $key => $value;
                  push @top, [$key, $value];
                  # $threshold is the smallest value kept so far
                  $threshold = $value if !defined $threshold or $value < $threshold;
              }
              elsif ($value > $threshold) {
                  @top = sort { $b->[1] <=> $a->[1] } @top, [$key, $value];
                  printf "Inserting %s => %d\n", $key => $value;
                  $threshold = $top[$keep - 1][1];
                  printf "New threshold: %d\n", $threshold;
                  while ($top[-1][1] < $threshold) {
                      printf "Popping %s => %d\n", @{$top[-1]};
                      pop @top;
                  }
                  printf "Keeping %d items\n", scalar @top;
              }
              elsif ($value == $threshold) {
                  printf "Adding %s => %d\n", $key => $value;
                  push @top, [$key, $value];
              }
          }
      }

      use Data::Dumper;
      $Data::Dumper::Terse = 1;
      print Dumper \@top;
        The hidden trap in this suggestion is
        if(scalar keys %HASH <= $keep) {
        The risk is that if %HASH really does hold 10 million entries, and you haven't yet blown memory and started thrashing badly, loading all of the keys into a new temporary array will push you over the edge. (Unless Perl is clever enough to optimize this case, which I don't believe it is. Does anyone know otherwise?)
        See update.

        Virtual memory is a real nuisance. A lot of schemes that work fast on paper collapse entirely once you have to start swapping memory to disk.

        Update: BrowserUk has graciously pointed out that keys in scalar context returns the number of keys, so the check above doesn't actually build a list.

        Feel free to "optimize" it.

        He's gonna need some optimisations, or War & Peace :)

        The problem with this method is that you have to re-sort the keep array every time you add a new key.

        If you are keeping the top 100 elements from 10_000_000, you are going to be sorting the 100-element array 9_999_901 times. Even with the existing 100 elements already sorted, and despite 5.8's default mergesort and its ability to "take advantage of pre-existing order", adding a single value to a pre-sorted array of one hundred elements still seems to take an average of around 120 calls to the comparator.

        use sort '';
        print sort::current;    # mergesort

        @array = 1 .. 99;

        $c = 0; @b = sort { $c++; $a <=> $b } @array, 0;   print $c;    # 113
        $c = 0; @b = sort { $c++; $a <=> $b } @array, 100; print $c;    # 99
        $c = 0; @b = sort { $c++; $a <=> $b } @array, 50;  print $c;    # 127
        $c = 0; @b = sort { $c++; $a <=> $b } @array, 49;  print $c;    # 127
        $c = 0; @b = sort { $c++; $a <=> $b } @array, 75;  print $c;    # 127
        $c = 0; @b = sort { $c++; $a <=> $b } @array, 12;  print $c;    # 122

        I think this falls into tye's second category.

        Using a binary chop to locate the insertion point will take at most 8 comparisons and an average of 5. Add a couple of pre-comparisons (is the new element smaller than the smallest kept so far, or larger than the largest?) and you can skip the chop entirely much of the time.
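        As a rough, untested sketch of that insert (the sub name is made up, and this is not the code behind the timings further down; it assumes @top is kept sorted in descending order and that plain numeric values are being ranked):

        sub insert_top_n {
            my ($top, $n, $value) = @_;

            # Pre-comparison: once the list is full, reject anything that
            # isn't better than the smallest value being kept.
            return if @$top >= $n and $value <= $top->[-1];

            # Binary chop for the insertion point in the descending list.
            my ($lo, $hi) = (0, scalar @$top);
            while ($lo < $hi) {
                my $mid = int(($lo + $hi) / 2);
                if ($top->[$mid] >= $value) { $lo = $mid + 1 }
                else                        { $hi = $mid }
            }
            splice @$top, $lo, 0, $value;
            pop @$top if @$top > $n;    # drop the displaced smallest value
        }

        # e.g.  insert_top_n(\@top, 100, $_) for values %HASH;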


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
        If I understand your problem, I can solve it! Of course, the same can be said for you.

        OK, I found a way to make it faster. The compromise is to trade space for speed... Instead of calling sort for every item that gets added to the list of the select few, I collect $workset extra items before doing the sort... That reduces the total number of sort calls to a relatively small number, roughly $workset times fewer. In my example code, with $workset = 100, I got just 3 calls total.
        use strict;
        use Time::HiRes 'time';

        my %HASH;
        foreach ('AA' .. 'ZZ') {
            $HASH{$_} = 1 + int rand 1000;
        }

        my $t0 = time;
        my @top;
        {
            my $keep    = 100;
            my $workset = 100;
            my $threshold;
            my $added = 0;

            while (my ($key, $value) = each %HASH) {
                if (@top < $keep) {
                    printf "Starting with %s => %d\n", $key => $value;
                    push @top, [$key, $value];
                }
                elsif (not defined $threshold or $value > $threshold) {
                    printf "Adding %s => %d\n", $key => $value;
                    push @top, [$key, $value];
                    if (++$added >= $workset) {
                        printf "Shakedown from %d items\n", scalar @top;
                        @top = sort { $b->[1] <=> $a->[1] } @top;
                        $threshold = $top[my $i = $keep - 1][1];
                        printf "New threshold: %d\n", $threshold;
                        while ($top[++$i][1] == $threshold) { }
                        $#top = $i - 1;
                        printf "Keeping %d items\n", scalar @top;
                        $added = 0;
                    }
                }
                elsif ($value == $threshold) {
                    printf "Adding %s => %d\n", $key => $value;
                    push @top, [$key, $value];
                }
                else {
                    printf "Skipping %s => %d\n", $key => $value;
                }
            }

            printf "Final shakedown from %d items\n", scalar @top;
            @top = sort { $b->[1] <=> $a->[1] } @top;
            if (@top > $keep) {
                $threshold = $top[my $i = $keep - 1][1];
                printf "Final threshold: %d\n", $threshold;
                while ($top[++$i][1] == $threshold) { }
                $#top = $i - 1;
                printf "Keeping %d items\n", scalar @top;
            }
        }

        use Data::Dumper;
        $Data::Dumper::Terse = 1;
        print Dumper scalar @top, \@top;

        my $t1 = time;
        $, = "\t";
        print $t1 - $t0;
Re: *Fastest* way to print a hash sorted by value
by Abigail-II (Bishop) on Aug 08, 2003 at 21:53 UTC
    The fastest way would be to print them to /dev/null.

    More seriously, there are basically two options: sort the entries and take the first $N values of the sorted list, or go through the hash once and keep track of the top $N entries. The latter approach means you do fewer comparisons; the former approach means you do less work in Perl and more in C. Which one will be faster depends on the size of your hash, on $N, and on the amount of core memory available to your process.

    Abigail

Re: *Fastest* way to print a hash sorted by value
by cchampion (Curate) on Aug 08, 2003 at 23:43 UTC

    Maybe a hash is not the best data structure to solve your problem.

    I would use a binary tree or a B-tree, on the grounds that insertion is relatively cheap and the items end up stored in sorted order. Finding the first N items in sorted order is then just a matter of walking the tree while keeping a count. When the count is reached, you stop listing.

    Now, what tool provides such a mechanism? A DBMS, for sure. You insert your values into a table with an index on the columns you need to keep in order, and you get your list faster than you could with a hard-coded algorithm in your script.

    Using a database, your sorted list would be just a combination of ORDER BY and LIMIT (choose a DBMS that supports such a feature).

    You would pay with slower insertion (inserting records into a table is slower than inserting into a hash), but you would minimize the retrieval time, since the sorting (by index) is already done.
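    For instance, a rough sketch of that approach using DBD::SQLite (purely one choice of DBMS for illustration; the file, table, and column names are invented, the regex and $N are placeholders, and the upsert syntax needs a reasonably recent SQLite):

    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=counts.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do('CREATE TABLE IF NOT EXISTS hits (pair TEXT PRIMARY KEY, n INTEGER)');

    # One pass over the logs, counting in the table rather than in a hash.
    my $ins = $dbh->prepare(
        'INSERT INTO hits (pair, n) VALUES (?, 1)
         ON CONFLICT(pair) DO UPDATE SET n = n + 1'
    );
    while (<>) {
        my ($from, $to) = /...some regex.../ or next;
        $ins->execute("$from $to");
    }
    $dbh->commit;

    # The top N is then just ORDER BY + LIMIT; an index on n helps.
    $dbh->do('CREATE INDEX IF NOT EXISTS hits_n ON hits (n)');
    my $top = $dbh->selectall_arrayref(
        'SELECT n, pair FROM hits ORDER BY n DESC LIMIT ?', undef, $N);
    printf "%8d |\t%s\n", @$_ for @$top;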

    BTW, if merlyn were here, he would say that he has a column on this subject (or maybe two).

Re: *Fastest* way to print a hash sorted by value
by TomDLux (Vicar) on Aug 08, 2003 at 22:03 UTC

    Where do your 10,000,000 key-value pairs come from? The approach with the least overhead would be to determine the top N while loading them in, obtaining them from Mars, etc. Since you have to go through the input once anyway, you can keep track of the top N at the same time.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      I'm grabbing the data from our firewall logs. I figure there isn't a way around doing one pass through the logs before I can process them.
      I.e., how many times does IP 10.0.0.10 go to 192.168.0.10?
      I still have to figure out what the theoretical max for processing all of the data is (i.e., just read the file and do nothing).
      I'm starting to think there isn't much room for me to speed my script up.

        You might try using external programs that are optimized to deal with such situations efficiently. I wish standard 'sort' knew how to include a count when doing 'sort -u' (like 'uniq -c' does). But it might be worth trying to have your Perl script just extract the IP addresses and output "10.0.0.10 192.168.0.10\n" via a pipe to "sort | uniq -c | sort +n | head -$N".
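        In Perl, that might look something like the sketch below (the regex is a placeholder for whatever your firewall log format needs, $N is assumed, and the reverse numeric sort is spelled the modern way, "sort -rn"):

        open my $pipe, '|-', "sort | uniq -c | sort -rn | head -$N"
            or die "Cannot start pipeline: $!";
        while (<>) {
            my ($from, $to) = /...some regex.../ or next;
            print {$pipe} "$from $to\n";
        }
        close $pipe or die "Pipeline failed: $!";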

                        - tye
        I know this is a Perl forum, but sometimes Perl by itself is not the best tool for the job. You might consider testing the speed of using another program like sort(1) on the file before processing it with Perl (or just run sort from your Perl program).

        Check out the helpful advice in this thread: Fast/Efficient Sort for Large Files

        Then, you can read in the lines until you have the top N values and you're done.

        If your log files are not easily sortable using sort(1) because of strange formatting and delimiters (like web server access logs), you may pre-process them (with Perl, of course) so that they have nice column separators for sort to recognize.

        At least IP addresses can be compressed well. Let's see... something like:
        my %occurrences;
        while (<>) {
            # Extract the IP addresses
            my ($from_ip, $to_ip) = /...some regex.../
                or next;

            # Losslessly compress them into an 8-byte key
            my $key = pack("C8", split(/\./, "$from_ip.$to_ip"));
            $occurrences{$key}++;
        }

        # Now convert the counts to small, easily sorted things
        my @data;
        while (my ($key, $count) = each %occurrences) {
            # Use network byte-order (big-endian) because it
            # will sort alphabetically in numeric order
            push @data, pack("N", $count) . $key;

            # To save space, shrink the hash.  Note that it is
            # normally unsafe to modify the set of entries in
            # a hash while iterating over them, but deleting
            # the current entry returned by 'each' is explicitly
            # allowed (perldoc -f each).
            delete $occurrences{$key};
        }

        # Now we have a simple array of (packed) counts and IP
        # pairs.  The easiest thing to implement would then be:
        return (reverse sort @data)[0 .. $N - 1];

        # A trickier implementation that would avoid the superlinear
        # expense of sorting would be to grab a random element and
        # partition the array into a group of elements that are
        # greater than that element, and a group of those that are
        # smaller.  As soon as the size of the group that is greater
        # exceeds $N, throw out the other group and just finish
        # creating the group of larger elements.  If you're unlucky
        # and your "larger" group never gets up to $N elements,
        # start over again with a smaller pivot element.
        #
        # Repeat until you are left with a more manageable size,
        # and then just do the sort.
        while (@data > 10000) {    # Assuming $N < 10000
            my (@big, @small);

            # The first element may be "random" enough, since it's
            # in perl's hash order.  But you really probably ought
            # to pick a better one, just in case.
            my $pivot = shift(@data);

            while (@data) {
                my $elt = shift(@data);
                ($elt gt $pivot) ? push(@big, $elt) : push(@small, $elt);
                if (@big >= $N) {
                    while (@data) {
                        $elt = shift(@data);
                        push(@big, $elt) if $elt gt $pivot;
                    }
                    last;
                }
            }
            if (@big >= $N) {
                @data = @big;
            }
            else {
                # Put the pivot at the end so we don't reuse it!
                push @data, @big, @small, $pivot;
            }
        }
        Unpacking the result is left as an exercise for the reader. I didn't test this all that much, but it seems to more or less work, and the idea is the main thing.
Re: *faster sort* ( 83% saved for 1M; for 10M ??)
by BrowserUk (Patriarch) on Aug 09, 2003 at 10:10 UTC

    Using a binary chop to insert elements into an array of the highest N elements seems to work quite well, taking around 1/6th of the time to find the highest 100 elements from 1_000_000, despite being pure perl.

    The saving at 500_000 is around 2/3rds.

    P:\test>282315 -N=100 -MAX=500000 -TRIALS=1
    499984 499984 499984 499984 499984 499984 499984 499984 499984 499984 499984
    499984 499984 499984 499984 499984 499984 499984 499984 499984 499984 499984
    1 trial of insertion sort (8.002s total)
    1 trial of standard sort & slice (26.198s total)

    The saving at 1_000_000 is around 5/6ths.

    P:\test>282315 -N=100 -MAX=1000000 -TRIALS=1
    999969 999969 999969 999969 999969 999969 999969 999969 999969 999969 999969
    999969 999969 999969 999969 999969 999969 999969 999969 999969 999969 999969
    1 trial of insertion sort (15.733s total)
    1 trial of standard sort & slice (90.730s total)

    The saving at 10_000_000 will be... 99/100ths?? I don't have enough memory to try this!

    You could probably save even more, if you don't need the whole 10_000_000 in memory for any reason other than the sort, by reading them in in batches of 10_000 or so and passing them to iSortN(). So long as you pass in the same arrayref as the first parameter, you can sort in batches or even one at a time (with the obvious penalties).


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: *Fastest* way to print a hash sorted by value
by Anonymous Monk on Aug 09, 2003 at 02:05 UTC
    Have you tried DB_File?

      And the expansion on this is: DB_File (or its big brother BerkeleyDB) provides a tied hash database (or you can eschew the tying if you like and use it directly; somewhat like Cache::Cache, except different) which is kept sorted by key. You're pushing the sorting off to the SleepyCat DB in C, and it happens during the store operation instead of at retrieval. It's almost certain you'd want to tweak some of the cache or page settings, though.
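      A rough sketch of how that might look with DB_File's BTREE interface, assuming you key a second BTREE on a packed count plus IP pair so the on-disk order is the one you want to print (the file names, the regex, and $N are placeholders):

      use strict;
      use Fcntl;
      use DB_File;

      # First pass: count occurrences in a tied (on-disk) hash.
      tie my %count, 'DB_File', 'counts.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
          or die "Cannot tie counts.db: $!";
      while (<>) {
          my ($from, $to) = /...some regex.../ or next;
          $count{"$from $to"}++;
      }

      # Second pass: store packed "count + pair" keys in another BTREE,
      # so the database keeps them in sorted order for us (in C).
      my $db = tie my %by_count, 'DB_File', 'topn.db',
                   O_CREAT | O_RDWR, 0644, $DB_BTREE
          or die "Cannot tie topn.db: $!";
      while (my ($pair, $n) = each %count) {
          $by_count{ pack('N', $n) . $pair } = 1;
      }

      # Walk the BTREE backwards (largest count first) with the seq() API,
      # which never pulls the whole key list into memory.
      my ($key, $dummy, $printed) = ('', '', 0);
      for (my $st = $db->seq($key, $dummy, R_LAST);
           $st == 0 && $printed < $N;
           $st = $db->seq($key, $dummy, R_PREV)) {
          printf "%8d |\t%s\n", unpack('N', $key), substr($key, 4);
          $printed++;
      }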