Re: *Fastest* way to print a hash sorted by value (heap)
by tye (Sage) on Aug 08, 2003 at 22:03 UTC
This is what a "heap sort" is great for: finding and sorting the top N items of a very large list. I keep meaning to implement this in Perl. I don't think the Heap modules include this functionality, which I find a bit strange -- but *shrug*, maybe I'm just not seeing it; the modules are more complicated than I expected, so either I didn't dig very deep into them or I just don't remember having done so.
In any case, the technique isn't horribly difficult and I'm sure google will find you several good tutorials on it.
Update: So far, all of the other responses are waving their hands vaguely around "keep track of the top N items". That sounds simple enough, but some ways of doing this are certainly more efficient than others.
The first naive method would be to keep the top N items unsorted in a list and just compare new items as you get them against every single item in the list until you find one that you can replace with the new one, or finish comparing and reject the new one. That is a ton of compares, perhaps even more than just sorting the full list.
The second naive method would be to keep the top N items in sorted order, so that a single compare tells you whether to reject the new item. But if you decide to keep it, then you have to insert it into the list in sorted order. That will take log(N) compares (if you use a binary search), and then you'll have to shift about half of the list over to fit it in (or use something like a linked list, which isn't very Perlish and so is a bit slower).
The heap sort keeps the top N items in an array in a partially sorted order. You can still tell whether a new item should be inserted with a single comparison. The algorithm then tells you how to use only log(N) comparisons and log(N) swaps to insert a new item while preserving the partial ordering. Finally, it tells you how to pull the items out of the "heap" so that they all come out in fully sorted order.
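A minimal sketch of that bounded min-heap (illustrative code, not from the Heap modules; %HASH and $N here are stand-ins):
use strict;

my %HASH = map { $_ => 1 + int rand 1000 } 'AA' .. 'CZ';
my $N = 5;    # assumes %HASH has at least $N keys

# Restore the min-heap property below index $i (smallest value at the root).
sub sift_down {
    my ($heap, $i) = @_;
    my $n = @$heap;
    while (1) {
        my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
        my $min = $i;
        $min = $l if $l < $n and $heap->[$l][1] < $heap->[$min][1];
        $min = $r if $r < $n and $heap->[$r][1] < $heap->[$min][1];
        last if $min == $i;
        @$heap[$i, $min] = @$heap[$min, $i];
        $i = $min;
    }
}

my @heap;    # [key, value] pairs; $heap[0] is the smallest value kept
while (my ($k, $v) = each %HASH) {
    if (@heap < $N) {
        push @heap, [$k, $v];
        if (@heap == $N) {    # heapify once the buffer is full
            sift_down(\@heap, $_) for reverse 0 .. int($N / 2) - 1;
        }
    } elsif ($v > $heap[0][1]) {    # one compare rejects most items
        $heap[0] = [$k, $v];        # replace the smallest kept item ...
        sift_down(\@heap, 0);       # ... at a cost of log(N) swaps
    }
}

my @top;    # remove the minimum repeatedly to get a descending list
while (@heap) {
    @heap[0, -1] = @heap[-1, 0];
    unshift @top, pop @heap;
    sift_down(\@heap, 0);
}
printf "%s => %d\n", @$_ for @top;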
- tye
Re: *Fastest* way to print a hash sorted by value
by dws (Chancellor) on Aug 08, 2003 at 21:47 UTC
my @largest;
while ( my($k,$v) = each %HASH ) {
    if ( @largest < $N ) {
        push @largest, $v;
    } else {
        # if $v is greater than the smallest element
        # in @largest, replace that element with $v
    }
}
I've left you with a little work to do, but that's the general idea.
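One way to fill in that branch (a sketch using a simple linear scan; the replies below discuss why you'd want something cheaper than scanning on every miss):
# One possible completion (not from the original post): scan for
# the index of the smallest element, then replace it if $v beats it.
my $min = 0;
for my $i (1 .. $#largest) {
    $min = $i if $largest[$i] < $largest[$min];
}
$largest[$min] = $v if $v > $largest[$min];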
The "little extra work" includes keeping the resulting array sorted. Otherwise, there's no way to detect which is currently the smallest value, and which one needs replacing. Only $n items need to be printed, and if $n is small enough, I can imagine that perl's built-in sort will do well enough. As the data is sorted, you can try insert a new value at the appropriate place yourself, but the overhead of doing this in pure perl might not be worth it.
Also, personally, I'd prefer to keep more than $n items if there's a tie for the lowest value of those to keep. For example, if you need the top 3 values and the top 5 hash values are (1, 2, 4, 4, 5), I'd actually keep the top 4: (1, 2, 4, 4).
So here's some code that implements that. I also added lots of reporting code, showing exactly what's going on.
Feel free to "optimize" it.
my %HASH;
foreach('AA' .. 'ZZ') {
    $HASH{$_} = 1 + int rand 1000;
}
my $keep = 20;
my @top;
if(scalar keys %HASH <= $keep) {
    # You actually need to show them all.
    @top = sort { $b->[1] <=> $a->[1] } map [$_, $HASH{$_}], keys %HASH;
} else {
    my $threshold;
    while(my($key, $value) = each %HASH) {
        if(@top < $keep) {
            printf "Adding %s => %d\n", $key => $value;
            push @top, [$key, $value];
            # Track the smallest kept value as the threshold.
            $threshold = $value if !defined $threshold or $value < $threshold;
        } elsif($value > $threshold) {
            @top = sort { $b->[1] <=> $a->[1] } @top, [$key, $value];
            printf "Inserting %s => %d\n", $key => $value;
            $threshold = $top[$keep-1][1];
            printf "New threshold: %d\n", $threshold;
            while($top[-1][1] < $threshold) {
                printf "Popping %s => %d\n", @{$top[-1]};
                pop @top;
            }
            printf "Keeping %d items\n", scalar @top;
        } elsif($value == $threshold) {
            printf "Adding %s => %d\n", $key => $value;
            push @top, [$key, $value];
        }
    }
}
use Data::Dumper;
$Data::Dumper::Terse = 1;
print Dumper \@top;
The hidden trap in this suggestion is
if(scalar keys %HASH <= $keep) {
The risk is that if %HASH really does hold 10 million entries, and you haven't yet blown memory and started thrashing badly, loading all of the keys into a new temporary array will push you over the edge. (Unless Perl is clever enough to optimize this case, which I don't believe it is. Does anyone know otherwise?) See update.
Virtual memory is a real nuisance. A lot of schemes that work fast on paper collapse entirely once you have to start swapping memory to disk.
Update: BrowserUk has graciously pointed out that keys in scalar context returns the number of keys.
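That is, the guard above is safe as written: the explicit scalar (or an assignment to a scalar) imposes scalar context, so no temporary list of 10 million keys is built:
my $count = keys %HASH;    # scalar context: just a count, no list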
Feel free to "optimize" it.
He's gonna need some optimisations, or War & Peace :)
The problem with this method is that you have to re-sort the keep array every time you add a new key.
If you are keeping the top 100 elements from 10_000_000, you could end up sorting the 100-element array 9_999_901 times. Even with the existing 100 elements already sorted, and despite 5.8's default mergesort and its ability to "take advantage of pre-existing order", adding a single value to a pre-sorted array of one hundred elements still seems to take an average of around 120 calls to the comparator.
use sort '';
print sort::current;                                           # mergesort

@array = 1..99;
$c = 0; @b = sort{ $c++; $a <=> $b } @array, 0;   print $c;    # 113
$c = 0; @b = sort{ $c++; $a <=> $b } @array, 100; print $c;    # 99
$c = 0; @b = sort{ $c++; $a <=> $b } @array, 50;  print $c;    # 127
$c = 0; @b = sort{ $c++; $a <=> $b } @array, 49;  print $c;    # 127
$c = 0; @b = sort{ $c++; $a <=> $b } @array, 75;  print $c;    # 127
$c = 0; @b = sort{ $c++; $a <=> $b } @array, 12;  print $c;    # 122
I think this falls into tye's second category.
Using a binary chop to locate the insertion point will take at most 8 comparisons and an average of 5. Add a couple of pre-comparisons to save doing the chop when the element to be added is less than the smallest to date, or greater than the largest, and you can avoid needing to do the binary chop much of the time.
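For illustration, a minimal sketch of that binary-chop insertion (assumed details: @top holds [key, value] pairs sorted descending by value; the data here is random):
use strict;

my $N = 100;
my @top;    # [key, value] pairs, sorted descending by value

sub insert_top {
    my ($key, $value) = @_;
    # Pre-comparison: one compare rejects anything not beating the smallest.
    return if @top == $N and $value <= $top[-1][1];
    my ($lo, $hi) = (0, scalar @top);
    while ($lo < $hi) {    # binary chop for the insertion point
        my $mid = int(($lo + $hi) / 2);
        if ($top[$mid][1] >= $value) { $lo = $mid + 1 }
        else                         { $hi = $mid }
    }
    splice @top, $lo, 0, [$key, $value];    # shift the tail up by one
    pop @top if @top > $N;                  # drop the value that fell off
}

insert_top($_, 1 + int rand 1_000_000) for 'AA' .. 'ZZ';
Each call costs at most about log2(N) comparisons plus one splice, rather than a full sort.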
OK, I found a way to make it faster. The compromise is to trade space for speed... Instead of calling sort for every item that gets added to the list of the select few, I collect $workset extra items before doing the sort... That reduces the total number of sort calls to a relatively small number, roughly $workset times fewer. In my example code, with $workset = 100, I got just 3 calls total.
use strict;
use Time::HiRes 'time';
my %HASH;
foreach('AA' .. 'ZZ') {
    $HASH{$_} = 1 + int rand 1000;
}
my $t0 = time;
my @top;
{
    my $keep = 100;
    my $workset = 100;
    my $threshold;
    my $added = 0;
    while(my($key, $value) = each %HASH) {
        if(@top < $keep) {
            printf "Starting with %s => %d\n", $key => $value;
            push @top, [$key, $value];
        } elsif(not defined $threshold or $value > $threshold) {
            printf "Adding %s => %d\n", $key => $value;
            push @top, [$key, $value];
            if(++$added >= $workset) {
                printf "Shakedown from %d items\n", scalar @top;
                @top = sort { $b->[1] <=> $a->[1] } @top;
                $threshold = $top[my $i = $keep - 1][1];
                printf "New threshold: %d\n", $threshold;
                while($top[++$i][1] == $threshold) { }
                $#top = $i - 1;
                printf "Keeping %d items\n", scalar @top;
                $added = 0;
            }
        } elsif($value == $threshold) {
            printf "Adding %s => %d\n", $key => $value;
            push @top, [$key, $value];
        } else {
            printf "Skipping %s => %d\n", $key => $value;
        }
    }
    printf "Final shakedown from %d items\n", scalar @top;
    @top = sort { $b->[1] <=> $a->[1] } @top;
    if(@top > $keep) {
        $threshold = $top[my $i = $keep - 1][1];
        printf "Final threshold: %d\n", $threshold;
        while($top[++$i][1] == $threshold) { }
        $#top = $i - 1;
        printf "Keeping %d items\n", scalar @top;
    }
}
use Data::Dumper;
$Data::Dumper::Terse = 1;
print Dumper scalar @top, \@top;
my $t1 = time;
$, = "\t";
print $t1 - $t0;
Re: *Fastest* way to print a hash sorted by value
by Abigail-II (Bishop) on Aug 08, 2003 at 21:53 UTC
The fastest way would be to print them to /dev/null.
More seriously, there are basically two options: sort the entries and take the first $N values of the sorted list, or go through the hash once and keep track of the top $N entries. The latter approach means you do fewer comparisons; the former means you do less work in Perl and more in C. Which one will be faster depends on the size of your hash, on $N, and on the amount of core memory available to your process.
Abigail
Re: *Fastest* way to print a hash sorted by value
by cchampion (Curate) on Aug 08, 2003 at 23:43 UTC
Maybe a hash is not the best data structure to solve your problem.
I would use a binary tree or a B-tree, on the grounds that inserting is relatively cheap and the inserted items come out already sorted. Finding the first N items in sorted order is then just a matter of walking the tree while keeping a count; once the count is reached, you stop listing.
Now, what tool provides such a mechanism? A DBMS, for sure. You insert your values into a table with an index on the columns you need to keep in order, and you get your list faster than you could with a hard-coded algorithm in your script. Using a database, your sorted list would be just a combination of ORDER BY and LIMIT (choose a DBMS that supports that feature).
You would pay with slower insertion (inserting records into a table is slower than storing into a hash), but you would minimize the retrieval time, since the sorting (by index) is already done.
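A minimal sketch of that retrieval, assuming DBD::SQLite (any DBMS with ORDER BY ... LIMIT works; the table and column names are made up for illustration):
use strict;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=counts.db', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE counts (pair TEXT, n INTEGER)');
$dbh->do('CREATE INDEX counts_n ON counts (n)');

# ... one insert-or-update per IP pair while reading the log ...

# Retrieval is then just ORDER BY plus LIMIT:
my $top = $dbh->selectall_arrayref(
    'SELECT pair, n FROM counts ORDER BY n DESC LIMIT 100');
printf "%s => %d\n", @$_ for @$top;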
BTW, if merlyn were here, he would say that he has a column on this subject (or maybe two).
Re: *Fastest* way to print a hash sorted by value
by TomDLux (Vicar) on Aug 08, 2003 at 22:03 UTC
Where do your 10,000,000 key-value pairs come from? The approach with the least overhead would be to determine the top N while loading them in, obtaining them from Mars, etc. Since you have to go through the input once anyway, you can keep track of the top N at the same time.
--
TTTATCGGTCGTTATATAGATGTTTGCA
I'm grabbing the data from our firewall logs. I figure there isn't a way around doing one pass through the logs before I can process them.
I.e., how many times does IP 10.0.0.10 go to 192.168.0.10?
I still have to figure out what the theoretical maximum speed of processing all of the data is (i.e., just read the file and do nothing).
I'm starting to think there isn't much room for me to speed my script up.
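One way to pin down that theoretical maximum is to time a pass that reads the log and does nothing else (a sketch; the file name is assumed):
use strict;
use Time::HiRes 'time';

my $t0 = time;
open my $fh, '<', 'firewall.log' or die "can't open log: $!";
1 while <$fh>;    # read and discard every line
printf "read-only pass: %.2f seconds for %d lines\n", time - $t0, $.;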
I know this is a Perl forum, but sometimes Perl by itself is not the best tool for the job. You might consider testing the speed of using another program like sort(1) on the file before processing it with Perl (or just run sort from your Perl program).
Check out the helpful advice in this thread: Fast/Efficient Sort for Large Files
Then, you can read in the lines until you have the top N values and you're done.
If your log files are not easily sortable using sort(1) because of strange formatting and delimiters (like web server access logs), you may pre-process them (with Perl, of course) so that they have nice column separators for sort to recognize.
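A minimal sketch of that pipeline, assuming the pre-processed file has lines of the form "count<TAB>key" (the format and file name are assumptions, not from the thread):
use strict;

my $N = 100;
# List-form pipe open avoids the shell; needs a Unix-ish perl.
open my $sorted, '-|', 'sort', '-t', "\t", '-k1,1rn', 'counts.txt'
    or die "can't run sort: $!";
while (<$sorted>) {
    print;
    last if $. >= $N;    # stop after the top N lines
}
close $sorted;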
At least those IP pairs can be compressed well. Let's see... something like:
my $N = 100;    # assumed; the original snippet leaves $N undefined
my %occurrences;
while(<>) {
    # Extract the IP addresses
    my ($from_ip, $to_ip) = /...some regex.../ or next;
    # Losslessly compress them into an 8-byte key
    my $key = pack("C8", split(/\./, "$from_ip.$to_ip"));
    $occurrences{$key}++;
}
# Now convert the counts to small, easily sorted things
my @data;
while (my ($key, $count) = each %occurrences) {
    # Use network byte-order (big-endian) because it
    # will sort alphabetically in numeric order
    push @data, pack("N", $count) . $key;
    # To save space, shrink the hash. Note that it is
    # normally unsafe to modify the set of entries in
    # a hash while iterating over them, but deleting
    # the current entry returned by 'each' is explicitly
    # allowed (perldoc -f each).
    delete $occurrences{$key};
}
# Now we have a simple array of (packed) counts and IP
# pairs. The easiest thing to implement would then be:
my @topN = (reverse sort @data)[0 .. $N - 1];
# A trickier implementation that would avoid the superlinear
# expense of sorting would be to grab a random element and
# partition the array into a group of elements that are
# greater than that element, and a group of those that are
# smaller. As soon as the size of the group that is greater
# exceeds $N, throw out the other group and just finish
# creating the group of larger elements. If you're unlucky
# and your "larger" group never gets up to $N elements,
# start over again with a smaller pivot element.
#
# Repeat until you are left with a more manageable size,
# and then just do the sort.
while (@data > 10000) { # Assuming $N < 10000
    my (@big, @small);
    # The first element may be "random" enough, since it's
    # in perl's hash order. But you really probably ought
    # to pick a better one, just in case.
    my $pivot = shift(@data);
    while (@data) {
        my $elt = shift(@data);
        ($elt gt $pivot) ? push(@big, $elt) : push(@small, $elt);
        if (@big >= $N) {
            while (@data) {
                $elt = shift(@data);
                push(@big, $elt) if $elt gt $pivot;
            }
            last;
        }
    }
    if (@big >= $N) {
        @data = @big;
    } else {
        # Put the pivot at the end so we don't reuse it!
        push @data, @big, @small, $pivot;
    }
}
Unpacking the result is left as an exercise for the reader. I didn't test this all that much, but it seems to more or less work, and the idea is the main thing.
Re: *faster sort* ( 83% saved for 1M; for 10M ??)
by BrowserUk (Patriarch) on Aug 09, 2003 at 10:10 UTC
Using a binary chop to insert elements into an array of the highest N elements seems to work quite well, taking around 1/6th the time to find the highest 100 elements from 1_000_000, despite being pure Perl.
The saving at 500_000 is around 2/3rds.
P:\test>282315 -N=100 -MAX=500000 -TRIALS=1
499984 499984 499984 499984 499984 499984 499984 499984 499984 499984 499984
499984 499984 499984 499984 499984 499984 499984 499984 499984 499984 499984
1 trial of insertion sort (8.002s total)
1 trial of standard sort & slice (26.198s total)
The saving at 1_000_000 is around 5/6ths.
P:\test>282315 -N=100 -MAX=1000000 -TRIALS=1
999969 999969 999969 999969 999969 999969 999969 999969 999969 999969 999969 9
999969 999969 999969 999969 999969 999969 999969 999969 999969 999969 999969 9
1 trial of insertion sort (15.733s total)
1 trial of standard sort & slice (90.730s total)
The saving at 10_000_000 will be ?? 99/100ths ??? I don't have enough memory to try this!
You could probably save even more, if you don't need the whole 10_000_000 in memory for anything other than the sort, by reading them in batches of 10_000 or so and passing them to iSortN(). So long as you pass in the same arrayref as the first parameter, you can sort in batches or even one at a time (with the obvious penalties).
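A hypothetical sketch of that batched calling pattern (iSortN() itself lives in the linked code; the only detail taken from the text above is that the first parameter is the accumulating arrayref, and read_batch() is a made-up stand-in):
my @top;                                   # pass the SAME arrayref every call
while (my @batch = read_batch(10_000)) {   # read_batch() is hypothetical
    iSortN(\@top, @batch);                 # exact signature per the linked node
}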