in reply to Limit the size of a hash
If it is just some arbitrarily chosen number of the highest-valued items (say, 10) that you are after, beware of 'add one and (re)sort and discard' algorithms.
O(N logN) sorting algorithms are pretty efficient for large sets of data, but (relatively) pretty lousy at small sets: the logN factor is negligible compared to N when N is large, but significant when N is small.
E.g. log(100e6) = 18.42, which is ~0.000018% of N; but log(10) = 2.3, which is 23% of N.
So, given your ~715e6 records, the 'add 1, re-sort and discard 1' algorithms do 715 million * O(log 11) sorts, which is ~= O(1.7e9).
Conversely, if you add 1000 items, sort and discard 1000, you get 715 thousand * O(log 1010) sorts, which is ~= O(4.9e6).
And (theoretically), if you add 1e6 items, sort, and discard 1e6, you get 715 * O(log 1000010) sorts, which is ~= O(9878).
Of course, these CompSci theoretical complexity estimates tend to ignore the cost of memory allocations, deallocations, etc. and thus never work out that way in reality, but they do give an indication of where to look for savings.
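For reference, a throwaway snippet reproducing those ballpark figures (number of sorts times the log of the items sorted each time; Perl's log() is the natural log):

```perl
# Back-of-the-envelope cost estimates for ~715e6 records and a top-10 list.
use strict;
use warnings;

printf "add 1   at a time: %.3g\n", 715e6 * log( 11 );          # ~1.7e9
printf "add 1e3 at a time: %.3g\n", 715e3 * log( 1010 );        # ~4.9e6
printf "add 1e6 at a time: %.3g\n", 715   * log( 1_000_010 );   # ~9.9e3
```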
The benchmark below compares the 'add 1' and 'add many' approaches for various values of 'many'; adding somewhere between 100 and 1000 items at a time seems to work best for an entirely in-memory, CPU-bound process:
```perl
#! perl -slw
use strict;
use List::Util qw[ sum min ];
use Time::HiRes qw[ time ];

# Run a block and report both wall-clock and cpu time.
sub timeit(&) {
    my $soFar = sum( times() );
    my $start = time;
    $_[0]->();
    return sprintf "Wall:%6.3f secs; cpu:%6.3f secs",
        time() - $start, sum( times() ) - $soFar;
}

our $T //= 10e6;    # -T: total number of records to process
our $N //= 10;      # -N: how many top items to keep
our $B //= 10;      # -B: batch size for the 'add many' variant

my @data = map{ sprintf "%4.2f", rand() } 1 .. $T;

# Add one item at a time; re-sort and discard the smallest every time.
print 'Add 1 and sort; add 1 and sort ...: ', timeit {
    my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];
    for( my $i = $N; $i < $T; ++$i ) {
        @top = ( sort{ $b <=> $a } @top, $data[ $i ] )[ 0 .. $N-1 ];
    }
#    print "@top";
};

# Add $B items at a time; sort once per batch and truncate back to $N.
print "Add $B and sort; add $B and sort ...: ", timeit {
    my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];
    for( my $i = $N; $i < $T; $i += $B ) {
        @top = sort{ $b <=> $a } @top, @data[ $i .. min( $i+$B-1, $#data ) ];
        $#top = $N-1;
    }
#    print "@top";
};
__END__
C:\test>1052498-3 -T=10e6 -B=1e1
Add 1 and sort; add 1 and sort ...: Wall:89.400528 secs; cpu:89.047000 secs
Add 1e1 and sort; add 1e1 and sort ...: Wall:18.483192 secs; cpu:18.391000 secs

C:\test>1052498-3 -T=10e6 -B=1e2
Add 1 and sort; add 1 and sort ...: Wall:87.353784 secs; cpu:86.438000 secs
Add 1e2 and sort; add 1e2 and sort ...: Wall:9.385142 secs; cpu:9.234000 secs

C:\test>1052498-3 -T=10e6 -B=2e2
Add 1 and sort; add 1 and sort ...: Wall:87.380 secs; cpu:87.046 secs
Add 2e2 and sort; add 2e2 and sort ...: Wall: 9.135 secs; cpu: 9.156 secs

C:\test>1052498-3 -T=10e6 -B=1e3
Add 1 and sort; add 1 and sort ...: Wall:87.436786 secs; cpu:86.625000 secs
Add 1e3 and sort; add 1e3 and sort ...: Wall:9.377329 secs; cpu:9.329000 secs

C:\test>1052498-3 -T=10e6 -B=1e4
Add 1 and sort; add 1 and sort ...: Wall:87.435 secs; cpu:86.298 secs
Add 1e4 and sort; add 1e4 and sort ...: Wall:10.077 secs; cpu: 9.843 secs
```
Now, if someone decides to try that where the data is bigger than memory, and so has to come from an external file, they'll probably find that the IO-limited nature of the process means that adding 1 or 1000 at a time makes little or no difference to the overall elapsed time.
But, if you then look at the CPU used, the 1 at a time will be an order of magnitude, or more, higher. And cycles cost money. There is little purpose in burning extra cycles to achieve the same outcome in the same time.
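Coming back to the original question of limiting the size of a hash: if what is wanted is to keep only the N highest-valued keys, the same batching idea might look something like the sketch below. The key names, the random values and the choices of $N and $B are illustrative only, not taken from the thread.

```perl
#! perl -slw
use strict;

my $N = 10;      # how many keys to keep
my $B = 1000;    # how many insertions to buffer between prunes
my %top;         # never grows beyond $N + $B keys
my $since = 0;   # insertions since the last prune

# Keep only the $N highest-valued keys.
sub prune {
    return if keys( %top ) <= $N;
    my @keep = ( sort { $top{ $b } <=> $top{ $a } } keys %top )[ 0 .. $N-1 ];
    %top = map { $_ => $top{ $_ } } @keep;
    $since = 0;
}

# Illustrative input: a million key/value pairs.
for my $i ( 1 .. 1e6 ) {
    $top{ "key$i" } = rand();
    prune() if ++$since >= $B;
}
prune();    # final prune so the limit holds exactly

print "$_ => $top{ $_ }" for sort { $top{ $b } <=> $top{ $a } } keys %top;
```

Between prunes the hash grows to at most $N + $B keys, so $B trades a little memory for far fewer sorts.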
Replies are listed 'Best First'.

- Re^2: Limit the size of a hash (beware of 'add one and (re)sort and discard' algorithms) by hdb (Monsignor) on Sep 05, 2013 at 14:43 UTC
  - by BrowserUk (Patriarch) on Sep 05, 2013 at 15:20 UTC
- Re^2: Limit the size of a hash (beware of 'add one and (re)sort and discard' algorithms) by Limbic~Region (Chancellor) on Sep 05, 2013 at 14:57 UTC
  - by BrowserUk (Patriarch) on Sep 05, 2013 at 15:25 UTC