in reply to Limit the size of a hash
If it is just some arbitrarily chosen number of the highest-valued items (say, 10) that you are after, beware of 'add one and (re)sort and discard' algorithms.
O(N logN) sorting algorithms are pretty efficient for large sets of data, but (relatively) pretty lousy at small sets: the logN factor is negligible compared to N when N is large, but significant when N is small.
E.g. log(100e6) = 18.42, which is ~0.000018% of N; but log(10) = 2.3, which is 23% of N.
So, given your ~715e6 records, the 'add 1, re-sort and discard 1' algorithms do 715 million * O(log 11) sorts, which is ~= O(1.7e9).
Conversely, if you add 1000 items, sort and discard 1000, you get 715 thousand * O(log 1010) sorts, which is ~= O(4.9e6).
And (theoretically), if you add 1e6 items, sort, and discard 1e6, you get 715 * O(log 1000010) sorts, which is ~= O(9878).
Of course, these CompSci theoretical complexity estimates tend to ignore the cost of memory allocations, deallocations, etc. and thus never work out that way in reality, but they do give an indication of where to look for savings.
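For reference, a throwaway snippet reproducing those ballpark figures (number of sorts times the log of the items sorted each time; Perl's log() is the natural log):

```perl
# Back-of-the-envelope cost estimates for ~715e6 records and a top-10 list.
use strict;
use warnings;

printf "add 1   at a time: %.3g\n", 715e6 * log( 11 );          # ~1.7e9
printf "add 1e3 at a time: %.3g\n", 715e3 * log( 1010 );        # ~4.9e6
printf "add 1e6 at a time: %.3g\n", 715   * log( 1_000_010 );   # ~9.9e3
```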
The benchmark below compares the 'add 1' and 'add many' approaches for various values of 'many'; adding somewhere between 100 and 1000 items at a time seems to work best for an entirely in-memory, CPU-bound process:
```perl
#! perl -slw
use strict;
use List::Util qw[ sum min ];
use Time::HiRes qw[ time ];

# Run a block and report both wall-clock and cpu time.
sub timeit(&) {
    my $soFar = sum( times() );
    my $start = time;
    $_[0]->();
    return sprintf "Wall:%6.3f secs; cpu:%6.3f secs",
        time() - $start, sum( times() ) - $soFar;
}

our $T //= 10e6;    # -T: total number of records to process
our $N //= 10;      # -N: how many top items to keep
our $B //= 10;      # -B: batch size for the 'add many' variant

my @data = map{ sprintf "%4.2f", rand() } 1 .. $T;

# Add one item at a time; re-sort and discard the smallest every time.
print 'Add 1 and sort; add 1 and sort ...: ', timeit {
    my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];
    for( my $i = $N; $i < $T; ++$i ) {
        @top = ( sort{ $b <=> $a } @top, $data[ $i ] )[ 0 .. $N-1 ];
    }
#    print "@top";
};

# Add $B items at a time; sort once per batch and truncate back to $N.
print "Add $B and sort; add $B and sort ...: ", timeit {
    my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];
    for( my $i = $N; $i < $T; $i += $B ) {
        @top = sort{ $b <=> $a } @top, @data[ $i .. min( $i+$B-1, $#data ) ];
        $#top = $N-1;
    }
#    print "@top";
};
__END__
C:\test>1052498-3 -T=10e6 -B=1e1
Add 1 and sort; add 1 and sort ...: Wall:89.400528 secs; cpu:89.047000 secs
Add 1e1 and sort; add 1e1 and sort ...: Wall:18.483192 secs; cpu:18.391000 secs

C:\test>1052498-3 -T=10e6 -B=1e2
Add 1 and sort; add 1 and sort ...: Wall:87.353784 secs; cpu:86.438000 secs
Add 1e2 and sort; add 1e2 and sort ...: Wall:9.385142 secs; cpu:9.234000 secs

C:\test>1052498-3 -T=10e6 -B=2e2
Add 1 and sort; add 1 and sort ...: Wall:87.380 secs; cpu:87.046 secs
Add 2e2 and sort; add 2e2 and sort ...: Wall: 9.135 secs; cpu: 9.156 secs

C:\test>1052498-3 -T=10e6 -B=1e3
Add 1 and sort; add 1 and sort ...: Wall:87.436786 secs; cpu:86.625000 secs
Add 1e3 and sort; add 1e3 and sort ...: Wall:9.377329 secs; cpu:9.329000 secs

C:\test>1052498-3 -T=10e6 -B=1e4
Add 1 and sort; add 1 and sort ...: Wall:87.435 secs; cpu:86.298 secs
Add 1e4 and sort; add 1e4 and sort ...: Wall:10.077 secs; cpu: 9.843 secs
```
Now, if someone decides to try that where the data is bigger than memory, and so has to come from an external file, they'll probably find that the IO-limited nature of the process means that adding 1 or 1000 at a time makes little or no difference to the overall elapsed time.
But, if you then look at the CPU used, the 1 at a time will be an order of magnitude, or more, higher. And cycles cost money. There is little purpose in burning extra cycles to achieve the same outcome in the same time.
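Coming back to the original question of limiting the size of a hash: if what is wanted is to keep only the N highest-valued keys, the same batching idea might look something like the sketch below. The key names, the random values and the choices of $N and $B are illustrative only, not taken from the thread.

```perl
#! perl -slw
use strict;

my $N = 10;      # how many keys to keep
my $B = 1000;    # how many insertions to buffer between prunes
my %top;         # never grows beyond $N + $B keys
my $since = 0;   # insertions since the last prune

# Keep only the $N highest-valued keys.
sub prune {
    return if keys( %top ) <= $N;
    my @keep = ( sort { $top{ $b } <=> $top{ $a } } keys %top )[ 0 .. $N-1 ];
    %top = map { $_ => $top{ $_ } } @keep;
    $since = 0;
}

# Illustrative input: a million key/value pairs.
for my $i ( 1 .. 1e6 ) {
    $top{ "key$i" } = rand();
    prune() if ++$since >= $B;
}
prune();    # final prune so the limit holds exactly

print "$_ => $top{ $_ }" for sort { $top{ $b } <=> $top{ $a } } keys %top;
```

Between prunes the hash grows to at most $N + $B keys, so $B trades a little memory for far fewer sorts.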
Replies are listed 'Best First'.

- Re^2: Limit the size of a hash (beware of 'add one and (re)sort and discard' algorithms) by hdb (Monsignor) on Sep 05, 2013 at 14:43 UTC
  - by BrowserUk (Patriarch) on Sep 05, 2013 at 15:20 UTC
- Re^2: Limit the size of a hash (beware of 'add one and (re)sort and discard' algorithms) by Limbic~Region (Chancellor) on Sep 05, 2013 at 14:57 UTC
  - by BrowserUk (Patriarch) on Sep 05, 2013 at 15:25 UTC