in reply to Best Pairs
If the number of discrete values that can appear in your small arrays is less than 32 -- they do not have to be contiguous -- then you can represent each array by a 32-bit number, with each bit representing one of the numbers. This reduces the space required for each million arrays of up to 32 numbers from around 900MB to 4MB.
...101001010010001 ||||| ||||+- 1 |||+-- 16 |||+-- 75 ||+--- 204 |+---- 205 +----- 206 etc.
This also allows the 528 possible pairs to each be represented by a single, 32-bit number with 2-bits set.
Counting the pairs becomes a process of iterating the 'arrays' and masking them against the pairs, stored as keys in a hash,and incrementing the values.
Finding the top three counts in the resultant hash is easy. This process reduces the problem to O(number of arrays), and on my machine allows them to count the pairs at a rate of a little over 1 second per thousand, or around fifteen minutes/1 million with a storage requirement of 4MB per million + plus startup.
#! perl -slw use strict; use List::Util qw[ max ]; use Benchmark::Timer; my $T = new Benchmark::Timer; our $ARRAYS ||= 10; our $K ||= 3; # Generate all the possible combinations of pairs # And create a hash with them as the keys. my %pairs; for my $n ( 0 .. 19 ){ $pairs{ 1<<$n | 2<<$_ } = 0 for $n .. 19; }; # Generate an 'array' of 1 million 'arrays' with up to 20 # numbers in each. my $arrays = ''; vec( $arrays, $_, 32 ) = int rand( 1 << 19 ) for 1 .. $ARRAYS; # Count the pairs $T->start( 'Count the pairs' ); for my $pair ( keys %pairs ) { for ( 0 .. $ARRAYS-1 ) { $pairs{ $pair }++ if ( vec( $arrays, $_, 32 ) & $pair ) == $pa +ir; } } $T->stop( 'Count the pairs' ); # Find the top K pairs $T->start( 'Find the top K' ); my @topK = ( keys %pairs )[ 1 .. 3 ]; for my $pair ( keys %pairs ) { if( $pairs{ $pair } > max( @pairs{ @topK } ) ) { push @topK, $pair; shift @topK if @topK > 3; } } $T->stop( 'Find the top K' ); # Display the results for my $k ( @topK ) { printf 'Pair [ %d, %d ] ', grep{ $k & (1<<$_) ? $_ : () } 0 .. 19; printf 'appeared %d times'. $/, $pairs{ $k }; } $T->report; print 'Record memory usage';<stdin>; __END__ P:\test>305470 -ARRAYS=10000 -K=3 Pair [ 12, 14 ] appeared 2521 times Pair [ 4, 17 ] appeared 2579 times Pair [ 7, 17 ] appeared 2582 times 1 trial of Count the pairs (10.315s total) 1 trial of Find the top K (10.014ms total) Record memory usage 2772k P:\test>305470 -ARRAYS=100000 -K=3 Pair [ 4, 17 ] appeared 25165 times Pair [ 7, 16 ] appeared 25190 times Pair [ 7, 11 ] appeared 25348 times 1 trial of Count the pairs (106.954s total) 1 trial of Find the top K (10.015ms total) Record memory usage 3124k P:\test>305470 -ARRAYS=200000 -K=3 Pair [ 9, 17 ] appeared 50261 times Pair [ 7, 13 ] appeared 50356 times Pair [ 9, 13 ] appeared 50386 times 1 trial of Count the pairs (213.307s total) 1 trial of Find the top K (10.014ms total) Record memory usage 3524k
|
|---|