Whilst I agree that the algorithm is limited to numerics, (the triump of the specific--the OP's "positive integers"--over the generic), I disagree with your big O assessment.
It perfectly feasible to pre-allocate the bitstrings to their largest possible size (2^32/8 = 512MB each). Easily and relativly quickly achieved on a fully populated 32-bit machine. But more importantly, it only need be done once regardless of the size of the input arrays, so it is an O(2) constant factor, and is disregarded by big-O.
Of course, allocating huge chunks of ram for small input arrays means that the constant factor would be proportionally quite large, but that's the limitation of big-O.
Conversely, the hashes used by the other algorithms grow by a process of repeated reallocation in geometrically increasing chunks, which means the time cost grows geometrically with size of the input arrays. In addition, the grep solutions have to build stack-bound lists from each of the arrays.
So, if you are going to try and account for the memory allocation in assesment of the algorithms, then you would have to come up with something like:
O( M+N + SIGMA( 2^3+2^4+...+2^n>N) + SIGMA( 2^3+...+2^m >M ) + MIN(N,M
+) )
memory for 1st hash memory for 2nd hash memory
+for 1 array*
## (*assuming you chose the smallest array to use with the 2hash/1 lis
+t method)
or
O( M+N + SIGMA( 2^3+2^4+...+2^n>N) + N + M) )
memory for smallest hash memory for the two arrays
## for the 1 hash/2 arrays method.
All of which reenforces my stance that big-O is rarely done properly outside of academic papaers, and is of little use unless it is. A benchmark is more instructive. This run uses arrays of 1 million apiece, with the numbers in the range 0 .. 2^27:
C:\test>unionIntersect.pl -N=1e6 -MAX=27
s/iter barvin roy buk buk2
barvin 8.58 -- -43% -58% -64%
roy 4.86 77% -- -26% -37%
buk 3.59 139% 35% -- -15%
buk2 3.05 182% 60% 18% --
I couldn't go higher than 2^27 because of the need to have enough memory in the process to accommodate the large hashes and lists created by the other algorithms as well. Of course, you do have to go to quite large input arrays before that constant factor for allocating the bitstrings amortises:
C:\test>unionIntersect.pl -N=5e5 -MAX=27
s/iter barvin buk roy buk2
barvin 3.87 -- -28% -40% -50%
buk 2.80 39% -- -17% -30%
roy 2.33 66% 20% -- -16%
buk2 1.95 98% 43% 19% --
BTW: Buk2 simply uses RoyJohnstone's trick of only counting the bits for one solution and deriving the other number with arithmetic:
sub buk2{
my( $aRef, $bRef ) = @_;
my( $aBits, $bBits );
for( $aBits, $bBits ) {
local $\;
open my $ram, '>', \$_;
seek $ram, $MAX-1, 0;
print $ram chr(0);
close $ram;
}
vec( $aBits, $_, 1 ) = 1 for @$aRef;
vec( $bBits, $_, 1 ) = 1 for @$bRef;
my $iCount = unpack '%32b*', ( $aBits & $bBits );
my $uCount = @$aRef + @$bRef - $iCount;
$aBits = $bBits = '';
print 'buk2: ', $iCount, ' : ', $uCount if $SHOW;
return( $iCount, $uCount );
}
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
|