Re^2: Fast, Efficient Union and Intersection on arrays

Replies are listed 'Best First'.
Re^3: Fast, Efficient Union and Intersection on arrays by BrowserUk (Patriarch) on Nov 21, 2008 at 01:02 UTC
Whilst I agree that the algorithm is limited to numerics, (the triump of the specific--the OP's "positive integers"--over the generic), I disagree with your big O assessment. It perfectly feasible to pre-allocate the bitstrings to their largest possible size (2^32/8 = 512MB each). Easily and relativly quickly achieved on a fully populated 32-bit machine. But more importantly, it only need be done once regardless of the size of the input arrays, so it is an O(2) constant factor, and is disregarded by big-O. Of course, allocating huge chunks of ram for small input arrays means that the constant factor would be proportionally quite large, but that's the limitation of big-O. Conversely, the hashes used by the other algorithms grow by a process of repeated reallocation in geometrically increasing chunks, which means the time cost grows geometrically with size of the input arrays. In addition, the grep solutions have to build stack-bound lists from each of the arrays. So, if you are going to try and account for the memory allocation in assesment of the algorithms, then you would have to come up with something like: `O( M+N + SIGMA( 2^3+2^4+...+2^n>N) + SIGMA( 2^3+...+2^m >M ) + MIN(N,M +) ) memory for 1st hash memory for 2nd hash memory +for 1 array* ## (assuming you chose the smallest array to use with the 2hash/1 lis +t method)` [download] or `O( M+N + SIGMA( 2^3+2^4+...+2^n>N) + N + M) ) memory for smallest hash memory for the two arrays ## for the 1 hash/2 arrays method.` [download] All of which reenforces my stance that big-O is rarely done properly outside of academic papaers, and is of little use unless it is. A benchmark is more instructive. This run uses arrays of 1 million apiece, with the numbers in the range 0 .. 2^27: `C:\test>unionIntersect.pl -N=1e6 -MAX=27 s/iter barvin roy buk buk2 barvin 8.58 -- -43% -58% -64% roy 4.86 77% -- -26% -37% buk 3.59 139% 35% -- -15% buk2 3.05 182% 60% 18% --` [download] I couldn't go higher than 2^27 because of the need to have enough memory in the process to accommodate the large hashes and lists created by the other algorithms as well. Of course, you do have to go to quite large input arrays before that constant factor for allocating the bitstrings amortises: `C:\test>unionIntersect.pl -N=5e5 -MAX=27 s/iter barvin buk roy buk2 barvin 3.87 -- -28% -40% -50% buk 2.80 39% -- -17% -30% roy 2.33 66% 20% -- -16% buk2 1.95 98% 43% 19% --` [download] BTW: Buk2 simply uses RoyJohnstone's trick of only counting the bits for one solution and deriving the other number with arithmetic: `sub buk2{ my( $aRef, $bRef ) = @_; my( $aBits, $bBits ); for( $aBits, $bBits ) { local $\; open my $ram, '>', \$_; seek $ram, $MAX-1, 0; print $ram chr(0); close $ram; } vec( $aBits, $_, 1 ) = 1 for @$aRef; vec( $bBits, $_, 1 ) = 1 for @$bRef; my $iCount = unpack '%32b', ( $aBits & $bBits ); my $uCount = @$aRef + @$bRef - $iCount; $aBits = $bBits = ''; print 'buk2: ', $iCount, ' : ', $uCount if $SHOW; return( $iCount, $uCount ); }` [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^4: Fast, Efficient Union and Intersection on arrays by ptoulis (Scribe) on Nov 21, 2008 at 02:40 UTC
I will be quick cuz it is 4am here. I was referring to running time for the big-O complexity. Space complexity is a different measure but even if we do count it, the complexity of the algorithms is still O(m+n). This is because the key idea in the algorithms is that we traverse the 1st array, create a hash and then test the hash against the 2nd array, which for arrays with m and n elements is pm+qn, p,q constants, or better O(m+n). If we were not to use hashes the running time would be O(mn) because we would have 2 nested loops. Also,the space allocations are not ~2^n but ~logN which is ignored in the O(m+n) notation. I will strongly disagree* with your statement about big-O notation being done only in academic papers. The notation does not tell you anything about the actual running time of the algorithm, but it is a hint on how the complexity scales with the input. Real benchmarks are most important but they do not replace mathematical analysis but supplement it. In simple terms O(mn) indicates 2 nested loops, while O(m+n) indicates 2 independent loops. That's it. There is no reference to actual time. Complexity theory is a real treasure when we can apply it properly. Actually, your method is a variation of a Bloom Filter* which is known for being very fast. To overcome the O(m+n) a special structure like a rooted tree(heap) is needed, because there is no better solution. This paper (PDF) is one of he many proofs that the union problem is linear on the input. Finally the new version `buk2()` may be faster but drops a key advantage of the previous, which is the ability to handle duplicate array elements. Still your algorithm is the killer one and I haven't think of anything better. That's all for now. I am going for some sleep and come back tomorrow, hopefully with some code on the problem. Till then I will use big-O notation to count the sheep!	[reply] [d/l]
Re^5: Fast, Efficient Union and Intersection on arrays by BrowserUk (Patriarch) on Nov 21, 2008 at 08:25 UTC
I agree with you that trying to use big-O to compare the time-costs of space complexity is pretty pointless. I was simply following moritz lead with the purpose of trying to make that very point. I will strongly disagree with your statement about big-O notation being done only in academic papers. That's not what I said. I said: "that big-O is rarely done properly outside of academic papers, which is quite different. Feel free to disagree with what I actually said also, but you might want to read on first. What I said is not a million miles away from your own later statement:"Complexity theory is a real treasure when we can apply it properly.". The problem is that people will insist on trying to use the analysis of algorithm complexity to compare the time-costs of actual implementations. It simply doesn't work for a variety of reasons. It ignores the reality that in real-world timeclock seconds, a good implementation can be several orders of magnitude better than a bad implementation of exactly the same algorithm. It ignores that an efficient implementation of a "bad algorithm" can trounce an inefficient implementation of a "good" one. It ignores that for some tasks, there are simply no better algorithms (known), and so implementation efficiency is the only game in town. For example, there is currently no known better algorithm for cracking RC4 than brute force, and you are never going to challenge 6-seconds in C with anything written in Perl. As an experiment, I tried calling a C implementation of RC4 from an loop iterating in Perl, and just the costs of moving back and fourth from perl to C, meant that this would take days rather than seconds. (There is an assumption here that juster did actually solve the problem in 6-seconds which I failed to replicate!) It ignores the reality that unlike (say), a Turing Machine with an infinite tape, that on real systems, resources are finite and have time-costs associated with using them. Memory is finite and takes time to allocate, initialise, populate and free. In your own assessments of the algorithms in this thread, you are ignoring the time-costs of insertion and lookup in the hashes. Presumably on the basis of the well-known "hashes are O(1)". But in the real world of Perl (and everywhere else also), hashes: grow geometrically; they have collisions; and they are notoriously bad, in big-o terms, when measured at the implementation level, at exploiting cache coherency. Hence a linear search, and algorithms (like B+ trees that exploit localised linear searching at the lower levels) can beat out hashes for some applications. In the real world. Big-O notation is useful for comparing specific algorithm for achieving specific, and usually very limited, tasks. The problem is that real world tasks tend to be more complex in that they need the application of several specific algorithms, serially or in parallel, to achieve the final goal. And trying to combine the big-O notations of all the sub algorithms, is a task often far more complex that writing the program and benchmrking it. Finally, big-O ignores the reality that in programmer terms, writing two nested loops is no harder than writing to independant loops. And that for small numbers, the difference in run time can be negligable. For example, the OP benchmarked my implementation (not algorithm; moritz had mentioned the possibility of it before I posted), of the vec solution against Roy Johnston's implementation, and his own, on his real-world data sets consisting of "modest sized arrays" of "non-duplicate, positive integers", and found Roy Johnstone's the best for his purposes. And that, as they say, is that. Out of personal interest, I added Roy's implementation to my own benchmark and found that it beat mine (in it's original, non-pre-extending form) when the number of elements in the arrays were less than 30. In the pre-extending version, this rises to 500,000. With the OP's 'modest size' obviously being less than 30, my implementation is a non starter in either form. And, for the OP's specific requirements, all the big-O analysis of algorithms in the world isn't going to change that. In the real world, big-O is no substitute for benchmarks of actual implementations using actual data. Which makes this whole subthread, like big-O notation, academic. I also completely disagree with your assessment that my implementation is "a variation of a Bloom Filter". The key element of a bloom filter is that it is a probabalistic solution subject to false positive possibilities. My implementation does not suffer this limitation, it is completely determanistic. If you want to categorise it, then it is a brute force lookup table. Nothing more. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^6: Fast, Efficient Union and Intersection on arrays by ptoulis (Scribe) on Nov 21, 2008 at 11:53 UTC
Re^7: Fast, Efficient Union and Intersection on arrays by BrowserUk (Patriarch) on Nov 21, 2008 at 12:21 UTC
Re^4: Fast, Efficient Union and Intersection on arrays by moritz (Cardinal) on Nov 21, 2008 at 07:16 UTC
It perfectly feasible to pre-allocate the bitstrings to their largest possible size That's cheating, in some sense. If you consider all limitation of real machines, you will find that perl can only ever handle a fixed, finite amount of RAM (if the pointers are 64 bit, it's `2**64`), every list is limited by that, so every of the mentioned algorithms runs in O(1). Just with a very big constant. That's why you usually rely on the model of an ideal machines (works with bigints, and arbitrarily much RAM) for the run time analysis.	[reply] [d/l]