Fast, Efficient Union and Intersection on arrays

by barvin (Beadle)
on Nov 20, 2008 at 16:31 UTC

barvin has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm doing graph traversal and calculating the Jaccard distance on each pair of vertices. Calculating the intersection and union is consuming 98% of my processing time.

I've tried Set::IntSpan, Set::Array and List::Compare, but they are all even slower than my own code. Is there a VERY fast (maybe implemented in C) way to calculate intersection and union on two arrays? Just for interest, my implementation is ($in and $jn are array refs of positive integers):

sub get_int_uni {
    my ($in, $jn) = @_;

    my (%int, %uni);
    for my $i (@$in) {
        $uni{$i}++;
    }
    for my $j (@$jn) {
        $int{$j}++ if $uni{$j};
        $uni{$j}++;
    }
    return ((scalar keys %int), (scalar keys %uni));
}
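
For context, these two counts then feed into the Jaccard distance like this (a minimal sketch; jaccard_distance() is just a hypothetical wrapper, not part of the code I'm timing):

sub jaccard_distance {
    my ($in, $jn) = @_;
    my ($int, $uni) = get_int_uni($in, $jn);
    return 0 if $uni == 0;      # both neighbour lists empty; avoid division by zero
    return 1 - $int / $uni;     # Jaccard distance = 1 - |intersection| / |union|
}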

Re: Fast, Efficient Union and Intersection on arrays
by BrowserUk (Patriarch) on Nov 20, 2008 at 18:03 UTC

    Hm. I think your routine is broken? I get these results:

    c:\test>unionIntersect -N=10 -SHOW
    2 2 3 4 5 5 8 9 12 14
    4 4 7 7 9 9 10 10 11 11
    barvin: 5 : 11
    buk: 2 : 11

    The union contains 11 elements, and the intersection only 2 (4 and 9), but your routine reports 5?

    Anyway, this is my benchmark which shows a vec implementation is quite a bit faster:

    #! perl -slw
    use strict;
    use Benchmark qw[ cmpthese ];

    our $N ||= 100;
    our $SHOW;

    sub barvin {
        my ($in, $jn) = @_;
        my (%int, %uni);
        for my $i (@$in) {
            $uni{$i}++;
        }
        for my $j (@$jn) {
            $int{$j}++ if $uni{$j};
            $uni{$j}++;
        }
        print 'barvin: ', ((scalar keys %int), ' : ', (scalar keys %uni)) if $SHOW;
        return ((scalar keys %int), (scalar keys %uni));
    }

    sub buk {
        my( $aRef, $bRef ) = @_;
        my( $aBits, $bBits ) = ('') x 2;
        vec( $aBits, $_, 1 ) = 1 for @$aRef;
        vec( $bBits, $_, 1 ) = 1 for @$bRef;
        my $uCount = unpack '%32b*', ( $aBits | $bBits );
        my $iCount = unpack '%32b*', ( $aBits & $bBits );
        print 'buk: ', $iCount, ' : ', $uCount if $SHOW;
        return( $iCount, $uCount );
    }

    our @a = map int( rand $N * 1.5 ), 1 .. $N;
    our @b = map int( rand $N * 1.5 ), 1 .. $N;

    print "@{[ sort { $a <=> $b } @a ]}\n@{[ sort { $a <=> $b } @b ]}\n" if $SHOW;

    cmpthese $SHOW ? 1 : -1, {
        barvin => q[ our( @a, @b ); barvin( \@a, \@b ); ],
        buk    => q[ our( @a, @b ); buk( \@a, \@b ); ],
    };

    __END__
    c:\test>unionIntersect -N=10
              Rate barvin    buk
    barvin 33212/s     --   -29%
    buk    47045/s    42%     --

    c:\test>unionIntersect -N=100
             Rate barvin   buk
    barvin 4111/s     --  -53%
    buk    8777/s   113%    --

    c:\test>unionIntersect -N=1000
            Rate barvin   buk
    barvin 357/s     --  -64%
    buk    998/s   179%    --

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Ah yes, I'm assuming unique elements in the two hashes, which is correct for my data. Thanks very much for your suggestions. These vectors are new to me, so some good stuff for me to chew on here.
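
      For anyone else to whom vec is new, a minimal sketch of the two tricks buk() relies on, as I understand them:

      # vec( $str, $offset, 1 ) treats $str as a bit string and sets/reads bit $offset,
      # so each array element becomes one set bit (duplicates collapse automatically).
      my $bits = '';
      vec( $bits, $_, 1 ) = 1 for ( 2, 4, 9, 4 );

      # unpack '%32b*' computes a checksum over the unpacked bits, i.e. it counts
      # how many bits are set.
      print unpack( '%32b*', $bits ), "\n";    # prints 3 (bits 2, 4 and 9)

      # Bitwise string operators then give the set operations directly:
      #   $aBits & $bBits  ->  bits set in both   (intersection)
      #   $aBits | $bBits  ->  bits set in either (union)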
Re: Fast, Efficient Union and Intersection on arrays
by Roy Johnson (Monsignor) on Nov 20, 2008 at 18:07 UTC
    I note two things: you only want the number of elements in the intersection and union, not the actual intersection and union; and your method suggests that there are no duplicates in the sets. So you don't need to find the union explicitly: it is just the total number of elements in I and J, minus the intersection.
    sub get_int_uni2 {
        my ($in, $jn) = @_;
        my %i;
        @i{@$in} = ();
        my $int = grep exists $i{$_}, @$jn;
        return ($int, @$in + @$jn - $int);
    }
    Sadly, I don't see any real improvement in performance (thanks to BrowserUK for the benchmarkery).

    Caution: Contents may have been coded under pressure.
Re: Fast, Efficient Union and Intersection on arrays
by moritz (Cardinal) on Nov 20, 2008 at 16:45 UTC
    If you know that each array itself doesn't contain duplicate elements, you could try something along these lines:
    my (@a, @b);   # initialize @a and @b here
    my %hash;
    @hash{@a} = (1) x @a;
    my @union = @a, grep { !$hash{$_} } @b;
    my @intersection = grep { $hash{$_} } @b;

    That has the advantage of using only one hash, which might buy you some performance. But I don't know if it is actually faster than anything else.

    (If you provide a small Benchmark with lists of typical sizes you might get much better answers; for example for a large number of small integers it might be more memory efficient to use vec than a hash, which might in turn result in some performance benefit.)

      It seems counterproductive to grep the @b array twice, but it's actually faster than using a single for loop.

      I did a benchmark, and there is a 20% speed boost for doing two greps over a single for loop with short arrays (fewer than 10 elements). The boost shrinks as the arrays grow in size; at array size ~50 the boost is only about 8%.

      use strict;
      use warnings;
      use Benchmark qw(cmpthese clearallcache);

      my @a = (1..50);
      my @b = (3..80);

      #two_greps();
      #one_for();
      #exit;

      print "Range ($a[0],$a[$#a]) vs ($b[0],$b[$#b])\n";
      cmpthese( 100_000, {
          two_greps => \&two_greps,
          one_for   => \&one_for,
      });

      clearallcache();

      @a = (1..5);
      @b = (3..8);

      print "Range ($a[0],$a[$#a]) vs ($b[0],$b[$#b])\n";
      cmpthese( 100_000, {
          two_greps => \&two_greps,
          one_for   => \&one_for,
      });

      sub two_greps {
          my %hash;
          @hash{@a} = (1) x @a;
          my @union = @a, grep { !$hash{$_} } @b;
          my @intersection = grep { $hash{$_} } @b;
          # print "union @union\n";
          # print "intersection @intersection\n";
      }

      sub one_for {
          my %hash;
          @hash{@a} = (1) x @a;
          my @union = @a;
          my @intersection;
          foreach (@b) {
              if( exists $hash{$_} ) { push @intersection, $_ }
              else                   { push @union, $_ }
          }
          # print "union @union\n";
          # print "intersection @intersection\n";
      }

      __DATA__
      Range (1,50) vs (3,80)
                  Rate   one_for two_greps
      one_for   5829/s        --       -8%
      two_greps 6318/s        8%        --
      Range (1,5) vs (3,8)
                   Rate   one_for two_greps
      one_for   57143/s        --      -17%
      two_greps 68823/s       20%        --

      Thanks for posting this thought-provoking code.


      TGI says moo

        I remember one time getting a noticeable speed improvement by using a C-style for (;;) loop over a foreach loop for a large array. I'm wondering if that would have helped here.

        But there is a flaw in the grep algorithm: it does not compute the union correctly if the first array is shorter than the second. If you account for this, the for algorithm will be about 6-7% faster.

        Nġnda, WK
Re: Fast, Efficient Union and Intersection on arrays
by barvin (Beadle) on Nov 20, 2008 at 19:08 UTC
    Thanks, all, for a fabulous lesson in efficiency and creativity. I implemented BrowserUK's vector routine and Roy Johnson's single-grep routine, and at 100,000 iterations with moderately sized arrays (mean 151, stdev 185) the winner is:
                 Wall Time
    barvin       54.134s
    BrowserUK    26.905s
    Roy Johnson  21.349s
    

    So for those coming in by Google...

    To calculate the count of intersection and union on two arrays (used by me to calculate the Jaccard Distance on the edges of a graph), pass two array references that have no internal duplicates to Roy's subroutine below. Thanks Roy.

    sub get_int_uni2 {
      my ($in, $jn) = @_;
      my %i;
      @i{@$in} = ();
      my $int = grep exists $i{$_}, @$jn;
      return ($int, @$in + @$jn - $int);
    }
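
    If your arrays might contain internal duplicates, one option (a sketch only, not benchmarked here) is to strip them with a hash first:

    # Sketch: deduplicate each array before calling get_int_uni2, since the
    # arithmetic |U| = |I| + |J| - |intersection| assumes unique elements.
    sub get_int_uni2_dedup {
        my ($in, $jn) = @_;
        my %seen;
        my @i = grep { !$seen{$_}++ } @$in;
        %seen = ();
        my @j = grep { !$seen{$_}++ } @$jn;
        return get_int_uni2( \@i, \@j );
    }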
    
      If you can maintain sorted structures, which is painful in Perl, then perhaps you can speed up more.
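
      For example (a sketch, assuming both arrays are already sorted and duplicate-free), a classic two-pointer merge counts the intersection in a single pass with no hash at all:

      # Sketch: count intersection and union of two sorted, duplicate-free
      # arrays with one linear merge pass (no hashes, no extra storage).
      sub get_int_uni_sorted {
          my ($in, $jn) = @_;
          my ($i, $j, $int) = (0, 0, 0);
          while ($i < @$in && $j < @$jn) {
              if    ($in->[$i] < $jn->[$j]) { $i++ }
              elsif ($in->[$i] > $jn->[$j]) { $j++ }
              else                          { $int++; $i++; $j++ }
          }
          return ($int, @$in + @$jn - $int);
      }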
Re: Fast, Efficient Union and Intersection on arrays
by JavaFan (Canon) on Nov 20, 2008 at 21:43 UTC
    Is there a VERY fast (maybe implemented in C) way to calculate intersection and union on two arrays?
    You may want to have a look at PDL. And, no doubt, the fastest routines were probably written in the early '60s, in FORTRAN. But I wouldn't know how to link FORTRAN routines to perl.
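
    Something along these lines, perhaps (an untested sketch; I'm assuming PDL::Primitive's uniq and setops behave as documented, with setops expecting sorted, unique piddles):

    use PDL;    # uniq() and setops() come from PDL::Primitive, pulled in by PDL

    # Sketch: intersection/union counts via PDL set operations.
    sub get_int_uni_pdl {
        my ($in, $jn) = @_;
        my $x = uniq( pdl( @$in ) );    # sorted, unique elements
        my $y = uniq( pdl( @$jn ) );
        my $int = setops( $x, 'AND', $y )->nelem;
        my $uni = setops( $x, 'OR',  $y )->nelem;
        return ( $int, $uni );
    }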
Re: Fast, Efficient Union and Intersection on arrays
by ptoulis (Scribe) on Nov 20, 2008 at 22:44 UTC
    That was an excellent exercise, barvin. As always, the speed of the algorithm depends on the nature of your problem (the input).

    All the algorithms presented above theoretically have a running time of O(m+n), where m and n are the sizes of the lists. However, the great ideas above differ considerably in implementation with respect to the Perl internals, and most are huge space consumers.

    In terms of speed, the best algorithm is by far BrowserUK's (the buk() function). On my Windows machine it is insanely fast compared to the others. However, it has some important weaknesses. First, it can't deal with non-numeric data, since vec needs a numeric offset and buk() uses the list elements themselves as offsets. Second, it fails on arrays with very large numbers as elements; e.g. if you set @a=(1,2) and @b=(3,4,10000000000) the program will die with the message Negative offset to vec in lvalue context at ....

    On the other hand, the other algorithms do not have the above weaknesses (they can deal with large numbers and handle non-numeric data), but they are much slower and also fail when the arrays have duplicate elements.

    In summary, I guess there is no absolute truth when we speak about algorithms. The algorithms here run in O(m+n), which means they are affected by the larger array. To beat this, use some higher-order data structure, like a heap. I didn't test it, but if l=min(m,n) and r=max(m,n), you could reach O(l*log(r)). You pay extra overhead when inserting (or removing) values, but search is lightning fast.
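
    A rough sketch of that idea (assuming both arrays are kept sorted and duplicate-free): binary-search every element of the smaller array in the larger one, so the total work is l searches costing log(r) each:

    # Sketch: O(l*log(r)) intersection count for two sorted, duplicate-free
    # arrays, where l is the size of the smaller array and r of the larger.
    sub get_int_uni_binsearch {
        my ($in, $jn) = @_;
        my ($small, $large) = @$in <= @$jn ? ($in, $jn) : ($jn, $in);
        my $int = 0;
        for my $x (@$small) {
            my ($lo, $hi) = (0, $#$large);
            while ($lo <= $hi) {
                my $mid = int( ($lo + $hi) / 2 );
                if    ($large->[$mid] < $x) { $lo = $mid + 1 }
                elsif ($large->[$mid] > $x) { $hi = $mid - 1 }
                else                        { $int++; last }
            }
        }
        return ($int, @$in + @$jn - $int);
    }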
      All the algorithms presented above have theoretically a running time of O(m+n) where m,n are the sizes of the lists.

      Not true. Since BrowserUK's algorithm needs to allocate and initialize a string with max(@a, @b) bits, it has O(n + m + max(@a) + max(@b)) run time (you mentioned another problem that arises from large numbers, but not in terms of complexity).

        Whilst I agree that the algorithm is limited to numerics (the triumph of the specific--the OP's "positive integers"--over the generic), I disagree with your big-O assessment.

        It is perfectly feasible to pre-allocate the bitstrings to their largest possible size (2^32/8 = 512MB each), which is easily and relatively quickly achieved on a fully populated 32-bit machine. More importantly, it only needs to be done once regardless of the size of the input arrays, so it is a constant factor of O(2), and is disregarded by big-O.

        Of course, allocating huge chunks of RAM for small input arrays means that the constant factor would be proportionally quite large, but that's the limitation of big-O.

        Conversely, the hashes used by the other algorithms grow by a process of repeated reallocation in geometrically increasing chunks, which means the time cost grows geometrically with the size of the input arrays. In addition, the grep solutions have to build stack-bound lists from each of the arrays.

        So, if you are going to try to account for the memory allocation in the assessment of the algorithms, then you would have to come up with something like:

        O( M+N + SIGMA( 2^3+2^4+...+2^n>N ) + SIGMA( 2^3+...+2^m>M ) + MIN(N,M) )
                 memory for 1st hash          memory for 2nd hash     memory for 1 array*

        ## (*assuming you chose the smallest array to use with the 2 hash/1 list method)

        or

        O( M+N + SIGMA( 2^3+2^4+...+2^n>N ) + N + M )
                 memory for smallest hash     memory for the two arrays

        ## for the 1 hash/2 arrays method

        All of which reinforces my stance that big-O is rarely done properly outside of academic papers, and is of little use unless it is. A benchmark is more instructive. This run uses arrays of 1 million elements apiece, with the numbers in the range 0 .. 2^27:

        C:\test>unionIntersect.pl -N=1e6 -MAX=27
               s/iter barvin   roy   buk  buk2
        barvin   8.58     --  -43%  -58%  -64%
        roy      4.86    77%    --  -26%  -37%
        buk      3.59   139%   35%    --  -15%
        buk2     3.05   182%   60%   18%    --

        I couldn't go higher than 2^27 because of the need to have enough memory in the process to accommodate the large hashes and lists created by the other algorithms as well. Of course, you do have to go to quite large input arrays before that constant factor for allocating the bitstrings amortises:

        C:\test>unionIntersect.pl -N=5e5 -MAX=27
               s/iter barvin   buk   roy  buk2
        barvin   3.87     --  -28%  -40%  -50%
        buk      2.80    39%    --  -17%  -30%
        roy      2.33    66%   20%    --  -16%
        buk2     1.95    98%   43%   19%    --

        BTW: buk2 simply uses Roy Johnson's trick of only counting the bits for one result and deriving the other number with arithmetic:

        sub buk2 {
            my( $aRef, $bRef ) = @_;
            my( $aBits, $bBits );
            for ( $aBits, $bBits ) {
                local $\;
                open my $ram, '>', \$_;
                seek $ram, $MAX-1, 0;
                print $ram chr(0);
                close $ram;
            }
            vec( $aBits, $_, 1 ) = 1 for @$aRef;
            vec( $bBits, $_, 1 ) = 1 for @$bRef;
            my $iCount = unpack '%32b*', ( $aBits & $bBits );
            my $uCount = @$aRef + @$bRef - $iCount;
            $aBits = $bBits = '';
            print 'buk2: ', $iCount, ' : ', $uCount if $SHOW;
            return( $iCount, $uCount );
        }

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fast, Efficient Union and Intersection on arrays
by ptoulis (Scribe) on Nov 21, 2008 at 21:26 UTC
    I am dumping my own code here, which finds the union/intersection of arrays in place. It uses mergesort to order the lists (the Heap module from CPAN is of low quality, so I had to use Perl's built-in mergesort for sorting) and then divide-and-conquer to find the intersection and union. It is about 2 times slower than get_int_uni2, but it is more space-efficient because it does not allocate any more memory than the input arrays (of course, I can't compare it with buk() on that score). For arrays of 500,000 elements it takes 4-5 seconds on my machine, and RAM consumption is 50MB, which is what Perl allocates for the initial arrays (plus, of course, the runtime).

    I didn't try to optimize it much; perhaps someone could throw in an idea here. Sorry for the big file, but I have some comments inline that might help anyone interested.

    use warnings;
    use strict;
    use List::Util 'shuffle';
    use sort '_mergesort';

    # create the arrays
    !$ARGV[0] and print "No input for array size" and exit;
    my @array1 = shuffle 1..$ARGV[0];
    my @array2 = shuffle 1..$ARGV[0];
    my @sortedA = sort { $a <=> $b } @array1;
    my @sortedB = sort { $a <=> $b } @array2;

    our $pivot;    # used in binary search..

    my ($uni, $insc) = p_uni(0, $#sortedA, 0, $#sortedB);
    print "\n|U| = ", $uni, " |I|=", $insc;

    #>>>>> UNION and INTERSECTION <<<<<<<<
    # The function p_uni() will return an ARRAY for which
    #   ARRAY[0] = cardinality of union
    #   ARRAY[1] = cardinality of intersection
    sub p_uni {
        # The function processes 2 ordered lists, say A and B, as follows:
        # we split A = A1+A2 and B = B1+B2 around the middle.
        # Since they are ordered we know that INTERSECTION(A1,B2)=0 and
        # INTERSECTION(A2,B1)=0, so we process them independently...
        # The split array Ai is A[aLeft,aRight] - a subset of A with those
        # indices. The same holds for bLeft and bRight.
        my ($aLeft, $aRight, $bLeft, $bRight) = @_;
        my ($tmp_uni, $tmp_insc) = (0) x 2;

        # If the A array is too small then run a linear search and match
        if ($aRight - $aLeft < 100) {
            my $search_result;
            for (my $i = $aLeft, my $bLeft_h = $bLeft; $i <= $aRight; $i++) {
                $search_result = bin_search(\@sortedB, $sortedA[$i], $bLeft_h, $bRight);
                $tmp_uni++;
                $tmp_insc++ if ($search_result->[0]);
                $bLeft_h = $search_result->[1];
            }
            return ($aRight-$aLeft+1 - $tmp_insc + $bRight-$bLeft+1, $tmp_insc);
        }
        # If the B array is too small run a linear search and match
        elsif ($bRight - $bLeft < 100) {
            my ($tmp_uni, $tmp_insc) = (0) x 2;
            my $search_result;
            for my $i ($bLeft..$bRight) {
                if (defined($sortedB[$i-1]) && $sortedB[$i-1] != $sortedB[$i]) {
                    my $search_result = bin_search(\@sortedA, $sortedB[$i], $aLeft, $aRight);
                    if ($search_result->[0]) { $tmp_uni++; $tmp_insc++; }
                    else                     { $tmp_uni += 1; }
                    $aLeft = $search_result->[1];
                }
            }
            return ($aRight-$aLeft+1 - $tmp_insc + $bRight-$bLeft+1, $tmp_insc);
        }

        my $bPos = int(($bLeft+$bRight)/2);
        my $aPos = bin_search(\@sortedA, $sortedB[$bPos], $aLeft, $aRight);
        my ($a_insc, $a_uni) = p_uni($aLeft, $aPos->[1], $bLeft, $bPos);
        my ($b_insc, $b_uni) = p_uni($aPos->[1]+1, $aRight, $bPos+1, $bRight);
        return ($a_insc+$b_insc, $a_uni+$b_uni);
    }

    # This is a binary search - searches the ordered list reference AR_REF
    # for WHAT, between and including indices cLEFT and cRIGHT..
    sub bin_search {
        my ($ar_ref, $what, $cLeft, $cRight) = @_;
        my ($left, $right, $value) = ($cLeft, $cRight, 0);
        while ($left <= $right) {
            $pivot = int(($left+$right)/2);
            $value = $ar_ref->[$pivot];
            if    ($value > $what) { $right = $pivot - 1; }
            elsif ($value < $what) { $left  = $pivot + 1; }
            else                   { return [1, $pivot]; }
        }
        return [0, $pivot];
    }
