Re: Comparing two arrays
by BrowserUk (Patriarch) on Dec 15, 2013 at 12:16 UTC
|
My end result is not to know, for each (x,y) array how many 1's they share just to know what are the top 10 y arrays that share the most 1' with each x array.
Convert your arrays of 0s 1s to bit-strings, then use bitwise-& and unpack '%32b*' to count the equivalences and you can do this 300+ times faster than comparing the arrays:
#! perl -slw
use strict;
use Benchmark qw[ cmpthese ];
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 500;
our $I //= -1;
our $N //= 1000;
our @xArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;
our @yArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;
our @xStrings = map{ join '', @$_ } @xArrays;
our @yStrings = map{ join '', @$_ } @yArrays;
our @xBits = map{ pack 'b*', $_ } @xStrings;
our @yBits = map{ pack 'b*', $_ } @yStrings;
cmpthese $I, {
array => q[
my %top10s;
for my $x ( 0 .. $#xArrays ) {
for my $y ( 0 .. $#yArrays ) {
my $count = 0;
$xArrays[$x][$_] == 1 && $yArrays[$y][$_] == 1 and ++$
+count for 0 .. $#{ $xArrays[ 0 ] };
$top10s{"$x:$y"} = $count;
my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } ke
+ys %top10s )[ 0 ];
keys( %top10s ) > 10 and delete $top10s{$discard};
}
}
$I == 1 and pp ' arrays: ', %top10s;
],
strings => q[
my %top10s;
for my $x ( 0 .. $#xStrings ) {
for my $y ( 0 .. $#yStrings ) {
my $count = ( $xStrings[$x] & $yStrings[$y] ) =~ tr[1]
+[];
$top10s{"$x:$y"} = $count;
my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } ke
+ys %top10s )[ 0 ];
keys( %top10s ) > 10 and delete $top10s{$discard};
}
}
$I == 1 and pp 'strings: ', %top10s;
],
bits => q[
my %top10s;
for my $x ( 0 .. $#xBits ) {
for my $y ( 0 .. $#yBits ) {
my $count = unpack '%32b*', ( $xBits[$x] & $yBits[$y]
+);
$top10s{"$x:$y"} = $count;
my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } ke
+ys %top10s )[ 0 ];
keys( %top10s ) > 10 and delete $top10s{$discard};
}
}
$I == 1 and pp ' bits: ', %top10s;
],
};
__END__
C:\test>1067218 -N=100
Rate array strings bits
array 1.95e-002/s -- -98% -100%
strings 1.08/s 5417% -- -82%
bits 5.97/s 30510% 455% --
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
thank you so much for the code and the benchmark, after seeing this i'll try to implement the strategy. However what i'm wondering now is where does the speed come from. When I search for a certain bit in a bit-string I remember reading somewhere that the bit is found by iterating through the memory block whereas accessing an array element is constant. is it possible that these constants are so large that it is cheaper to linearly scan through memory blocks or did i mixed up something (Which is probably the case). Could you please educate me a "bit" :)
Thank you
baxy
| [reply] |
|
i'm wondering now is where does the speed come from.
Perhaps the simplest way to demonstrate the difference is to look at the number of opcodes generated in order to compare and count two sets of 64 bits stored as: two arrays; two strings of ascii 1s and 0s; two bitstrings of 64 bits each. You don't need to understand the opcodes to see the reduction.
Moving as much of the work (looping) into the optimised, compiled-C, opcodes just saves huge swaths of time and processor:
- Arrays:
C:\test>perl -MO=Terse -E"@a=map{int rand 2}1..64;@b=map{int rand 2}1.
+.64; for my$a(@a){ for my $b(@b){ $a==$b and ++$count }}"
LISTOP (0x34e7c58) leave [1]
OP (0x34eec40) enter
COP (0x34e7c98) nextstate
BINOP (0x34e7d00) aassign [9]
UNOP (0x34e7d70) null [142]
OP (0x34e7d40) pushmark
LOGOP (0x34e7e90) mapwhile [8]
LISTOP (0x34e7f00) mapstart
OP (0x34e7ed0) pushmark
UNOP (0x34e7e58) null
UNOP (0x34e7f40) null
LISTOP (0x34e80d0) scope
OP (0x34e8110) null [177]
UNOP (0x34e8178) int [4]
UNOP (0x34e81b0) rand [3]
SVOP (0x34e81e8) const [7] IV
+(0x33cca88) 2
UNOP (0x34e7f78) rv2av
SVOP (0x34e7e20) const [26] AV (0x33c7570)
UNOP (0x34e7de0) null [142]
OP (0x34e7db0) pushmark
UNOP (0x34e8220) rv2av [2]
PADOP (0x34e8258) gv GV (0xa76c8) *a
COP (0x34e7660) nextstate
BINOP (0x34e76c8) aassign [18]
UNOP (0x34e7738) null [142]
OP (0x34e7708) pushmark
LOGOP (0x34e7858) mapwhile [17]
LISTOP (0x34e78c8) mapstart
OP (0x34e7898) pushmark
UNOP (0x34e7820) null
UNOP (0x34e7908) null
LISTOP (0x34e7a98) scope
OP (0x34e7ad8) null [177]
UNOP (0x34e7b40) int [13]
UNOP (0x34e7b78) rand [12]
SVOP (0x34e7bb0) const [16] IV
+ (0x33c6e30) 2
UNOP (0x34e7940) rv2av
SVOP (0x34e77e8) const [27] AV (0x33c6830)
UNOP (0x34e77a8) null [142]
OP (0x34e7778) pushmark
UNOP (0x34e7be8) rv2av [11]
PADOP (0x34e7c20) gv GV (0x33c6f40) *b
COP (0x34eecb0) nextstate
BINOP (0x34eed18) leaveloop
LOOP (0x34eee30) enteriter [19]
OP (0x34eee88) null [3]
UNOP (0x34eef28) null [142]
OP (0x34eeef8) pushmark
UNOP (0x34ef568) rv2av [21]
PADOP (0x34e75b8) gv GV (0xa76c8) *a
UNOP (0x34eed58) null
LOGOP (0x34eed90) and
OP (0x34eee00) iter
LISTOP (0x34eef68) lineseq
COP (0x34eefa8) nextstate
BINOP (0x34ef010) leaveloop
LOOP (0x34ef128) enteriter [22]
OP (0x34ef180) null [3]
UNOP (0x34ef220) null [142]
OP (0x34ef1f0) pushmark
UNOP (0x34ef4c8) rv2av [24]
PADOP (0x34ef500) gv GV (0x33c6f4
+0) *b
UNOP (0x34ef050) null
LOGOP (0x34ef088) and
OP (0x34ef0f8) iter
LISTOP (0x34ef260) lineseq
COP (0x34ef2a0) nextstate
UNOP (0x34ef308) null
LOGOP (0x34ef340) and
BINOP (0x34ef428) eq
OP (0x34ef498) padsv [
+19]
OP (0x34ef468) padsv [
+22]
UNOP (0x34ef380) preinc
UNOP (0x34ef3b8) null
+[15]
PADOP (0x34ef3f0)
+gvsv GV (0x33c5ed0) *count
OP (0x34ef0c8) unstack
OP (0x34eedd0) unstack
-e syntax OK
- Strings:
C:\test>perl -MO=Terse -E"$a=join'',map{int rand 2}1..64;@b=map{int ra
+nd 2}1..64; $count=($a&$b)=~tr[1][]"
LISTOP (0x3447bc0) leave [1]
OP (0x344f178) enter
COP (0x3447c00) nextstate
BINOP (0x3447c68) sassign
LISTOP (0x3447cd8) join [8]
OP (0x3447ca8) pushmark
SVOP (0x3448118) const [22] PV (0x332ca20) ""
LOGOP (0x3447d88) mapwhile [7]
LISTOP (0x3447df8) mapstart
OP (0x3447dc8) pushmark
UNOP (0x3447d50) null
UNOP (0x3447e38) null
LISTOP (0x3447fc8) scope
OP (0x3448008) null [177]
UNOP (0x3448070) int [3]
UNOP (0x34480a8) rand [2]
SVOP (0x34480e0) const [6] IV
+(0x332cb58) 2
UNOP (0x3447e70) rv2av
SVOP (0x3447d18) const [23] AV (0x3327640)
UNOP (0x3448150) null [15]
PADOP (0x3448188) gvsv GV (0xa76a8) *a
COP (0x34475c8) nextstate
BINOP (0x3447630) aassign [17]
UNOP (0x34476a0) null [142]
OP (0x3447670) pushmark
LOGOP (0x34477c0) mapwhile [16]
LISTOP (0x3447830) mapstart
OP (0x3447800) pushmark
UNOP (0x3447788) null
UNOP (0x3447870) null
LISTOP (0x3447a00) scope
OP (0x3447a40) null [177]
UNOP (0x3447aa8) int [12]
UNOP (0x3447ae0) rand [11]
SVOP (0x3447b18) const [15] IV
+ (0x3326f00) 2
UNOP (0x34478a8) rv2av
SVOP (0x3447750) const [24] AV (0x3326900)
UNOP (0x3447710) null [142]
OP (0x34476e0) pushmark
UNOP (0x3447b50) rv2av [10]
PADOP (0x3447b88) gv GV (0x3327010) *b
COP (0x344f1e8) nextstate
BINOP (0x344f250) sassign
UNOP (0x344f290) null
BINOP (0x344f3e8) bit_and [21]
UNOP (0x344f498) null [15]
PADOP (0x34474e0) gvsv GV (0xa76a8) *a
UNOP (0x344f428) null [15]
PADOP (0x344f460) gvsv GV (0x3327010) *b
PVOP (0x344f3b0) trans
UNOP (0x3447518) null [15]
PADOP (0x3447550) gvsv GV (0x33262d0) *count
-e syntax OK
- Bits:
C:\test>perl -MO=Terse -E"$a=int rand 2**64;$b=int rand 2**64; $count
+= unpack '%32b*', $a & $b"
LISTOP (0x33e7460) leave [1]
OP (0x33e6e60) enter
COP (0x33e74a0) nextstate
BINOP (0x33e7508) sassign
UNOP (0x33e7548) int [4]
UNOP (0x33e7580) rand [3]
SVOP (0x33e75b8) const [13] NV (0x32ca498) 1.844674407
+37096e+019
UNOP (0x33e76a0) null [15]
PADOP (0x33e76d8) gvsv GV (0x107668) *a
COP (0x33e71f0) nextstate
BINOP (0x33e7258) sassign
UNOP (0x33e7298) int [8]
UNOP (0x33e72d0) rand [7]
SVOP (0x33e7308) const [14] NV (0x32ca5a0) 1.844674407
+37096e+019
UNOP (0x33e73f0) null [15]
PADOP (0x33e7428) gvsv GV (0x32ca510) *b
COP (0x33e6ed0) nextstate
BINOP (0x33e6f38) sassign
LISTOP (0x33e6fa8) unpack
OP (0x33e6f78) null [3]
SVOP (0x33e7108) const [15] PV (0x32ca600) "%32b*"
BINOP (0x33e6fe8) bit_and [12]
UNOP (0x33e7098) null [15]
PADOP (0x33e70d0) gvsv GV (0x107668) *a
UNOP (0x33e7028) null [15]
PADOP (0x33e7060) gvsv GV (0x32ca510) *b
UNOP (0x33e7140) null [15]
PADOP (0x33e7178) gvsv GV (0x32ca5d0) *count
-e syntax OK
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
use strict;
use warnings;
use Benchmark 'cmpthese';
sub create { map {rand() < $_[1] ? 1 : 0} 1..$_[0] }
sub compare2a { # first find 1s in x, then check in ys
my $x = shift;
my $n = shift;
my @nxs = grep { $x->[$_] } 0..$n-1;
return map { scalar grep {$_} @{$_}[@nxs] } @_;
}
sub compare4 { # bitstrings
my $x = shift;
$x = pack 'b*', join '', @$x;
return map { unpack '%32b*', ( $x & pack 'b*', join'',@$_ ) } @_;
}
my $n = 15000;
my $p = 0.005;
my $ny = 10;
my @x = create $n, $p;
my @ys = map { [ create $n, $p ] } 1..$ny;
my @r2a = compare2a \@x, $n, @ys;
my @r4 = compare4 \@x, @ys;
print "compare2a: @r2a\n";
print "compare4: @r4\n";
cmpthese( -5, {
compare2a => sub{ compare2a \@x, $n, @ys },
compare4 => sub{ compare4 \@x, @ys },
}
);
Result:
Rate compare4 compare2a
compare4 246/s -- -55%
compare2a 543/s 120% --
| [reply] [d/l] [select] |
|
Re: Comparing two arrays
by hdb (Monsignor) on Dec 15, 2013 at 10:38 UTC
|
How about turning the arrays into strings and then using boolean & on them, using tr to count the 1s?
my @x = qw( 0 0 0 0 1 1 0 0 0 0 1 0 );
my @y = qw( 1 0 0 0 1 0 1 0 1 0 1 0 );
my $x = join '', @x;
my $y = join '', @y;
my $compare = $x & $y;
my $joint = $compare =~ tr/1/1/;
print "$joint\n";
Can you throw this into your benchmarking? @x obviously would need to be turned into a string only once. | [reply] [d/l] [select] |
|
| [reply] |
|
use strict;
use warnings;
use Benchmark 'cmpthese';
sub create { map {rand() < $_[1] ? 1 : 0} 1..$_[0] }
sub compare1 { # naive
my $x = shift;
my $n = shift;
return map { my $y=$_; scalar grep { $y->[$_] == 1 and $x->[$_] == 1
+ } 0..$n-1 } @_;
}
sub compare2 { # first find 1s in x, then check in ys
my $x = shift;
my $n = shift;
my @nxs = grep { $x->[$_] } 0..$n-1;
return map { my $y=$_; scalar grep { $y->[$_] == 1 } @nxs } @_ ;
}
sub compare3 { # stringify
my $x = shift;
$x = join '', @$x;
return map { my $j = $x & join'',@$_; $j =~ tr/1/1/ } @_;
}
my $n = 15000;
my $p = 0.005;
my $ny = 10;
my @x = create $n, $p;
my @ys = map { [ create $n, $p ] } 1..$ny;
my @r1 = compare1 \@x, $n, @ys;
my @r2 = compare2 \@x, $n, @ys;
my @r3 = compare3 \@x, @ys;
print "compare1: @r1\n";
print "compare2: @r2\n";
print "compare3: @r3\n";
cmpthese( -5, { compare1 => sub{ compare1 \@x, $n, @ys },
compare2 => sub{ compare2 \@x, $n, @ys },
compare3 => sub{ compare3 \@x, @ys },
}
);
Results:
Rate compare1 compare3 compare2
compare1 48.8/s -- -81% -91%
compare3 253/s 420% -- -53%
compare2 537/s 1002% 112% --
| [reply] [d/l] [select] |
Re: Comparing two arrays
by Lennotoecom (Pilgrim) on Dec 15, 2013 at 12:13 UTC
|
1.
please tell what exactly you want to see at the end
of your calculations?
@a = 1 0 1
@b = 1 0 0
@c = ?
2.
15000int's long means length of an each array?
15000 elements in each array? | [reply] |
|
oh i missed that part , sorry
@a = 1 0 1
@b = 1 0 0
@c = 1 0 0 # unnecessary as array , just the final count of 1's in arr
+ay @c
what i need is the total number of matched 1's so in the example above i just need number 1 as a result.
"
15000int's long means length of an each array?
15000 elements in each array?
"
yes 15000 integer values in any array that i'm dealing with. and yes 15000 (1's or 0's). not 15000 of just 1's not 15000 of just 0's and not #1's + #0's = 15000. 15000 elements over the alphabet of size 2: array(x or y) = {x | x in A} where A={1,0} | [reply] [d/l] [select] |
Re: Comparing two arrays
by oiskuu (Hermit) on Dec 15, 2013 at 18:57 UTC
|
Essentially then, you have a sparse matrix of booleans. With a density of 1/500, bitvectors will not produce the most compact form.
However, the most compact form is not of essence here. When you search arbitrary data in multiple dimensions or by multiple keys, you'll need to index by each key. In effect you create a number of copies of the data. Here the index (address) of a bit is many times bigger than that one bit itself.
Now, here's the solution I'm thinking of. Make an AoA of the vectors X, bucketed by bit ID. That is, 15k buckets each containing estimated 200 ID's of vectors X. Make an array N[X] that counts hits per vector X. These structures you can keep in memory.
Update2. Reconsidering the above, I realise intermediate storage is unnecessary. However, if you populate 15k buckets with all Y vectors (this time), there will be approx 10k entries per bucket, totaling 4*10k*15k = 600M bytes. Sort each bucket. Matching an X vector then involves scanning ~30 ways *10k entries in a manner that is very similar to merge sort. This should take less than a second... (Inline C)
| [reply] [d/l] [select] |
|
| [reply] [d/l] |
Re: Comparing two arrays
by educated_foo (Vicar) on Dec 15, 2013 at 12:57 UTC
|
In addition to what others have said -- use packed bit-arrays -- you can sort the X arrays first, then do a binary search to find the closest X for each Y (~16 comparisons apiece). | [reply] |
|
| [reply] |
|
Sort by what criteria? "closest" by what criteria?
That obviously depends upon what the OP is trying to do. Maybe he wants lexicographic sorting, or maybe some other mapping; it's not specified in the question, so I didn't guess.
| [reply] |
|
|
|
Re: Comparing two arrays
by Anonymous Monk on Dec 15, 2013 at 21:43 UTC
|
I would try to use a hash to some advantage. For example, total up the sum of the entries and perhaps the index of the leftmost '1', then make that into a hash-key. The essential goal is to minimize the number of arrays that you actually search, no matter how you go about searching or representing them. The hash won't tell you which one (if any) matches, but only entries in that chosen bucket need be considered. | [reply] |
|
The essential goal is to minimize the number of arrays that you actually search, no matter how you go about searching or representing them.
This was my thought too; with that many items to cross-compare, doing various "cheats" to cut down on how many you actually need to search is profitable. Even something as simple as just keeping counts of the number of 1's in each array can give you a real cheap method of asking "is it possible that these two arrays are the same?"
You'll get false positives (probably a lot of them), but you can guarantee no false negatives, so that can chop orders of magnitude off the problem. With a 15,000-element array, you know there's some number between 0 and 15000 1's; if they're evenly distributed (which they won't be even close, but it's a starting point to guess) then you're immediately dropping from "test these 5 million" to "test these 333", which is a huge speedup. Even if the distribution is off by an order or two of magnitude, dropping to 3 or 30,000 comparisons is a giant gain from 5 million.
Then adding some additional cheap to store/compare (maybe not necessarily calculate, but you may be able do this at the same time as the arrays get built for minimal added expense) hints like the leftmost-1-index or rightmost-1-index or longer-run-of-1's or some other thing meaningful to your data set, can chop that down even further. At what point the diminishing returns start hurting you is situation specific; it would take knowing your data and possibly some experimentation to find it.
In essence, these sort of universe-narrowing tricks can be seen as variations on the theme of Bloom filters. A quick glance at cpan shows up some Bloom filter implementations, but it's hard to say whether they're well suited to this particular type/scale of usage. But they may on deeper investigation prove to be so...
| [reply] |
Re: Comparing two arrays
by locked_user sundialsvc4 (Abbot) on Dec 16, 2013 at 13:05 UTC
|
Here’s another way to do it: scan the arrays to convert them to a vector using some form of “run-length encoding.” For instance, a list containing the position of the next '1' and the number of '1's that occur. Now for the magic: use a Digest algorithm of some sort ... could be CRC, SHA1, could be anything ... to convert that vector into a single value that can be used as a hash.
Actually, if you use a truly-decent message digest algorithm, like SHA1, you probably won’t have to do anything else, because one and only one vector will map to the same hash value. You have to preprocess each array one time to compute its message-digest hash, but from there on out you might not have to examine the array contents at all. A comparison of the digests ought to be sufficient. Aside from the overhead of passing through each array one time to calculate its digest, the overhead of laborious comparison ... disappears.
| |
|
sundialsvc4:
If you're looking for duplicate vectors, digesting can greatly reduce the number of comparisons, but it won't let you escape them altogether: http://en.wikipedia.org/wiki/Pigeonhole_principle.
Update: s/While/If you're looking for duplicate vectors/ and changed conjunction so the sentence still reads well.
...roboticus
When your only tool is a hammer, all problems look like your thumb.
| [reply] |
|
While digesting can greatly reduce the number of comparisons
That would only be true if the OP was looking for exact matches. He isn't.
He's looking for the best matches, where 'best' is defined in terms of the number of set bits in matching positions. No hashing, digesting nor sorting approach to this problem is possible.
Every X must be fully compared against every Y.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
|
|
|
|
one and only one vector will map to the same hash value
Do you mean there are no hash collisions for SHA1?
| [reply] |
|
the overhead of laborious comparison ... disappears.
Total garbage.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
A reply falls below the community's threshold of quality. You may see it by logging in.
|
|
| [reply] |