in reply to using hash uniqueness to find the odd man

Please post code if you want comments on its efficiency. This is a short way that uses grep to find both unique and matching elements in two arrays.

Update

no_slogan points out that the nested loops do not scale well, as they slow quadratically, so I have added a hash solution. Just to be consistent this includes grep too :-)

my @a = (0,1,2,3,4,5,6,7,8);
my @b = (0,2,4,6,8,10,12,14,16);
my (@matches, @mismatch);

print "An iterative solution\n";
foreach $a (@a) {
    push @matches, $a if grep { $a == $_ } @b;
}
foreach $a (@a) {
    push @mismatch, $a unless grep { $a == $_ } @b;
}
print "The arrays share: @matches\n";
print "Unique elements in \@a: @mismatch\n";

print "\nA hash solution\n";
@matches = @mismatch = ();
map { $seen{$_}++ } @b;
@matches  = grep { defined $seen{$_} } @a;
@mismatch = grep { !defined $seen{$_} } @a;
print "The arrays share: @matches\n";
print "Unique elements in \@a: @mismatch\n";
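As an aside (my sketch, not part of the original post), a common variation of the hash solution builds the lookup table with a hash slice instead of map, and tests membership with exists rather than defined:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @a = (0, 1, 2, 3, 4, 5, 6, 7, 8);
my @b = (0, 2, 4, 6, 8, 10, 12, 14, 16);

# Hash slice: every element of @b becomes a key with an undef value.
my %seen;
@seen{@b} = ();

# exists checks for the key itself, so the undef values don't matter.
my @matches  = grep { exists $seen{$_} } @a;
my @mismatch = grep { !exists $seen{$_} } @a;

print "The arrays share: @matches\n";        # 0 2 4 6 8
print "Unique elements in \@a: @mismatch\n"; # 1 3 5 7
```

The hash slice avoids the explicit loop over @b, and exists is the more precise test since it distinguishes "key present" from "value defined".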

Always interested to see other ways to do things, so please post your code. Note that although this code is short, in Perl shorter is often slower, so it may be that your method is faster. If it is also shorter it looks like a new round of GOLF :-) If you post some code I will run a Benchmark, or you can do it yourself; see the node Benchmark made easy for how.

Update 2

I have benchmarked the code above. As predicted by no_slogan, the results for the nested loop structure are so embarrassing I am going to hide them behind a readmore. Using a 10,000 record data set, here is what happened....
use Benchmark;

$build = <<'BUILD';
my @a = (0..10000);
my @b = (9000..19000);
BUILD

$iterate = <<'ITERATE';
my @a = (0..10000);
my @b = (9000..19000);
my (@matches, @mismatch);
@matches = @mismatch = ();
foreach $a (@a) {
    push @matches, $a if grep { $a == $_ } @b;
}
foreach $a (@a) {
    push @mismatch, $a unless grep { $a == $_ } @b;
}
ITERATE

$hash = <<'HASH';
my @a = (0..10000);
my @b = (9000..19000);
my (@matches, @mismatch);
@matches = @mismatch = ();
map { $seen{$_}++ } @b;
@matches  = grep { defined $seen{$_} } @a;
@mismatch = grep { !defined $seen{$_} } @a;
HASH

timethese(1, {
    'build'   => $build,
    'iterate' => $iterate,
    'hash'    => $hash,
});

__END__
C:\>perl test.pl
Benchmark: timing 1 iterations of build, hash, iterate...
     build:  -1 wallclock secs (  0.00 usr +  0.00 sys =   0.00 CPU)
             (warning: too few iterations for a reliable count)
      hash:   0 wallclock secs (  0.16 usr +  0.00 sys =   0.16 CPU) @ 6.25/s (n=1)
             (warning: too few iterations for a reliable count)
   iterate: 147 wallclock secs (147.20 usr +  0.00 sys = 147.20 CPU) @ 0.01/s (n=1)
             (warning: too few iterations for a reliable count)
C:\>

Ouch! So iteration is about 1000 times slower than the hash method. And that's on a PIII 800 with no other load. Good lesson here methinks.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Replies are listed 'Best First'.
Re: Re: using hash uniqueness to find the odd man
by no_slogan (Deacon) on Jul 11, 2001 at 05:18 UTC
    Your code runs in O(@a * @b) time because of the grep nested inside the foreach. The hash approach will run in O(A + @b), where A is the time to build the hash %seen out of the values of @a (probably about @a*log(@a)). Hashes should be faster if the data sets are somewhat large.

    If the two lists of DNs are sorted initially, you can use a merge algorithm (as in the unix comm program). That is an optimal solution, because it takes O(@a + @b) time, and there's no way to solve the problem without examining everything in both lists. If you have to sort the lists first, that's more expensive than the hash method.
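The merge approach no_slogan describes can be sketched like this (my sketch, not code from the thread; the subroutine name and variables are my own), assuming both lists are already sorted numerically:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge two sorted lists in one linear pass, as in the unix comm
# program: O(@a + @b). Returns the elements common to both lists
# and the elements found only in the first list.
sub merge_compare {
    my ($a_ref, $b_ref) = @_;
    my (@matches, @mismatch);
    my ($i, $j) = (0, 0);
    while ($i < @$a_ref && $j < @$b_ref) {
        if ($a_ref->[$i] == $b_ref->[$j]) {
            push @matches, $a_ref->[$i];
            $i++; $j++;
        }
        elsif ($a_ref->[$i] < $b_ref->[$j]) {
            push @mismatch, $a_ref->[$i];   # only in @a
            $i++;
        }
        else {
            $j++;                           # only in @b, skip it
        }
    }
    # Anything left over in the first list was never matched.
    push @mismatch, @{$a_ref}[$i .. $#$a_ref];
    return (\@matches, \@mismatch);
}

my @a = (0, 1, 2, 3, 4, 5, 6, 7, 8);
my @b = (0, 2, 4, 6, 8, 10, 12, 14, 16);
my ($matches, $mismatch) = merge_compare(\@a, \@b);
print "The arrays share: @$matches\n";        # 0 2 4 6 8
print "Unique elements in \@a: @$mismatch\n"; # 1 3 5 7
```

Each element of each list is examined exactly once, which is why this is optimal for pre-sorted input; the cost of sorting first is what tips the balance back toward the hash method for unsorted data.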

    Update: Now that I'm getting analysis of algorithms paged back in, I'm thinking A should have been @a in the first paragraph. The hashtable has to expand about log(@a) times, but each time it only has to reconsider the elements that were previously inserted, so it's @a + @a/2 + @a/4 + ... = @a*2, which is proportional to @a.

      Yeah I agree it is not the best solution, I kinda mentioned it might well be slow. But today is grep week so what can you do? As you point out it will slow in a geometric fashion. I have added an update that uses a hash solution - with grep of course :-)

      Update

      Just benchmarked the code on a 10,000 element data set, results not as bad as I feared - the iterative method is only 3 orders of magnitude slower. I think three orders of magnitude sounds better than 1000 times!

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Re: using hash uniqueness to find the odd man
by novitiate (Scribe) on Jul 11, 2001 at 07:16 UTC
    ok, here's the code (i changed to ++). the file handles are just where i am reading the DNs from.
    while (<XCH0S>) {
        @array = split(",", $_);
        $array[2] =~ tr/A-Z/a-z/;
        chomp $array[2];
        $xch0{$array[2]}++;
    }
    while (<XCH3S>) {
        @array2 = split(",", $_);
        $array2[2] =~ tr/A-Z/a-z/;
        chomp $array2[2];
        if (!defined $xch0{$array2[2]}) {
            print "DN of extra record is: $array2[2]";
        }
        else {
            next;
        }
    }

    i tested this against your hash implementation and i got these benchies:
    Benchmark: timing 10000 iterations of my code, their code...
       my code:  1 wallclock secs ( 0.92 usr + 0.06 sys =  0.98 CPU) @ 10193.68/s (n=10000)
    their code:  0 wallclock secs ( 0.13 usr + 0.05 sys =  0.18 CPU) @ 55555.56/s (n=10000)
                (warning: too few iterations for a reliable count)
    C:\network>cmp2.pl
    Benchmark: timing 110000 iterations of my code, their code...
       my code:  3 wallclock secs ( 1.97 usr + 0.51 sys =  2.48 CPU) @ 44301.25/s (n=110000)
    their code:  2 wallclock secs ( 1.63 usr + 0.37 sys =  2.00 CPU) @ 54917.62/s (n=110000)
    C:\network>cmp2.pl
    Benchmark: timing 1000000 iterations of my code, their code...
       my code: 16 wallclock secs (11.07 usr + 4.80 sys = 15.86 CPU) @ 63043.75/s (n=1000000)
    their code: 18 wallclock secs (13.78 usr + 4.59 sys = 18.37 CPU) @ 54445.47/s (n=1000000)
    C:\network>cmp2.pl
    Benchmark: timing 2000000 iterations of my code, their code...
       my code: 33 wallclock secs (21.50 usr + 9.36 sys = 30.86 CPU) @ 64800.41/s (n=2000000)
    their code: 40 wallclock secs (28.05 usr + 8.83 sys = 36.88 CPU) @ 54226.99/s (n=2000000)


    UPDATE: thank you tachyon for pointing out those typos; i was working from two different versions, sorry for the confusion.


    humbly,
    novitiate

    "...goodnight you princes of main(e)"  --The Cider House Rules

      Looks about the same, as expected, since the hash methodology is the same. However, the code you have posted does not actually do much. I presume these are typos, but if you tested this code the results don't mean much, I'm afraid. In the second while loop you set up @array2 but then refer to @array, which, because you do not lexically scope it with 'my', will still hold the last value set in the first while loop. You also never set $dn, so it will always be null. BTW, you don't need the else/next at the end; it gains you nothing, as you are already at the end of the loop block. I would be interested to see how this code goes if you have time to test it.

      while (<XCH0S>) {
          chomp;
          tr/A-Z/a-z/;
          $xch0{(split ",")[2]}++;
      }
      while (<XCH3S>) {
          chomp;
          tr/A-Z/a-z/;
          my $dn = (split ",")[2];
          print "DN of extra record is: $dn\n" if !defined $xch0{$dn};
      }

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print