novitiate has asked for the wisdom of the Perl Monks concerning the following question:

I had a problem where I had two lists of DNs (i.e. /o=org/ou=org - etc.) where listX was one record
longer than listY. I needed to nail down the one that was absent from listY and present in listX, so I wrote a script
to populate %hash using the DNs from listY as the keys and then checked each DN from listX to see if it was defined as
a key in %hash, thus producing the culprit when the test failed. My question is: is this efficient? Is there a
better algorithm that is as easy to code?

Consideration: no non-core modules, and we are dealing with only about 12,000 records.
Finally, for future reference, is there a name for the technique I used?

humbly,
novitiate

"...goodnight you princes of main(e)"  --The Cider House Rules

Replies are listed 'Best First'.
Re: using hash uniqueness to find the odd man
by japhy (Canon) on Jul 11, 2001 at 06:44 UTC
    I have a potentially super-fast mechanism. Hash slices:
    my %seen;
    @seen{@larger_list} = ();
    delete @seen{@smaller_list};
    my @culprits = keys %seen;
    However, if you have the data in files to begin with, it's probably best NOT to put them in arrays.
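    For example, a rough sketch of doing it straight from the files (assuming one DN per line; the filenames here are just placeholders):
    my %seen;
    open BIG,   "listX.txt" or die $!;
    while (<BIG>)   { chomp; $seen{$_} = undef }     # mark every DN from the longer list
    close BIG;
    open SMALL, "listY.txt" or die $!;
    while (<SMALL>) { chomp; delete $seen{$_} }      # knock out each DN from the shorter list
    close SMALL;
    print "odd man out: $_\n" for keys %seen;        # whatever survives is the culprit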

    japhy -- Perl and Regex Hacker
Re: using hash uniqueness to find the odd man
by blakem (Monsignor) on Jul 11, 2001 at 04:32 UTC
    Without seeing your code, it's tough to know if there is a "better algorithm." However, what you describe sounds like a very perlish solution. Assuming you have two arrays (@a and @b) and you know that one of them (@b) has an extra element, one solution would be:
    @a = (1,2,4);
    @b = (1,2,3,4);
    my %seen;
    for (@a) {
        $seen{$_}++;
    }
    for (@b) {
        print "$_ was in \@b but not in \@a\n" unless $seen{$_};
    }

    Oh, I believe that this is called finding the difference of the arrays.

    -Blake

      My code is roughly the same -
      s/$seen{$_}++/$seen{$_} = 1;/
      almost the same, right?

      humbly,
      novitiate

      "...goodnight you princes of main(e)"  --The Cider House Rules
        I would recommend using $seen{$_}++ over $seen{$_}=1 because with the former you have additional information, such as how many times you have seen that particular item. This can be useful for detecting duplicate keys, for example.

        As such, these two are perhaps functionally the same in a narrow set of circumstances. The ++ approach is much more adaptable, so it should be used in preference to simple assignment.
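        A small illustration of the extra information the count carries, using made-up data:
        my @dns = ('cn=a', 'cn=b', 'cn=a', 'cn=c');
        my %seen;
        $seen{$_}++ for @dns;
        my @dupes = grep { $seen{$_} > 1 } keys %seen;   # values > 1 mark duplicates
        print "duplicated: @dupes\n";                    # prints: duplicated: cn=a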
Re: using hash uniqueness to find the odd man
by tachyon (Chancellor) on Jul 11, 2001 at 05:08 UTC

    Please post code if you want comments on its efficiency. This is a short way that uses grep to find both unique and matching elements in two arrays.

    Update

    no_slogan points out that the embedded loops do not scale well, as they slow geometrically, so I have added a hash solution. Just to be consistent, this includes grep too :-)

    my @a = (0,1,2,3,4,5, 6, 7, 8 );
    my @b = (0,2,4,6,8,10,12,14,16);
    my (@matches,@mismatch);

    print "An iterative solution\n";
    foreach $a (@a) {
        push @matches, $a if grep{$a == $_}@b;
    }
    foreach $a (@a) {
        push @mismatch, $a unless grep{$a == $_}@b;
    }
    print "The arrays share: @matches\n";
    print "Unique elements in \@a: @mismatch\n";

    print"\nA hash solution\n";
    @matches = @mismatch = ();
    map{$seen{$_}++}@b;
    @matches  = grep{defined $seen{$_}}@a;
    @mismatch = grep{!defined $seen{$_}}@a;
    print "The arrays share: @matches\n";
    print "Unique elements in \@a: @mismatch\n";

    Always interested to see other ways to do things, so please post your code. Note that although this code is short, in perl shorter is often slower, so it may be that your method is faster. If it is also shorter, it looks like a new round of GOLF :-) If you post some code I will run a Benchmark, or you can do it yourself - see the node Benchmark made easy for how.
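    For reference, a bare-bones way to set up such a comparison with the core Benchmark module (the subs and the test data here are placeholders, not the actual DN lists):
    use Benchmark qw(cmpthese);

    my @a = (1 .. 1_000);
    my @b = (1 .. 1_001);

    cmpthese(-1, {
        nested_grep => sub {
            my @mismatch = grep { my $x = $_; !grep { $x == $_ } @a } @b;
        },
        hash_lookup => sub {
            my %seen;
            $seen{$_}++ for @a;
            my @mismatch = grep { !$seen{$_} } @b;
        },
    });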

    Update 2

    I have benchmarked the code above. As predicted by no_slogan, the results for the embedded loop structure are so embarrassing I am going to hide them behind a readmore. Using a 10,000 record data set, here is what happened....
      Your code runs in O(@a * @b) time because of the grep nested inside the foreach. The hash approach will run in O(A + @b), where A is the time to build the hash %seen out of the values of @a (probably about @a*log(@a)). Hashes should be faster if the data sets are somewhat large.

      If the two lists of DNs are sorted initially, you can use a merge algorithm (as in the unix comm program). That is an optimal solution, because it takes O(@a + @b) time, and there's no way to solve the problem without examining everything in both lists. If you have to sort the lists first, that's more expensive than the hash method.
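      In Perl, a sketch of that merge might look like this (assuming both lists are already sorted and held in arrays; the extra elements of either list fall out as you walk the two in step):

      sub odd_man_out {
          my ($x, $y) = @_;                # refs to two sorted arrays
          my ($i, $j) = (0, 0);
          my @extra;
          while ($i < @$x && $j < @$y) {
              if    ($x->[$i] lt $y->[$j]) { push @extra, $x->[$i]; $i++ }
              elsif ($x->[$i] gt $y->[$j]) { push @extra, $y->[$j]; $j++ }
              else                         { $i++; $j++ }
          }
          push @extra, @{$x}[$i .. $#$x], @{$y}[$j .. $#$y];   # leftovers
          return @extra;
      }

      my @x = sort qw(a b c d e);
      my @y = sort qw(a b d e);
      print "extra: @{[ odd_man_out(\@x, \@y) ]}\n";           # extra: c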

      Update: Now that I'm getting analysis of algorithms paged back in, I'm thinking A should have been @a in the first paragraph. The hashtable has to expand about log(@a) times, but each time it only has to reconsider the elements that were previously inserted, so it's @a + @a/2 + @a/4 + ... = @a*2, which is proportional to @a.

        Yeah, I agree it is not the best solution - I kinda mentioned it might well be slow. But today is grep week, so what can you do? As you point out, it will slow in a geometric fashion. I have added an update that uses a hash solution - with grep, of course :-)

        Update

        Just benchmarked the code on a 10,000 element data set, results not as bad as I feared - the iterative method is only 3 orders of magnitude slower. I think three orders of magnitude sounds better than 1000 times!

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      OK, here's the code (I changed to ++). The filehandles are just where I am reading the DNs from.
      while (<XCH0S>) {
          @array = split(",", $_);
          $array[2] =~ tr/A-Z/a-z/;
          chomp $array[2];
          $xch0{$array[2]}++;
      }
      while (<XCH3S>) {
          @array2 = split(",", $_);
          $array2[2] =~ tr/A-Z/a-z/;
          chomp $array2[2];
          if (!defined $xch0{$array2[2]}) {
              print "DN of extra record is: $array2[2]";
          }
          else {
              next;
          }
      }

      I tested this against your hash implementation and I got these benchies:
      Benchmark: timing 10000 iterations of my code, their code...
         my code:  1 wallclock secs ( 0.92 usr +  0.06 sys =  0.98 CPU) @ 10193.68/s (n=10000)
      their code:  0 wallclock secs ( 0.13 usr +  0.05 sys =  0.18 CPU) @ 55555.56/s (n=10000)
                   (warning: too few iterations for a reliable count)

      C:\network>cmp2.pl
      Benchmark: timing 110000 iterations of my code, their code...
         my code:  3 wallclock secs ( 1.97 usr +  0.51 sys =  2.48 CPU) @ 44301.25/s (n=110000)
      their code:  2 wallclock secs ( 1.63 usr +  0.37 sys =  2.00 CPU) @ 54917.62/s (n=110000)

      C:\network>cmp2.pl
      Benchmark: timing 1000000 iterations of my code, their code...
         my code: 16 wallclock secs (11.07 usr +  4.80 sys = 15.86 CPU) @ 63043.75/s (n=1000000)
      their code: 18 wallclock secs (13.78 usr +  4.59 sys = 18.37 CPU) @ 54445.47/s (n=1000000)

      C:\network>cmp2.pl
      Benchmark: timing 2000000 iterations of my code, their code...
         my code: 33 wallclock secs (21.50 usr +  9.36 sys = 30.86 CPU) @ 64800.41/s (n=2000000)
      their code: 40 wallclock secs (28.05 usr +  8.83 sys = 36.88 CPU) @ 54226.99/s (n=2000000)


      UPDATE: thank you, tachyon, for pointing out those typos - I was working from two different versions - sorry for the confusion.


      humbly,
      novitiate

      "...goodnight you princes of main(e)"  --The Cider House Rules

        Looks about the same, as expected, since the hash methodology is the same. However, the code you have posted does not actually do much. I presume they are typos, but if you tested this code the results don't mean much, I'm afraid. In the second while loop you set up @array2 but then refer to @array - which, because you do not lexically scope it using 'my', will be set to the last value found in the first while loop. You also do not set $dn, so this will always be null. You don't need the else/next bit at the end, BTW - you gain nothing from it, as you are already at the end of the loop block. I would be interested to see how this code goes if you have time to test it.

        while (<XCH0S>) {
            chomp;
            tr/A-Z/a-z/;
            $xch0{(split",")[2]}++;
        }
        while (<XCH3S>) {
            chomp;
            tr/A-Z/a-z/;
            my $dn = (split",")[2];
            print "DN of extra record is: $dn\n" if !defined $xch0{$dn};
        }

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: using hash uniqueness to find the odd man
by mattr (Curate) on Jul 11, 2001 at 10:04 UTC
    Here are two tests.

    If both files (two files of 12,000 lines of random numbers up to 10e6) are presorted with unix sort (70ms each on a PIII 500MHz), you can narrow in with an approximation method (seek, read a character from each file, compare, and cut the range in half again... not a great algorithm maybe) to see where the byte streams diverge. My Perl program took 10ms to get there, and then I stopped working on it. diff is faster, I think (under 1ms). The Perl overhead for just making those calls (see below) is under 1ms, so two sorts and a diff are about 140ms. This resembles the hash time.

    `sort oddbig  > oddbigsort`;    # these are all backticks
    `sort oddbig2 > oddbigsort2`;
    @ans = `diff oddbigsort oddbigsort2`;
    print "ANS: $ans[1]\n";
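    The bisection idea above, roughly sketched (the file names are the same placeholders as in the backticks; since bytes past the divergence point can still match by coincidence, this only narrows down the region rather than pinpointing the record):
    open X, "oddbigsort"  or die $!;
    open Y, "oddbigsort2" or die $!;
    my $sx = -s X;
    my $sy = -s Y;
    my ($lo, $hi) = (0, $sx < $sy ? $sx : $sy);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($cx, $cy);
        seek X, $mid, 0;  read X, $cx, 1;
        seek Y, $mid, 0;  read Y, $cy, 1;
        if ($cx eq $cy) { $lo = $mid + 1 }   # still identical here, look later
        else            { $hi = $mid }       # already diverged, look earlier
    }
    print "byte streams appear to diverge near offset $lo\n";
    close X;
    close Y;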

    But those hash slices are positively gorgeous. They get my vote!

    Incidentally, japhy's code on the same system took 170 ms; see my implementation below. I wonder if the hash code involves something like a sort.

    #!/usr/bin/perl -w
    use strict;
    use Benchmark;

    my $t = &Benchmark::timeit(1, '
        my %seen;
        my $f1 = "oddbigsort";
        my $f2 = "oddbigsort2";
        open (A, $f1);
        open (B, $f2);
        my @smaller_list = <A>;
        my @larger_list  = <B>;
        if ($#larger_list < $#smaller_list) {
            @seen{@smaller_list} = ();
            delete @seen{@larger_list};
        } else {
            @seen{@larger_list} = ();
            delete @seen{@smaller_list};
        }
        my @culprits = keys %seen;
        foreach (@culprits) { print "$_\n"; }
        close (A);
        close (B);
    ');
    print timestr($t), "\n";