Wally Hartshorn has asked for the wisdom of the Perl Monks concerning the following question:

I have two files. The files contain the keys from two tables (one table per file, about 1500 keys in one, about 2300 keys in the other). I want to locate duplicate keys (keys that are in both files).

I know that I could use the "brute force" method and just loop through the two tables, comparing each key from the first file with each key from the second file, but I suspect there's a more elegant way to do this (perhaps even as a one-liner).

Assuming that I've loaded the two files into two arrays, is there some nifty way to generate the results I'm looking for, or is looping through the two arrays pretty much the best solution?

Wally Hartshorn

(Plug: Visit JavaJunkies, PerlMonks for Java)

Replies are listed 'Best First'.
Re: Finding duplicate keys
by Ovid (Cardinal) on Apr 04, 2003 at 16:18 UTC

    If you've got them in two arrays, don't use a nested loop: that's over 3 million comparisons (about 1500 × 2300). Instead, use a hash for one of them.

    my %search;
    @search{ @keys2 } = undef;

    my @found;
    foreach my $key (@keys1) {
        push @found => $key if exists $search{$key};
    }

    When that's done, @found will have the duplicate keys. You'll only have a few thousand iterations instead of a few million.
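
      For anyone who wants to try the technique end to end, here is a minimal self-contained sketch of the same hash-slice approach (the key lists are made up for illustration):

```perl
# Hypothetical sample key lists standing in for the two files.
my @keys1 = qw( alpha bravo charlie delta );
my @keys2 = qw( bravo delta echo );

# Build an existence hash from one list with a hash slice ...
my %search;
@search{ @keys2 } = undef;

# ... then a single pass over the other list collects the overlap.
my @found;
foreach my $key (@keys1) {
    push @found, $key if exists $search{$key};
}

print "@found\n";    # bravo delta
```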

    Cheers,
    Ovid

    New address of my CGI Course.
    Silence is Evil (feel free to copy and distribute widely - note copyright text)

      I'd want to avoid the incrementalism required by using push this way. At least replace that for() loop with a grep() loop.

      @found = grep exists $search{$_}, @keys1;
Re: Finding duplicate keys
by diotalevi (Canon) on Apr 04, 2003 at 16:20 UTC

    Fixed a bug - replaced the hash value undef() with the key name.

    Given @list_a and @list_b I construct an anonymous hash of @list_a and then immediately extract a slice from that hash using @list_b as the keys to fetch. @dups now contains the intersection of both lists.

    @dups = @{ { map { $_, $_ } @list_a } }{ @list_b };

    # Broken out.

    # Duplicate each key - the key must also be the value since it will be
    # retrieved later. This creates a list useable as a hash.
    map { $_, $_ } @list_a

    # Create an anonymous hash from that list
    { ... }

    # Fetch a slice from a hash ref using @list_b
    @{ ... }{ @list_b }

    # Into dups
    @dups = ...
Re: Finding duplicate keys
by dga (Hermit) on Apr 04, 2003 at 16:21 UTC

    How's this?

    my %keys;
    while (<>) {
        chomp;    # chomp, not chop, so a final line without a newline keeps its last character
        $keys{$_}++;
    }
    foreach my $k ( keys %keys ) {
        print "$k: $keys{$k}\n";
    }

    This will count the times a particular input line was seen and then print them out in no apparent order. If you want an order, add a sort to the foreach -- for example sort { $keys{$b} <=> $keys{$a} } keys %keys to get them in descending order by count of duplicates.

    Update: Also note that this stores only the comparison data and makes only a single pass through the files.
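
    To illustrate the sort dga mentions, a small sketch (the counts here are invented stand-ins for what %keys would hold after reading the files):

```perl
# Hypothetical counts, as %keys would hold after reading the input.
my %keys = ( apple => 3, pear => 1, plum => 2 );

# Descending order by count of duplicates, as suggested above.
for my $k ( sort { $keys{$b} <=> $keys{$a} } keys %keys ) {
    print "$k: $keys{$k}\n";    # apple, then plum, then pear
}
```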

•Re: Finding duplicate keys
by merlyn (Sage) on Apr 04, 2003 at 17:04 UTC
    If you know the keys are unique within each table, one shortcut might be to notice the second time you saw each key in a scan of both:
    my @dups = do {
        my %count;
        grep ++$count{$_} == 2, @keys_from_one, @keys_from_two;
    };

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      This piece of code really has a performance issue, as internally it uses "brute force". I benchmarked it against Ovid's code: this one took 93 seconds where Ovid's took only 6 seconds, for the same array size.
      @keys1 = (1..1000000);    # array names would be modified to match the plug-in code
      for (my $a = 2; $a < 1000000; $a += 2) {
          push @keys2, $a;      # array names would be modified to match the plug-in code
      }
      my $t0 = time();
      # plug the code here
      print time() - $t0;
Re: Finding duplicate keys
by rir (Vicar) on Apr 04, 2003 at 16:26 UTC
    Off hand code:
    while ( <FILE1> ) {
        $key = extract_key_from_line( $_ );
        $file1_keys{$key}++;
    }
    while ( <FILE2> ) {
        $check_me = extract_key_from_line( $_ );
        do_something( $check_me ) if exists $file1_keys{$check_me};
    }
Re: Finding duplicate keys
by antirice (Priest) on Apr 05, 2003 at 02:14 UTC
    A somewhat straightforward solution is to sort both arrays (you said "key", so I suspect order isn't important), keep one index into each, and compare the current elements, advancing as you go. This reduces the number of comparisons to as few as scalar @shorterArray and as many as scalar @shorterArray + scalar @longerArray. As an added bonus, only two extra variables are required and there's no memory overhead for a hash. Anyhow, as my explanations are usually said to be terrible, the following may be what you're looking for:

    Untested code follows:
    my @array1 = sort getArray1;
    my @array2 = sort getArray2;
    my ($i1, $i2) = (0, 0);
    while (defined($array1[$i1]) && defined($array2[$i2])) {
        if ($array1[$i1] eq $array2[$i2]) {
            print "$array1[$i1] is repeated at position array1[$i1] and array2[$i2]$/";
            $i1++;
            $i2++;
        }
        if ($array1[$i1] > $array2[$i2]) {
            $i2++;
        } else {
            $i1++;
        }
    }
    Note that the code assumes that each "key" in the individual arrays are unique. Otherwise, it is possible that it'll skip over duplicated keys in cases such as:
    @array1 = ( "a", "a", ... );
    @array2 = ( "a", "b", ... );

    Results: a is repeated at position array1[0] and array2[0]
    It will catch the first instance where the first a's appear. But, it will not find anything wrong with the a at $array1[1]. Of course, this case can easily be detected.
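
    One way that case could be handled (a hedged sketch, not tested against the code above): after reporting a match, advance each index past any run of equal keys so repeats are consumed too.

```perl
# Small invented inputs; array1 contains a repeated key.
my @array1 = sort qw( a a b c );
my @array2 = sort qw( a b d );
my ($i1, $i2) = (0, 0);
my @dups;
while ($i1 < @array1 && $i2 < @array2) {
    my $cmp = $array1[$i1] cmp $array2[$i2];
    if ($cmp == 0) {
        push @dups, $array1[$i1];
        # Skip past any repeats of this key in either array.
        my $key = $array1[$i1];
        $i1++ while $i1 < @array1 && $array1[$i1] eq $key;
        $i2++ while $i2 < @array2 && $array2[$i2] eq $key;
    }
    elsif ($cmp > 0) { $i2++ }
    else             { $i1++ }
}
print "@dups\n";    # a b
```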

    Updated: a friend thought my explanation was a bit on the terrible side so I added another sentence and reworded the part about comparisons.

    antirice    
    The first rule of Perl club is - use Perl
    The
    ith rule of Perl club is - follow rule i - 1 for i > 1

      Why are you using eq in one comparison and > in the other? Besides, using cmp you can avoid the duplicate comparison altogether. Your looping condition is not robust, either.
      my @array1 = sort <$file1>;
      my @array2 = sort <$file2>;
      my ($l, $r) = (0) x 2;
      while ($l < @array1 and $r < @array2) {
          if (my $b = $array1[$l] cmp $array2[$r]) {
              ++($b == 1 ? $r : $l);
          }
          else {
              print "Dupe: $array1[$l]\n";
              ++$l;
              ++$r;
          }
      }

      Makeshifts last the longest.

        Thanks for pointing out the inadequacies of my code. I didn't think of using cmp as I haven't used it very frequently. :-/

        antirice    
        The first rule of Perl club is - use Perl
        The
        ith rule of Perl club is - follow rule i - 1 for i > 1

Re: Finding duplicate keys
by graff (Chancellor) on Apr 05, 2003 at 02:18 UTC
    If you have two plain text files where one file looks like this:
    key1 key3 key4 key6 ...
    and the other file looks like this:
    key2 key3 key4 key5 ...
    then you might try downloading this script that I posted here under "Code/Utilities". To find the keys that are common to both files, the command line is:
    cmpcol -i file1 file2
    To find the union:
    cmpcol -u file1 file2
    To find the keys that are unique to file1 (or file2):
    cmpcol -x1 file1 file2    # what's uniq to file1
    cmpcol -x2 file1 file2    # what's uniq to file2
    cmpcol -x  file1 file2    # all uniq items, tagged by source -- i.e.:
        key1 <1
        key2 <2
        key5 <2
        key6 <1
    It also handles multi-column data (use "file1:4" to use column 4 of file1 as the key field in that file), and supports perl regex field delimiters (e.g. -d '[\s:;,.]+' to split each line of each file at any combination of whitespace and/or punctuation).
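
    The key-extraction idea behind those options can be sketched in a few lines of plain Perl (the sample line, delimiter pattern, and column number are illustrative only, not cmpcol's actual internals):

```perl
# A made-up input line with fields separated by mixed punctuation and whitespace.
my $line = "alice: 42;engineering, key0007";

# Split on any run of whitespace and/or punctuation,
# in the spirit of -d '[\s:;,.]+' above.
my @fields = split /[\s:;,.]+/, $line;

my $key = $fields[3];    # column 4, counting from 1
print "$key\n";          # key0007
```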

    It has more bells and whistles... I hope it helps.

Re: Finding duplicate keys
by Anonymous Monk on Apr 05, 2003 at 18:07 UTC
    Here is a solution that I use when I have combined all the data into one file.
    #!/usr/bin/perl -w
    use strict;
    use vars qw ($outfile @array);

    open(DATA, $ARGV[0]) or die "Couldn't open $ARGV[0]: $!";
    $outfile = "whatever.txt";
    open(OUT, ">$outfile") || die "Can't open $outfile for creation: $!\n";

    ##############
    ############## This section could be used if one doesn't care about order
    #my %uniq;
    #$uniq{$_}++ while <DATA>;
    #print OUT for keys %uniq;
    ##############
    ##############

    my (@array, %hash);
    while (<DATA>) {
        push (@array, $_) unless (defined($hash{$_}));
        $hash{$_} = 1;
    }
    print OUT join("", @array);
    I know you have many other useful answers but thought I would post it anyways,

    Dr.J
Re: Finding duplicate keys
by Oberon (Monk) on Apr 06, 2003 at 19:58 UTC

    > but I suspect there's a more elegant way to do this (perhaps even as a one-liner).

    Well, not that there's anything wrong with everyone else's responses, but I just thought I'd take a crack at it as a one-liner:

    perl -lne 'exists $h{$_} ? print "$_ in $h{$_} and $ARGV" : ($h{$_} = $ARGV)' file1 file2

    That's only semi-tested, but it seems to work for me. And you could send it as many files as you liked (although it wouldn't be smart enough to say something like "key in file1 and file2 and file3"; it would instead say "key in file1 and file2" and then later "key in file1 and file3").

    This assumes the entire line is the key, which probably isn't the case. Here's a slightly modified version which might, say, catch duplicate uids in two Unix password files:

    perl -F: -lane 'exists $h{$F[2]} ? print "$F[2] in $h{$F[2]} and $ARGV" : ($h{$F[2]} = $ARGV)' passwd1 passwd2
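
    On the multi-file point above: if you did want a single "key in file1 and file2 and file3" report, one sketch (no longer a one-liner) is to collect the set of files per key and report after reading everything. The file contents here are hardcoded for illustration; in real use you'd fill %seen_in from <> using $ARGV, as the one-liners do.

```perl
# Hypothetical file contents standing in for file1..file3.
my %file = (
    file1 => [qw( red green blue )],
    file2 => [qw( green blue )],
    file3 => [qw( blue yellow )],
);

# Record, for every key, the set of files it appears in.
my %seen_in;
for my $name (sort keys %file) {
    $seen_in{$_}{$name} = 1 for @{ $file{$name} };
}

# Report keys seen in more than one file, naming all of them.
for my $key (sort keys %seen_in) {
    my @files = sort keys %{ $seen_in{$key} };
    print "$key in @files\n" if @files > 1;
}
# blue in file1 file2 file3
# green in file1 file2
```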

    I'm sure the wiser monks will jump in if there's some "gotcha" I missed ...