Wally Hartshorn has asked for the wisdom of the Perl Monks concerning the following question:

I have two files. The files contain the keys from two tables (one table per file, about 1500 keys in one, about 2300 keys in the other). I want to locate duplicate keys (keys that are in both files).

I know that I could use the "brute force" method and just loop through the two tables, comparing each key from the first file with each key from the second file, but I suspect there's a more elegant way to do this (perhaps even as a one-liner).

Assuming that I've loaded the two files into two arrays, is there some nifty way to generate the results I'm looking for, or is looping through the two arrays pretty much the best solution?

Wally Hartshorn

(Plug: Visit JavaJunkies, PerlMonks for Java)

Replies are listed 'Best First'.
Re: Finding duplicate keys
by Ovid (Cardinal) on Apr 04, 2003 at 16:18 UTC

    If you've got them in two arrays, don't use a nested loop: that's over 3 million comparisons (about 1500 × 2300). Instead, use a hash for one of them.

    my %search;
    @search{ @keys2 } = undef;

    my @found;
    foreach my $key (@keys1) {
        push @found => $key if exists $search{$key};
    }

    When that's done, @found will have the duplicate keys. You'll only have a few thousand iterations instead of a few million.
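
      For anyone who wants to try the technique end to end, here is a minimal self-contained sketch of the same hash-slice approach (the key lists are made up for illustration):

```perl
# Hypothetical sample key lists standing in for the two files.
my @keys1 = qw( alpha bravo charlie delta );
my @keys2 = qw( bravo delta echo );

# Build an existence hash from one list with a hash slice ...
my %search;
@search{ @keys2 } = undef;

# ... then a single pass over the other list collects the overlap.
my @found;
foreach my $key (@keys1) {
    push @found, $key if exists $search{$key};
}

print "@found\n";    # bravo delta
```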

    Cheers,
    Ovid

    New address of my CGI Course.
    Silence is Evil (feel free to copy and distribute widely - note copyright text)

      I'd want to avoid the incrementalism required by using push this way. At least replace that for() loop with a grep() loop.

      @found = grep exists $search{$_}, @keys1;
Re: Finding duplicate keys
by diotalevi (Canon) on Apr 04, 2003 at 16:20 UTC

    Fixed a bug - replaced the hash value undef() with the key name.

    Given @list_a and @list_b I construct an anonymous hash of @list_a and then immediately extract a slice from that hash using @list_b as the keys to fetch. @dups now contains the intersection of both lists.

    @dups = @{ { map { $_, $_ } @list_a } }{ @list_b };

    # Broken out.

    # Duplicate each key - the key must also be the value since it will be
    # retrieved later. This creates a list useable as a hash.
    map { $_, $_ } @list_a

    # Create an anonymous hash from that list
    { ... }

    # Fetch a slice from a hash ref using @list_b
    @{ ... }{ @list_b }

    # Into dups
    @dups = ...
Re: Finding duplicate keys
by dga (Hermit) on Apr 04, 2003 at 16:21 UTC

    How's this?

    my %keys;
    while (<>) {
        chomp;    # chomp, not chop, so a final line without a newline keeps its last character
        $keys{$_}++;
    }
    foreach my $k ( keys %keys ) {
        print "$k: $keys{$k}\n";
    }

    This will count the times a particular input line was seen and then print them out in no apparent order. If you want an order, add a sort to the foreach -- for example sort { $keys{$b} <=> $keys{$a} } keys %keys to get them in descending order by count of duplicates.

    Update: Also note that this stores only the comparison data and makes only a single pass through the files.
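
    To illustrate the sort dga mentions, a small sketch (the counts here are invented stand-ins for what %keys would hold after reading the files):

```perl
# Hypothetical counts, as %keys would hold after reading the input.
my %keys = ( apple => 3, pear => 1, plum => 2 );

# Descending order by count of duplicates, as suggested above.
for my $k ( sort { $keys{$b} <=> $keys{$a} } keys %keys ) {
    print "$k: $keys{$k}\n";    # apple, then plum, then pear
}
```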

•Re: Finding duplicate keys
by merlyn (Sage) on Apr 04, 2003 at 17:04 UTC
    If you know the keys are unique within each table, one shortcut might be to notice the second time you saw each key in a scan of both:
    my @dups = do {
        my %count;
        grep ++$count{$_} == 2, @keys_from_one, @keys_from_two;
    };

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      This piece of code really has a performance issue, as internally it uses "brute force". I benchmarked it against Ovid's code: this one took 93 seconds where Ovid's took only 6 seconds, for the same array size.
      @keys1 = (1..1000000);    # array names would be modified to match the plug-in code
      for (my $a = 2; $a < 1000000; $a += 2) {
          push @keys2, $a;      # array names would be modified to match the plug-in code
      }
      my $t0 = time();
      # plug the code here
      print time() - $t0;
Re: Finding duplicate keys
by rir (Vicar) on Apr 04, 2003 at 16:26 UTC
    Off hand code:
    while ( <FILE1> ) {
        $key = extract_key_from_line( $_ );
        $file1_keys{$key}++;
    }
    while ( <FILE2> ) {
        $check_me = extract_key_from_line( $_ );
        do_something( $check_me ) if exists $file1_keys{$check_me};
    }
Re: Finding duplicate keys
by antirice (Priest) on Apr 05, 2003 at 02:14 UTC
    A somewhat straightforward solution is to sort both arrays (you said "key", so I suspect order isn't important), keep one index into each, and compare the current elements, advancing as you go. This reduces the number of comparisons to as few as scalar @shorterArray and as many as scalar @shorterArray + scalar @longerArray. As an added bonus, only two extra variables are required and there's no memory overhead for a hash. Anyhow, as my explanations are usually said to be terrible, the following may be what you're looking for:

    Untested code follows:
    my @array1 = sort getArray1;
    my @array2 = sort getArray2;
    my ($i1, $i2) = (0, 0);
    while (defined($array1[$i1]) && defined($array2[$i2])) {
        if ($array1[$i1] eq $array2[$i2]) {
            print "$array1[$i1] is repeated at position array1[$i1] and array2[$i2]$/";
            $i1++;
            $i2++;
        }
        if ($array1[$i1] > $array2[$i2]) {
            $i2++;
        } else {
            $i1++;
        }
    }
    Note that the code assumes that each "key" in the individual arrays are unique. Otherwise, it is possible that it'll skip over duplicated keys in cases such as:
    @array1 = ( "a", "a", ... );
    @array2 = ( "a", "b", ... );

    Results: a is repeated at position array1[0] and array2[0]
    It will catch the first instance where the first a's appear. But, it will not find anything wrong with the a at $array1[1]. Of course, this case can easily be detected.
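
    One way that case could be handled (a hedged sketch, not tested against the code above): after reporting a match, advance each index past any run of equal keys so repeats are consumed too.

```perl
# Small invented inputs; array1 contains a repeated key.
my @array1 = sort qw( a a b c );
my @array2 = sort qw( a b d );
my ($i1, $i2) = (0, 0);
my @dups;
while ($i1 < @array1 && $i2 < @array2) {
    my $cmp = $array1[$i1] cmp $array2[$i2];
    if ($cmp == 0) {
        push @dups, $array1[$i1];
        # Skip past any repeats of this key in either array.
        my $key = $array1[$i1];
        $i1++ while $i1 < @array1 && $array1[$i1] eq $key;
        $i2++ while $i2 < @array2 && $array2[$i2] eq $key;
    }
    elsif ($cmp > 0) { $i2++ }
    else             { $i1++ }
}
print "@dups\n";    # a b
```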

    Updated: a friend thought my explanation was a bit on the terrible side so I added another sentence and reworded the part about comparisons.

    antirice    
    The first rule of Perl club is - use Perl
    The
    ith rule of Perl club is - follow rule i - 1 for i > 1

      Why are you using eq in one comparison and > in the other? Besides, using cmp you can avoid the duplicate comparison altogether. Your looping condition is not robust, either.
      my @array1 = sort <$file1>;
      my @array2 = sort <$file2>;
      my ($l, $r) = (0) x 2;
      while ($l < @array1 and $r < @array2) {
          if (my $b = $array1[$l] cmp $array2[$r]) {
              ++($b == 1 ? $r : $l);
          }
          else {
              print "Dupe: $array1[$l]\n";
              ++$l;
              ++$r;
          }
      }

      Makeshifts last the longest.

        Thanks for pointing out the inadequacies of my code. I didn't think of using cmp as I haven't used it very frequently. :-/

        antirice    
        The first rule of Perl club is - use Perl
        The
        ith rule of Perl club is - follow rule i - 1 for i > 1

Re: Finding duplicate keys
by graff (Chancellor) on Apr 05, 2003 at 02:18 UTC
    If you have two plain text files where one file looks like this:
    key1 key3 key4 key6 ...
    and the other file looks like this:
    key2 key3 key4 key5 ...
    then you might try downloading this script that I posted here under "Code/Utilities". To find the keys that are common to both files, the command line is:
    cmpcol -i file1 file2
    To find the union:
    cmpcol -u file1 file2
    To find the keys that are unique to file1 (or file2):
    cmpcol -x1 file1 file2    # what's uniq to file1
    cmpcol -x2 file1 file2    # what's uniq to file2
    cmpcol -x  file1 file2    # all uniq items, tagged by source -- i.e.:
        key1 <1
        key2 <2
        key5 <2
        key6 <1
    It also handles multi-column data (use "file1:4" to use column 4 of file1 as the key field in that file), and supports perl regex field delimiters (e.g. -d '[\s:;,.]+' to split each line of each file at any combination of whitespace and/or punctuation).
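
    The key-extraction idea behind those options can be sketched in a few lines of plain Perl (the sample line, delimiter pattern, and column number are illustrative only, not cmpcol's actual internals):

```perl
# A made-up input line with fields separated by mixed punctuation and whitespace.
my $line = "alice: 42;engineering, key0007";

# Split on any run of whitespace and/or punctuation,
# in the spirit of -d '[\s:;,.]+' above.
my @fields = split /[\s:;,.]+/, $line;

my $key = $fields[3];    # column 4, counting from 1
print "$key\n";          # key0007
```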

    It has more bells and whistles... I hope it helps.

Re: Finding duplicate keys
by Anonymous Monk on Apr 05, 2003 at 18:07 UTC
    Here is a solution that I use when I have combined all the data into one file.
    #!/usr/bin/perl -w
    use strict;
    use vars qw ($outfile @array);

    open(DATA, $ARGV[0]) or die "Couldn't open $ARGV[0]: $!";
    $outfile = "whatever.txt";
    open(OUT, ">$outfile") || die "Can't open $outfile for creation: $!\n";

    ##############
    ############## This section could be used if one doesn't care about order
    #my %uniq;
    #$uniq{$_}++ while <DATA>;
    #print OUT for keys %uniq;
    ##############
    ##############

    my (@array, %hash);
    while (<DATA>) {
        push (@array, $_) unless (defined($hash{$_}));
        $hash{$_} = 1;
    }
    print OUT join("", @array);
    I know you have many other useful answers but thought I would post it anyways,

    Dr.J
Re: Finding duplicate keys
by Oberon (Monk) on Apr 06, 2003 at 19:58 UTC

    > but I suspect there's a more elegant way to do this (perhaps even as a one-liner).

    Well, not that there's anything wrong with everyone else's responses, but I just thought I'd take a crack at it as a one-liner:

    perl -lne 'exists $h{$_} ? print "$_ in $h{$_} and $ARGV" : ($h{$_} = $ARGV)' file1 file2

    That's only semi-tested, but it seems to work for me. And you could send it as many files as you liked (although it wouldn't be smart enough to say something like "key in file1 and file2 and file3"; it would instead say "key in file1 and file2" and then later "key in file1 and file3").

    This assumes the entire line is the key, which probably isn't the case. Here's a slightly modified version which might, say, catch duplicate uids in two Unix password files:

    perl -F: -lane 'exists $h{$F[2]} ? print "$F[2] in $h{$F[2]} and $ARGV" : ($h{$F[2]} = $ARGV)' passwd1 passwd2
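
    On the multi-file point above: if you did want a single "key in file1 and file2 and file3" report, one sketch (no longer a one-liner) is to collect the set of files per key and report after reading everything. The file contents here are hardcoded for illustration; in real use you'd fill %seen_in from <> using $ARGV, as the one-liners do.

```perl
# Hypothetical file contents standing in for file1..file3.
my %file = (
    file1 => [qw( red green blue )],
    file2 => [qw( green blue )],
    file3 => [qw( blue yellow )],
);

# Record, for every key, the set of files it appears in.
my %seen_in;
for my $name (sort keys %file) {
    $seen_in{$_}{$name} = 1 for @{ $file{$name} };
}

# Report keys seen in more than one file, naming all of them.
for my $key (sort keys %seen_in) {
    my @files = sort keys %{ $seen_in{$key} };
    print "$key in @files\n" if @files > 1;
}
# blue in file1 file2 file3
# green in file1 file2
```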

    I'm sure the wiser monks will jump in if there's some "gotcha" I missed ...