Re: Finding duplicate keys
by Ovid (Cardinal) on Apr 04, 2003 at 16:18 UTC
If you've got them in two arrays, don't use a nested loop: that's over 3 million comparisons you'll have to make. Instead, use a hash for one of them.
my %search;
@search{ @keys2 } = undef;
my @found;
foreach my $key (@keys1) {
    push @found => $key if exists $search{$key};
}
When that's done, @found will have the duplicate keys. You'll only have a few thousand iterations instead of a few million.
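The hash-lookup approach above as a self-contained sketch; the sample key lists are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two key lists.
my @keys1 = qw(alpha beta gamma delta);
my @keys2 = qw(beta delta epsilon);

# Build a lookup hash from one list. The values are irrelevant;
# only key existence matters, so a slice assignment to undef is cheap.
my %search;
@search{ @keys2 } = undef;

# One pass over the other list, with O(1) exists() checks.
my @found = grep { exists $search{$_} } @keys1;

print "@found\n";  # beta delta
```

This is a few thousand hash lookups rather than millions of pairwise comparisons.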
Cheers,
Ovid
New address of my CGI Course.
Silence is Evil (feel free to copy and distribute widely - note copyright text)
Re: Finding duplicate keys
by diotalevi (Canon) on Apr 04, 2003 at 16:20 UTC
Fixed a bug - replaced the hash value undef() with the key name.
Given @list_a and @list_b I construct an anonymous hash of @list_a and then immediately extract a slice from that hash using @list_b as the keys to fetch. @dups now contains the intersection of both lists.
@dups = @{ { map { $_, $_ } @list_a } }{ @list_b }
# Broken out.
# Duplicate each key - the key must also be the value since it will be
# retrieved later. This creates a list usable as a hash.
map { $_, $_ } @list_a
# Create an anonymous hash from that list
{ ... }
# Fetch a slice from a hash ref using @list_b
@{ ... }{ @list_b }
# Into dups
@dups = ...
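A quick sanity check of the idiom, with invented lists. One caveat worth noting: keys of @list_b that are absent from @list_a come back as undef in the slice, so this sketch adds a grep defined to keep only genuine matches:

```perl
use strict;
use warnings;

my @list_a = qw(a b c d);
my @list_b = qw(b d e);

# Same slice idiom as above, plus a grep: slice entries for keys
# that are not in @list_a are undef, and grep defined discards them.
my @dups = grep defined,
           @{ { map { $_, $_ } @list_a } }{ @list_b };

print "@dups\n";  # b d
```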
Re: Finding duplicate keys
by dga (Hermit) on Apr 04, 2003 at 16:21 UTC
my %keys;
while (<>)
{
    chomp;   # chomp only strips a trailing newline; chop would eat a real character on an unterminated last line
    $keys{$_}++;
}
foreach my $k ( keys %keys )
{
    print "$k: $keys{$k}\n";
}
This will count the times a particular input line was seen and then print them out in no apparent order. A sort could be added to the foreach if you would like an order; for example, sort { $keys{$b} <=> $keys{$a} } keys %keys gives descending order by count of duplicates.
Update: Also note that this stores only the comparison data and makes only a single pass through the files.
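The counting approach above, restricted to true duplicates and sorted by descending count, might look like this (reading from an in-memory sample rather than <> so the sketch runs standalone):

```perl
use strict;
use warnings;

# Stand-in for file input.
my @lines = qw(key1 key2 key1 key3 key2 key1);

my %keys;
$keys{$_}++ for @lines;

# Print only lines seen more than once, highest count first.
for my $k ( sort { $keys{$b} <=> $keys{$a} }
            grep { $keys{$_} > 1 } keys %keys ) {
    print "$k: $keys{$k}\n";
}
```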
•Re: Finding duplicate keys
by merlyn (Sage) on Apr 04, 2003 at 17:04 UTC
If you know the keys are unique within each table, one shortcut might be to notice the second time you saw each key in a scan of both:
my @dups = do {
    my %count;
    grep ++$count{$_} == 2, @keys_from_one, @keys_from_two;
};
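A self-contained run of the idea, with invented key lists (each duplicate-free within itself, as the trick requires):

```perl
use strict;
use warnings;

my @keys_from_one = qw(a b c);
my @keys_from_two = qw(b c d);

# ++$count{$_} == 2 fires exactly on the second sighting of a key,
# which can only happen across the two lists if each list is
# duplicate-free on its own.
my @dups = do {
    my %count;
    grep ++$count{$_} == 2, @keys_from_one, @keys_from_two;
};

print "@dups\n";  # b c
```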
-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
This piece of code really has a performance issue, as internally it uses brute force. I benchmarked it against Ovid's code: this one took 93 seconds where Ovid's took only 6 seconds, for the same array size.
my @keys1 = (1 .. 1000000);   # array names would be modified to match the plug-in code
my @keys2;
for (my $n = 2; $n < 1000000; $n += 2) {
    push @keys2, $n;          # array names would be modified to match the plug-in code
}
my $t0 = time();
#plug the code here
print time() - $t0;
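For repeatable timings, the core Benchmark module can compare the two approaches directly. A sketch, with the array sizes reduced so it finishes quickly:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @keys1 = 1 .. 50_000;
my @keys2 = grep { $_ % 2 == 0 } 1 .. 50_000;

cmpthese(10, {
    # Ovid's hash-lookup approach.
    hash_lookup => sub {
        my %search;
        @search{ @keys2 } = undef;
        my @found = grep { exists $search{$_} } @keys1;
    },
    # merlyn's single-scan counting approach.
    one_scan => sub {
        my %count;
        my @dups = grep ++$count{$_} == 2, @keys1, @keys2;
    },
});
```

cmpthese prints a rate table comparing the two subs; both are linear-time, unlike a nested loop.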
Re: Finding duplicate keys
by rir (Vicar) on Apr 04, 2003 at 16:26 UTC
while ( <FILE1> ) {
    $key = extract_key_from_line( $_ );
    $file1_keys{$key}++;
}
while ( <FILE2> ) {
    $check_me = extract_key_from_line( $_ );
    do_something( $check_me ) if exists $file1_keys{$check_me};
}
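The same two-pass structure, made runnable with in-memory data and a trivial key extractor; both the sample lines and the first-field extractor are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two files; the key is the first field.
my @file1 = ("k1 foo\n", "k2 bar\n", "k3 baz\n");
my @file2 = ("k2 qux\n", "k4 quux\n");

sub extract_key_from_line { (split ' ', $_[0])[0] }

# Pass 1: record every key from the first file.
my %file1_keys;
$file1_keys{ extract_key_from_line($_) }++ for @file1;

# Pass 2: report keys from the second file already seen in the first.
for (@file2) {
    my $check_me = extract_key_from_line($_);
    print "dup: $check_me\n" if exists $file1_keys{$check_me};
}
```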
Re: Finding duplicate keys
by antirice (Priest) on Apr 05, 2003 at 02:14 UTC
A somewhat straightforward solution may be to just sort both arrays (I don't know whether order matters; you said key, so I suspect it doesn't), keep two variables holding the current indices, and compare the current keys as you walk along. This reduces the number of comparisons to as few as scalar @shorterArray and as many as scalar @shorterArray + scalar @longerArray. As an added bonus, only two extra variables are required and there's no memory overhead for hashes. Anyhow, as my explanations are usually said to be terrible, the following may be what you're looking for:
Untested code follows:
my @array1 = sort getArray1;
my @array2 = sort getArray2;
my ($i1,$i2) = (0,0);
while (defined($array1[$i1]) && defined($array2[$i2])) {
    if ($array1[$i1] eq $array2[$i2]) {
        print "$array1[$i1] is repeated at position array1[$i1] and array2[$i2]$/";
        $i1++;
        $i2++;
    }
    if ($array1[$i1] > $array2[$i2]) { $i2++; }
    else { $i1++; }
}
Note that the code assumes that each "key" in the individual arrays is unique. Otherwise, it is possible that it'll skip over duplicated keys in cases such as:
@array1 = ( "a", "a", ... );
@array2 = ( "a", "b", ... );
Results:
a is repeated at position array1[0] and array2[0]
It will catch the first instance, where the first a's appear, but it will not find anything wrong with the a at $array1[1]. Of course, this case can easily be detected.
Updated: a friend thought my explanation was a bit on the terrible side so I added another sentence and reworded the part about comparisons.
antirice
The first rule of Perl club is - use Perl
The ith rule of Perl club is - follow rule i - 1 for i > 1
Why are you using eq in one comparison and > in the other? Besides, using cmp you can avoid the duplicate comparison altogether. Your looping condition is not robust, either.
my @array1 = sort <$file1>;
my @array2 = sort <$file2>;
my ($l,$r) = (0) x 2;
while ($l < @array1 and $r < @array2) {
    if (my $b = $array1[$l] cmp $array2[$r]) {
        ++($b == 1 ? $r : $l);
    }
    else {
        print "Dupe: $array1[$l]\n";
        ++$l;
        ++$r;
    }
}
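Run against two invented pre-sorted lists, the cmp-based merge visits each element at most once (the variable is renamed $cmp here to avoid shadowing sort's package variable $b):

```perl
use strict;
use warnings;

# Pre-sorted sample data, made up for this sketch.
my @array1 = qw(a b d f);
my @array2 = qw(b c d g);

my @dupes;
my ($l, $r) = (0) x 2;
while ($l < @array1 and $r < @array2) {
    # cmp yields -1, 0 or 1; a true value means no match here.
    if (my $cmp = $array1[$l] cmp $array2[$r]) {
        ++($cmp == 1 ? $r : $l);   # advance the side with the smaller key
    }
    else {
        push @dupes, $array1[$l];
        print "Dupe: $array1[$l]\n";
        ++$l;
        ++$r;
    }
}
# Finds: b d
```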
Makeshifts last the longest.
Re: Finding duplicate keys
by graff (Chancellor) on Apr 05, 2003 at 02:18 UTC
If you have two plain text files where one file looks like this:
key1
key3
key4
key6
...
and the other file looks like this:
key2
key3
key4
key5
...
then you might try downloading this script that I posted here under "Code/Utilities". To find the keys that are common to both files, the command line is:
cmpcol -i file1 file2
To find the union:
cmpcol -u file1 file2
To find the keys that are unique to file1 (or file2):
cmpcol -x1 file1 file2 # what's uniq to file1
cmpcol -x2 file1 file2 # what's uniq to file2
cmpcol -x file1 file2 # all uniq items, tagged by source -- i.e.:
key1 <1
key2 <2
key5 <2
key6 <1
It also handles multi-column data (use "file1:4" to use column 4 of file1 as the key field in that file), and supports perl regex field delimiters (e.g. -d '[\s:;,.]+' to split each line of each file at any combination of whitespace and/or punctuation).
It has more bells and whistles... I hope it helps.
Re: Finding duplicate keys
by Anonymous Monk on Apr 05, 2003 at 18:07 UTC
Here is a solution that I use when I have combined all the data into one file.
#!/usr/bin/perl -w
use strict;

my $outfile = "whatever.txt";
open(DATA, $ARGV[0]) or die "Couldn't open $ARGV[0]: $!";
open(OUT, ">$outfile") or die "Can't open $outfile for creation: $!\n";

##############
############## This section could be used if one doesn't care about order
# my %uniq;
# $uniq{$_}++ while <DATA>;
# print OUT for keys %uniq;
##############
##############
my (@array, %hash);
while (<DATA>) {
    push(@array, $_) unless defined $hash{$_};
    $hash{$_} = 1;
}
print OUT join("", @array);
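The loop above is the classic order-preserving dedup; a compact equivalent is the well-known seen-hash grep idiom, sketched here with in-memory data standing in for the file:

```perl
use strict;
use warnings;

# Stand-in for the combined file contents.
my @lines = qw(a b a c b a);

# !$seen{$_}++ is true only the first time a value is encountered,
# so first occurrences are kept in order and later repeats dropped.
my %seen;
my @uniq = grep { !$seen{$_}++ } @lines;

print "@uniq\n";  # a b c
```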
I know you have many other useful answers, but I thought I would post it anyway,
Dr.J
Re: Finding duplicate keys
by Oberon (Monk) on Apr 06, 2003 at 19:58 UTC
> but I suspect there's a more elegant way to do this (perhaps even as a one-liner).
Well, not that there's anything wrong with everyone else's responses, but I just thought I'd take a crack at it as a one-liner:
perl -lne 'exists $h{$_} ? print "$_ in $h{$_} and $ARGV" : ($h{$_} = $ARGV)' file1 file2
That's only semi-tested, but it seems to work for me. And you could send it as many files as you liked (although it wouldn't be smart enough to say something like "key in file1 and file2 and file3"; it would instead say "key in file1 and file2" and then later "key in file1 and file3").
This assumes the entire line is the key, which probably isn't the case. Here's a slightly modified version which might, say, catch duplicate uids in two Unix password files:
perl -F: -lane 'exists $h{$F[2]} ? print "$F[2] in $h{$F[2]} and $ARGV" : ($h{$F[2]} = $ARGV)' passwd1 passwd2
I'm sure the wiser monks will jump in if there's some "gotcha" I missed ...
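Unrolled into a script, the first one-liner is roughly equivalent to the following sketch; the "files" here are faked as in-memory arrays so it runs standalone, where the real one-liner would read <> and use $ARGV:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for file1 and file2.
my %files = (
    file1 => [qw(k1 k2 k3)],
    file2 => [qw(k2 k4)],
);

my %h;        # key => first file it was seen in
my @report;
for my $file (sort keys %files) {
    for my $key (@{ $files{$file} }) {
        if (exists $h{$key}) {
            push @report, "$key in $h{$key} and $file";
            print "$report[-1]\n";
        }
        else {
            $h{$key} = $file;
        }
    }
}
# Prints: k2 in file1 and file2
```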