How to check for duplicate entries

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to check for duplicate entries by rhesa (Vicar) on Aug 12, 2006 at 13:35 UTC
You could use a single hash, using the Ob2 value as key, and a counter as the value. Keys that end up with a value of 1 are unique to a single Ob1, while keys with a value greater than 1 are common to both. Here's some code: #!/usr/bin/perl my %ob2; while( <DATA> ) { my @v = split; $ob2{ $v[3] } ++; } print "Duplicate Ob2: ", join ', ', sort grep { $ob2{ $_ } > 1 } keys %ob2; __DATA__ HIT object1 563.43.78 object3 123.89.7777 HIT object1 563.43.78 object10 123.89.7777 HIT object1 563.43.78 object2 453.78.122 HIT object1 563.43.78 object5 457.8888.1 HIT object1 563.43.78 object4 123.89.7777 HIT object1 563.43.78 object6 566.2222.11 HIT object2 563.43.78 object3 123.89.7777 HIT object2 563.43.78 object7 456.222.1111 HIT object2 563.43.78 object8 990.7777.66 HIT object2 563.43.78 object5 457.8888.1 HIT object2 563.43.78 object13 123.89.7777 HIT object2 563.43.78 object9 1223.333.111 [download] That prints: `Duplicate Ob2: object3, object5` [download]	[reply] [d/l] [select]
Re: How to check for duplicate entries by liverpole (Monsignor) on Aug 12, 2006 at 15:47 UTC
Hi Angharad, I originally thought your question was fairly simple, but it turned to have some complexity. If I understand what you're asking for, it doesn't matter what Ob1code and Ob2code are paired together in each line, just the total group of Ob1code values and Ob2code values. Additionally, I'm assuming (because it's true in your example) that each distinct Ob2 always has the same Ob2code value associated with it, which means you can use a simple hash to save those values. Here is my solution. Note that I've added 4 lines at the end, since your data set did NOT have any matches for either object1 nor object2. The program merely prints out whether the data matches one or the other of Ob1, or both, or neither; I will leave it to you to assign the results to an appropriate data structure for further processing. #!/usr/bin/perl # Strict use strict; use warnings; # Libraries use Data::Dumper; # Data structures my $pobject1 = { '1' => { }, '2' => { } }; my %object2 = ( ); # Main program # Read all data while (<DATA>) { # Process line only if it matches template if (/^HIT\s+object(\d+)\s+(\S+)\s+object(\d+)\s+(\S+)/) { my ($ob1, $ob1code, $ob2, $ob2code) = ($1, $2, $3, $4); # Save ob1/ob2 info $pobject1->{$ob1}->{$ob1code}++; $object2{$ob2} = $ob2code; } } # # Debugging ... Display results thus far # printf "(debug) %s\n", Dumper($pobject1); # # (debug) $VAR1 = { # '1' => { # '563.43.78' => 6 # }, # '2' => { # '563.43.78' => 6 # } # }; # printf "(debug) %s\n", Dumper(\%object2); # # (debug) $VAR1 = { # '6' => '566.2222.11', # '3' => '123.89.7777', # '7' => '456.222.1111', # '9' => '1223.333.111', # '2' => '453.78.122', # '8' => '990.7777.66', # '4' => '111.222.333', # '13' => '123.89.7777', # '10' => '123.89.7777', # '5' => '457.8888.1' # }; # # Determine which ob2s match both, or just one of the ob1s my @ob2keys = keys %object2; foreach my $ob2 (@ob2keys) { my $ob2code = $object2{$ob2}; my $match1 = exists $pobject1->{'1'}->{$ob2code}; my $match2 = exists $pobject1->{'2'}->{$ob2code}; printf "Ob2 'object#$ob2' Ob2code '$ob2code': "; if ($match1 and $match2) { print "matches BOTH object1 and object2\n"; } elsif ($match1) { print "matches ONLY object1\n"; } elsif ($match2) { print "matches ONLY object2\n"; } else { print "doesn't match either Ob1\n"; } } __DATA__ HIT object1 563.43.78 object3 123.89.7777 HIT object1 563.43.78 object10 123.89.7777 HIT object1 563.43.78 object2 453.78.122 HIT object1 563.43.78 object5 457.8888.1 HIT object1 563.43.78 object4 123.89.7777 HIT object1 563.43.78 object6 566.2222.11 HIT object2 563.43.78 object3 123.89.7777 HIT object2 563.43.78 object7 456.222.1111 HIT object2 563.43.78 object8 990.7777.66 HIT object2 563.43.78 object5 457.8888.1 HIT object2 563.43.78 object13 123.89.7777 HIT object2 563.43.78 object9 1223.333.111 HIT object1 453.78.122 object12 111.222.333 HIT object2 566.2222.11 object14 333.222.111 HIT object1 990.7777.66 object12 111.222.333 HIT object2 990.7777.66 object14 333.222.111 [download] s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/	[reply] [d/l]
Re: How to check for duplicate entries by perlfan (Parson) on Aug 12, 2006 at 13:24 UTC
Why not create a hash using 'Ob2code' as the key that points to an array storing the '0b1's that are associated with a particular 'Ob2code'? I am reasonably sure that this would be an O(n) proceedure, since you would have to run though the list only once to make the proper assignments...not sure if really solves the problem, though.	[reply]
Re: How to check for duplicate entries by graff (Chancellor) on Aug 13, 2006 at 04:13 UTC
If I understand your goal and data, here's a method that uses two hashes -- one to keep track of how many different "Obj1" things there are, and the other to keep track of the distribution of "Obj2" things. It handles all three categories of "Obj2" things (common to both Obj1 things, unique to each), but it requires that the input be pre-sorted: use strict; my %obj1found; my %obj2found; while (<DATA>) { if ( / HIT \s+ (\S+ \s+ [\d.]+) \s+ (\S+ \s+ [\d.]+) /x ) { my ( $obj1, $obj2 ) = ( $1, $2 ); $obj1found{ $obj1 }++; $obj2found{ $obj2 } .= " $obj1 " unless ( $obj2found{ $obj2 } =~ / \Q$obj1\E / ); } } my $match_all = join( ' ', sort keys %obj1found ); # note: two spaces between elements print join( "\t\n", "\nList of Obj2 things found in all Obj1's:", grep { $obj2found{$_} =~ /\Q$match_all\E/ } sort keys %obj2found ), "\n"; for my $obj1 ( sort keys %obj1found ) { print join( "\t\n", "\nList of Obj2 things found only in $obj1:", grep { $obj2found{$_} =~ /^ \Q$obj1\E $/ } sort keys %obj2found ), "\n"; } __DATA__ HIT object1 563.43.78 object3 123.89.7777 HIT object1 563.43.78 object10 123.89.7777 HIT object1 563.43.78 object2 453.78.122 HIT object1 563.43.78 object5 457.8888.1 HIT object1 563.43.78 object4 123.89.7777 HIT object1 563.43.78 object6 566.2222.11 HIT object2 563.43.78 object3 123.89.7777 HIT object2 563.43.78 object7 456.222.1111 HIT object2 563.43.78 object8 990.7777.66 HIT object2 563.43.78 object5 457.8888.1 HIT object2 563.43.78 object13 123.89.7777 HIT object2 563.43.78 object9 1223.333.111 [download] This approach would generalize to any number of distinct "Obj1" things. If you have more than two, you might want to look at groupings other than "found in all Obj1 things" and "found only in a single Obj1 thing" -- that's "left as an exercise..."	[reply] [d/l]