Identifying duplicates in array or hash based on a subset of data

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Identifying duplicates in array or hash based on a subset of data by duyet (Friar) on Aug 17, 2016 at 11:52 UTC
I guess there are many ways to do it, but I was thinking of creating a hash using the type and pos as the key: `my $hash = {}; foreach my $line ( <DATA> ) { my ( $id, $type, $pos ) = split /\s+/, $line; $hash->{ $id } = { id => $id, type => $type, pos => $pos, }; } my $dup_hash = {}; foreach my $id ( keys %{ $hash } ) { my $type_pos = $hash->{ $id }{type} . '_' . $hash->{ $id }{pos}; $dup_hash->{ $type_pos }{count}++; $dup_hash->{ $type_pos }{id} = $id; }` You can then use `$dup_hash` to check for duplicates etc. `$dup_hash = { '1_10' => { 'count' => 1, 'id' => '1' }, '1_11' => { 'count' => 2, 'id' => '2' }, '1_15' => { 'count' => 1, 'id' => '4' }, '2_5' => { 'count' => 2, 'id' => '5' }, '2_7' => { 'count' => 1, 'id' => '7' } }`
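To pull the duplicates out of a structure shaped like `$dup_hash`, one option is to filter its keys on `count`. This is a minimal sketch, assuming the same `type_pos` key layout as in the reply; the sub name `duplicate_keys` is made up for illustration. Note that `id` holds only the last id stored for each key, so this reports which type/pos pairs repeat, not every id involved.

```perl
use strict;
use warnings;

# Hypothetical helper: return the "type_pos" keys that were seen more than once.
sub duplicate_keys {
    my ($dup_hash) = @_;
    my @dups = grep { $dup_hash->{$_}{count} > 1 } keys %{$dup_hash};
    return sort @dups;
}

# Sample data in the shape shown in the reply (assumed, abbreviated).
my $dup_hash = {
    '1_10' => { count => 1, id => '1' },
    '1_11' => { count => 2, id => '2' },
    '2_5'  => { count => 2, id => '5' },
};

print "$_ appears $dup_hash->{$_}{count} times\n"
    for duplicate_keys($dup_hash);
```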
Re: Identifying duplicates in array or hash based on a subset of data by choroba (Cardinal) on Aug 17, 2016 at 12:19 UTC
You can use hashes of hashes (of hashes): `#!/usr/bin/perl use warnings; use strict; use feature qw{ say }; *ARGV = *DATA{IO} unless @ARGV; my (%all, %duplicates); <>; # Skip header. while (<>) { my ($id, $type, $pos) = split; undef @{ $duplicates{$type}{$pos} }{ $id, $all{$type}{$pos} } if exists $all{$type}{$pos}; $all{$type}{$pos} = $id; } say join ' ', 'Duplicates:', join '; ', map { join ', ', map keys %$_, values %$_ } values %duplicates; __DATA__ ID Type Pos 1 1 10 2 1 11 3 1 11 4 1 15 5 2 5 6 2 5 7 2 7`
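The hash-slice line in that reply is terse, so here is the same idea spelled out a little more. This is a sketch under assumptions: the sub name `find_duplicates` and the `[id, type, pos]` row format are invented for illustration, and the slice is written as an explicit assignment of an empty list, which should have the same effect of creating the keys.

```perl
use strict;
use warnings;

# Hypothetical helper: takes [id, type, pos] rows and returns a hash ref
# of the form type => pos => { id => undef, ... }, containing only the
# type/pos pairs that occur more than once.
sub find_duplicates {
    my @rows = @_;
    my (%all, %duplicates);
    for my $row (@rows) {
        my ($id, $type, $pos) = @$row;
        if (exists $all{$type}{$pos}) {
            # Hash slice: create one key for the current id and one for
            # the id previously stored under this type/pos.
            @{ $duplicates{$type}{$pos} }{ $id, $all{$type}{$pos} } = ();
        }
        $all{$type}{$pos} = $id;    # remember the latest id for this pair
    }
    return \%duplicates;
}

my $dups = find_duplicates(
    [1, 1, 10], [2, 1, 11], [3, 1, 11], [4, 1, 15],
    [5, 2, 5],  [6, 2, 5],  [7, 2, 7],
);

for my $type (sort { $a <=> $b } keys %$dups) {
    for my $pos (sort { $a <=> $b } keys %{ $dups->{$type} }) {
        my @ids = sort { $a <=> $b } keys %{ $dups->{$type}{$pos} };
        print "type $type pos $pos: @ids\n";
    }
}
```

Because the ids are hash keys, an id that takes part in three or more collisions is still listed only once.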
Re: Identifying duplicates in array or hash based on a subset of data by BillKSmith (Monsignor) on Aug 17, 2016 at 13:08 UTC
Store the raw data in an array of arrays. Store the duplicate information in a hash. Combine type and position to form a single key. The value is a reference to an array of indices into the raw data. `use strict; use warnings; my @raw_data; my %dups; my $i = -1; <DATA>; # skip header while (my $line = <DATA>) { my ($id, $type, $pos) = split /\s+/, $line; $raw_data[++$i] = [$id, $type, $pos]; my $key = "$type:$pos"; $dups{$key} = [] if !exists $dups{$key}; push @{$dups{$key}}, $i; } foreach my $entry (@raw_data) { my $key = "$entry->[1]:$entry->[2]"; print "@$entry\n" if (@{$dups{$key}} > 1); } __DATA__ ID Type Pos 1 1 10 2 1 11 3 1 11 4 1 15 5 2 5 6 2 5` Output: `2 1 11 3 1 11 5 2 5 6 2 5` Bill
Re: Identifying duplicates in array or hash based on a subset of data by johngg (Canon) on Aug 17, 2016 at 22:18 UTC
Use a HoA with the key being Type & Pos padded with leading zeros and concatenated for easy sorting, the value being an anonymous array onto which the original lines are pushed. Then use grep and sort to get those keys with duplicate lines in ascending Pos within Type order as per the original data and print out. `johngg@shiraz:~/perl/Monks > perl -Mstrict -Mwarnings -E ' open my $inFH, q{<}, \ <<EOD or die $!; ID Type Pos 1 1 10 2 1 11 3 1 11 4 1 15 5 2 5 6 2 5 7 2 7 EOD my $hdrs = <$inFH>; my %tp; foreach ( <$inFH> ) { my $key = sprintf q{%09d:%09d}, ( split )[ 1, 2 ]; push @{ $tp{ $key } }, $_; } my @dupKeys = sort grep { scalar @{ $tp{ $_ } } > 1 } keys %tp; print @{ $tp{ $_ } } for @dupKeys;' 2 1 11 3 1 11 5 2 5 6 2 5 johngg@shiraz:~/perl/Monks >` I hope this is of interest. Update: ++ Marshall - goodness only knows what I was thinking, `foreach ( <$inFH> )` should of course be `while ( <$inFH> )`. That's what happens when you retire and hardly do any coding for months :-/ Cheers, JohnGG
Re^2: Identifying duplicates in array or hash based on a subset of data by Marshall (Canon) on Aug 18, 2016 at 00:17 UTC
Hi johngg! I liked your post. I see what you did with the sprintf. The OP may not understand, so I include a demo for him (you already know this) showing why leading zeroes are necessary to get the "right" numeric result with an alpha sort. I didn't see the need for this, but you bring up a valid point if this matters. `#!/usr/bin/perl use warnings; use strict; # Simple alpha sort produces wrong numeric order here my @test = (qw/1 12 10 100 /); @test = sort @test; print "@test\n"; #prints "1 10 100 12" # With leading zeroes, we get "right" numeric answer @test = (qw/001 012 010 100/); @test = sort @test; print "@test\n"; #prints "001 010 012 100"` I have a small quibble with this line: `foreach ( <$inFH> )`. With this syntax, Perl evaluates `<$inFH>` in list context, so it builds a list of every line from `$inFH` and then processes that list. That will use more memory than a `while (<$inFH>){}` construct, which reads one line at a time from the file handle. No biggie for small files, but this matters for "big" files.
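If padding the keys feels awkward, an alternative is to leave them unpadded and give `sort` a numeric comparison. This is a sketch, not what either poster wrote: it assumes keys of the form `type:pos`, and the comparator name `by_type_then_pos` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical comparator: order "type:pos" keys numerically by type,
# then by pos within the same type.
sub by_type_then_pos {
    my ($x, $y) = @_;
    my ($x_type, $x_pos) = split /:/, $x;
    my ($y_type, $y_pos) = split /:/, $y;
    return $x_type <=> $y_type || $x_pos <=> $y_pos;
}

my @keys   = qw(1:12 1:100 2:5 1:10 10:1);
my @sorted = sort { by_type_then_pos($a, $b) } @keys;

print "@sorted\n";    # prints "1:10 1:12 1:100 2:5 10:1"
```

The trade-off is that the keys are split again on every comparison, which the zero-padded string sort avoids.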
Re: Identifying duplicates in array or hash based on a subset of data by Marshall (Canon) on Aug 17, 2016 at 21:15 UTC
You can store the data as a HoA (Hash of Arrays). The keys are "type pos" and the value is an array of ids. The number of ids in the array gives you the count. The ids themselves allow the original data line to be reconstructed. The OP said, "Normally I would simply use an incrementing hash to detect duplicate entries, looking for any values >1". The code below is essentially that idea, except rather than incrementing a simple scalar, a new element is pushed onto an array. `#!/usr/bin/perl use strict; use warnings; <DATA>; #throw away first line my %ids; #Hash of Array "$type $pos" => @ids while (<DATA>) { my ($id,$type,$pos) = split; push @{$ids{"$type $pos"}}, $id; } foreach my $key (sort keys %ids) { next if @{$ids{$key}} == 1; foreach my $id (@{$ids{$key}}) { print "$id $key\n"; } } =prints: 2 1 11 3 1 11 5 2 5 6 2 5 =cut __DATA__ ID Type Pos 1 1 10 2 1 11 3 1 11 4 1 15 5 2 5 6 2 5 7 2 7`
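For contrast, the plain incrementing-counter version that the OP describes might look like the sketch below. It assumes whitespace-separated `id type pos` lines with the header already removed; the sub name `duplicate_lines` is invented for illustration. Since a counter does not remember which lines contributed to it, a second pass over the input is needed to print them.

```perl
use strict;
use warnings;

# Hypothetical helper: return the input lines whose type/pos pair occurs
# more than once, in their original order.
sub duplicate_lines {
    my @lines = @_;
    my %count;

    # Pass 1: count each "type pos" pair.
    for my $line (@lines) {
        my (undef, $type, $pos) = split ' ', $line;
        $count{"$type $pos"}++;
    }

    # Pass 2: keep the lines whose pair was counted more than once.
    return grep {
        my (undef, $type, $pos) = split ' ', $_;
        $count{"$type $pos"} > 1;
    } @lines;
}

my @data = ('1 1 10', '2 1 11', '3 1 11', '4 1 15', '5 2 5', '6 2 5', '7 2 7');
print "$_\n" for duplicate_lines(@data);
```

This keeps the lines in input order without sorting, at the cost of holding all lines in memory or reading the file twice.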