Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of split data-lines read in from <$IN> in this generalised format:

ID Type Pos
1 1 10
2 1 11
3 1 11
4 1 15
5 2 5
6 2 5
7 2 7

These can either be pushed into an array or used to create hash keys (I'm not sure which one suits the task at hand better). My aim is to identify lines that are duplicates based only on Type + Pos. For example, from the above data set:

2 1 11
3 1 11
5 2 5
6 2 5

are duplicates. Once a duplicate is identified, the unique IDs of the two offending lines are stored for use later on in the script. Normally I would simply use an incrementing hash to detect duplicate entries, looking for any values > 1, but now that I have the ID column to handle as well, I'm not sure how to do this. Could anybody share any code or tips on how to accomplish this? Thanks!

Replies are listed 'Best First'.
Re: Identifying duplicates in array or hash based on a subset of data
by duyet (Friar) on Aug 17, 2016 at 11:52 UTC
    I guess there are many ways to do it, but I was thinking of creating a hash using the type and pos as the key:

        # Assumes the header line is not part of the data being read.
        my $hash = {};
        foreach my $line ( <DATA> ) {
            my ( $id, $type, $pos ) = split /\s+/, $line;
            $hash->{ $id } = {
                id   => $id,
                type => $type,
                pos  => $pos,
            };
        }

        my $dup_hash = {};
        foreach my $id ( keys %{ $hash } ) {
            my $type_pos = $hash->{ $id }{type} . '_' . $hash->{ $id }{pos};
            $dup_hash->{ $type_pos }{count}++;
            # Only one id is kept per key: whichever is processed last
            # (the order returned by keys() is not fixed).
            $dup_hash->{ $type_pos }{id} = $id;
        }

    You can use the $dup_hash to check for duplicates etc.

        $dup_hash = {
            '1_10' => { 'count' => 1, 'id' => '1' },
            '1_11' => { 'count' => 2, 'id' => '2' },
            '1_15' => { 'count' => 1, 'id' => '4' },
            '2_5'  => { 'count' => 2, 'id' => '5' },
            '2_7'  => { 'count' => 1, 'id' => '7' }
        }
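    The question asks for the IDs of both offending lines, while the structure above records a count and a single id per key. A minimal sketch of one way to keep every ID, assuming the same whitespace-separated input: the helper name find_duplicate_ids and the "ids_for" hash are hypothetical, not part of the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): map each "type_pos" key to
# every ID seen for it, then keep only the keys seen more than once.
sub find_duplicate_ids {
    my @lines = @_;
    my %ids_for;
    for my $line (@lines) {
        my ( $id, $type, $pos ) = split ' ', $line;
        next unless defined $pos;    # skip blank or short lines
        next if $id eq 'ID';         # skip the header line, if present
        push @{ $ids_for{"${type}_$pos"} }, $id;
    }

    # Drop the keys that occurred only once.
    delete @ids_for{ grep { @{ $ids_for{$_} } < 2 } keys %ids_for };
    return \%ids_for;
}

my $dups = find_duplicate_ids(
    'ID Type Pos', '1 1 10', '2 1 11', '3 1 11', '4 1 15',
    '5 2 5',       '6 2 5',  '7 2 7',
);

# Prints "1_11 => 2 3" and "2_5 => 5 6" for the sample data.
print "$_ => @{ $dups->{$_} }\n" for sort keys %$dups;
```

    Because the IDs are pushed in input order, each list keeps the order in which the lines were read, so the stored IDs can be used later without consulting the original data again.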
Re: Identifying duplicates in array or hash based on a subset of data
by choroba (Cardinal) on Aug 17, 2016 at 12:19 UTC
    You can use hashes of hashes (of hashes):
        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };

        *ARGV = *DATA{IO} unless @ARGV;

        my (%all, %duplicates);
        <>;  # Skip header.
        while (<>) {
            my ($id, $type, $pos) = split;
            undef @{ $duplicates{$type}{$pos} }{ $id, $all{$type}{$pos} }
                if exists $all{$type}{$pos};
            $all{$type}{$pos} = $id;
        }

        say join ' ', 'Duplicates:',
            join '; ', map { join ', ', map keys %$_, values %$_ }
                       values %duplicates;

        __DATA__
        ID Type Pos
        1 1 10
        2 1 11
        3 1 11
        4 1 15
        5 2 5
        6 2 5
        7 2 7
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Identifying duplicates in array or hash based on a subset of data
by BillKSmith (Monsignor) on Aug 17, 2016 at 13:08 UTC
    Store the raw data in an array of arrays. Store the duplicate information in a hash. Combine type and position to form a single key. The value is a reference to an array of indices into the raw data.
        use strict;
        use warnings;

        my @raw_data;
        my %dups;
        my $i = -1;
        <DATA>;    # skip header
        while (my $line = <DATA>) {
            my ($id, $type, $pos) = split /\s+/, $line;
            $raw_data[++$i] = [$id, $type, $pos];
            my $key = "$type:$pos";
            $dups{$key} = [] if !exists $dups{$key};
            push @{$dups{$key}}, $i;
        }
        foreach my $entry (@raw_data) {
            my $key = "$entry->[1]:$entry->[2]";
            print "@$entry\n" if (@{$dups{$key}} > 1);
        }
        __DATA__
        ID Type Pos
        1 1 10
        2 1 11
        3 1 11
        4 1 15
        5 2 5
        6 2 5

        OUTPUT:
        2 1 11
        3 1 11
        5 2 5
        6 2 5
    Bill
Re: Identifying duplicates in array or hash based on a subset of data
by johngg (Canon) on Aug 17, 2016 at 22:18 UTC

    Use a HoA with the key being Type & Pos padded with leading zeros and concatenated for easy sorting, the value being an anonymous array onto which the original lines are pushed. Then use grep and sort to get those keys with duplicate lines in ascending Pos within Type order as per the original data and print out.

        johngg@shiraz:~/perl/Monks > perl -Mstrict -Mwarnings -E '
        open my $inFH, q{<}, \ <<EOD or die $!;
        ID Type Pos
        1 1 10
        2 1 11
        3 1 11
        4 1 15
        5 2 5
        6 2 5
        7 2 7
        EOD

        my $hdrs = <$inFH>;
        my %tp;
        foreach ( <$inFH> )
        {
            my $key = sprintf q{%09d:%09d}, ( split )[ 1, 2 ];
            push @{ $tp{ $key } }, $_;
        }

        my @dupKeys = sort grep { scalar @{ $tp{ $_ } } > 1 } keys %tp;
        print @{ $tp{ $_ } } for @dupKeys;'
        2 1 11
        3 1 11
        5 2 5
        6 2 5
        johngg@shiraz:~/perl/Monks >

    I hope this is of interest.

    Update: ++ Marshall - goodness only knows what I was thinking, foreach ( <$inFH> ) should of course be while ( <$inFH> ). That's what happens when you retire and hardly do any coding for months :-/

    Cheers,

    JohnGG

      Hi johngg! I liked your post. I see what you did with the sprintf. The OP may not understand, so I include a demo for him (you already know this) showing why leading zeroes are necessary to get the "right" numeric result with an alpha sort. I didn't see the need for this, but you bring up a valid point if this matters.
          #!/usr/bin/perl
          use warnings;
          use strict;

          # Simple alpha sort produces wrong numeric order here
          my @test = (qw/1 12 10 100 /);
          @test = sort @test;
          print "@test\n";    # prints "1 10 100 12"

          # With leading zeros, we get the "right" numeric answer
          @test = (qw/001 012 010 100/);
          @test = sort @test;
          print "@test\n";    # prints "001 010 012 100"
      I have a small quibble with this line: foreach ( <$inFH> ). With this syntax, I figure that Perl will construct a list of stuff from $inFH and process that list. That will use more memory than a while (<$inFH>){} construct which reads one line at a time from the file handle. No biggie for small files, but this matters for "big" files.
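      The difference between the two loop forms can be sketched as follows. This is only an illustration on an in-memory filehandle with made-up data; the helper names count_lines_streaming and count_lines_slurping are hypothetical. Both return the same count, but the foreach version evaluates <$fh> in list context and so reads all remaining lines before the first iteration.

```perl
use strict;
use warnings;

# Illustrative data standing in for a real input file.
my $data = "ID Type Pos\n1 1 10\n2 1 11\n3 1 11\n";

# Hypothetical helper: reads one line per iteration (scalar context).
sub count_lines_streaming {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my $header = <$fh>;    # discard the header line
    my $count  = 0;
    while ( my $line = <$fh> ) {
        $count++;
    }
    close $fh;
    return $count;
}

# Hypothetical helper: <$fh> in list context builds the full list of
# remaining lines first, then foreach walks that list.
sub count_lines_slurping {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my $header = <$fh>;    # discard the header line
    my $count  = 0;
    foreach my $line ( <$fh> ) {
        $count++;
    }
    close $fh;
    return $count;
}

# Prints "3 3": same answer, different memory behaviour on large input.
print count_lines_streaming($data), ' ', count_lines_slurping($data), "\n";
```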
Re: Identifying duplicates in array or hash based on a subset of data
by Marshall (Canon) on Aug 17, 2016 at 21:15 UTC
    You can store the data as a HoA (Hash of Array). The keys are "type pos" and the value is an array of ids. The number of ids in the array gives you the count. The ids themselves allow the original data line to be reconstructed.

    The OP said, "Normally I would simply use an incrementing hash to detect duplicate entries, looking for any values >1". The code below is essentially that idea, except rather than incrementing a simple scalar, a new element is pushed onto an array.

        #!/usr/bin/perl
        use strict;
        use warnings;

        <DATA>;     # throw away first line
        my %ids;    # Hash of Array: "$type $pos" => @ids

        while (<DATA>) {
            my ($id, $type, $pos) = split;
            push @{$ids{"$type $pos"}}, $id;
        }

        foreach my $key (sort keys %ids) {
            next if @{$ids{$key}} == 1;
            foreach my $id (@{$ids{$key}}) {
                print "$id $key\n";
            }
        }

        =prints:
        2 1 11
        3 1 11
        5 2 5
        6 2 5
        =cut

        __DATA__
        ID Type Pos
        1 1 10
        2 1 11
        3 1 11
        4 1 15
        5 2 5
        6 2 5
        7 2 7
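    The plain sort keys %ids compares the "type pos" keys as strings, which gives the expected order for the sample data but may not for wider ranges of values (the same issue as the alpha sort demo earlier in the thread). As an alternative to zero-padding, here is a minimal sketch that sorts such keys numerically by type and then by pos; the helper name sort_type_pos_keys and the sample keys are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: order "type pos" keys numerically by type, then pos.
# Each key is split once, sorted on its numeric parts, then restored.
sub sort_type_pos_keys {
    my @keys = @_;
    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] or $a->[2] <=> $b->[2] }
           map  { [ $_, split ' ' ] }
           @keys;
}

my @keys = ( '1 100', '1 11', '10 5', '2 5' );

# String sort:  1 100, 1 11, 10 5, 2 5
print join( ', ', sort @keys ), "\n";

# Numeric sort: 1 11, 1 100, 2 5, 10 5
print join( ', ', sort_type_pos_keys(@keys) ), "\n";
```

    In the loop above, this would mean iterating over sort_type_pos_keys( keys %ids ) instead of sort keys %ids.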