in reply to Identifying duplicates in array or hash based on a subset of data

Use a HoA with keys built from Type and Pos, each padded with leading zeros and concatenated so that a plain alpha sort puts them in the right order; the values are anonymous arrays onto which the original lines are pushed. Then use grep to find the keys holding more than one line, sort them to get the duplicates in ascending Pos within Type order as per the original data, and print them out.

johngg@shiraz:~/perl/Monks > perl -Mstrict -Mwarnings -E '
open my $inFH, q{<}, \ <<EOD or die $!;
ID Type Pos
1 1 10
2 1 11
3 1 11
4 1 15
5 2 5
6 2 5
7 2 7
EOD
my $hdrs = <$inFH>;
my %tp;
foreach ( <$inFH> )
{
    my $key = sprintf q{%09d:%09d}, ( split )[ 1, 2 ];
    push @{ $tp{ $key } }, $_;
}
my @dupKeys =
    sort
    grep { scalar @{ $tp{ $_ } } > 1 }
    keys %tp;
print @{ $tp{ $_ } } for @dupKeys;'
2 1 11
3 1 11
5 2 5
6 2 5
johngg@shiraz:~/perl/Monks >

I hope this is of interest.

Update: ++ Marshall - goodness only knows what I was thinking, foreach ( <$inFH> ) should of course be while ( <$inFH> ). That's what happens when you retire and hardly do any coding for months :-/
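
For clarity, here is the read loop with that correction applied (just the loop; $inFH and %tp are as in the code above):

while ( <$inFH> )    # reads one line at a time instead of slurping the file
{
    my $key = sprintf q{%09d:%09d}, ( split )[ 1, 2 ];
    push @{ $tp{ $key } }, $_;
}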

Cheers,

JohnGG

Re^2: Identifying duplicates in array or hash based on a subset of data
by Marshall (Canon) on Aug 18, 2016 at 00:17 UTC
    Hi johngg! I liked your post. I see what you did with the sprintf. The OP may not understand, so I include a demo for him (you already know this) showing why leading zeroes are necessary to get the "right" numeric result with an alpha sort. I didn't see the need for this in my own solution, but you bring up a valid point if the sort order matters.
    #!/usr/bin/perl
    use warnings;
    use strict;

    # Simple alpha sort produces wrong numeric order here
    my @test = (qw/1 12 10 100/);
    @test = sort @test;
    print "@test\n";    # prints "1 10 100 12"

    # With leading zeroes, we get the "right" numeric answer
    @test = (qw/001 012 010 100/);
    @test = sort @test;
    print "@test\n";    # prints "001 010 012 100"
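
    For completeness, when single values are being compared, a numeric sort block gives the right order without any padding. The padding trick still earns its keep above because it lets one plain string sort order the composite Type:Pos keys in a single pass:

    #!/usr/bin/perl
    use warnings;
    use strict;

    # A numeric comparator sorts the values as numbers,
    # so no leading zeroes are needed
    my @test = qw/1 12 10 100/;
    @test = sort { $a <=> $b } @test;
    print "@test\n";    # prints "1 10 12 100"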
    I have a small quibble with this line: foreach ( <$inFH> ). With that syntax, Perl reads every remaining line from $inFH into a list and then iterates over that list. That uses more memory than a while ( <$inFH> ) { } construct, which reads one line at a time from the file handle. No biggie for small files, but this matters for "big" files.
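
    Here's a minimal sketch of the difference (the file name is just for illustration); both loops print the same lines, but the first builds the whole list in memory before the loop body ever runs:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $fh, q{<}, 'lines.txt' or die $!;    # hypothetical input file

    # List context: <$fh> reads every remaining line up front
    foreach my $line ( <$fh> ) {
        print $line;
    }
    close $fh;

    open $fh, q{<}, 'lines.txt' or die $!;

    # Scalar context: one line per iteration, memory use stays flat
    while ( my $line = <$fh> ) {
        print $line;
    }
    close $fh;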