EBK has asked for the wisdom of the Perl Monks concerning the following question:
Hello Guys,
I need your help I have no idea where to start.
I want to remove duplicate based on specific column and value.
This file below. There are a lot of rows.
key_683,march,29031990,ABCD_FH,9,9
key_684,march,29031990,ABCD_FH,9,9
key_685,march,29031990,ABCD_FH,9,9
key_686,march,29031990,ABCD_FH,9,9
key_687,march,29031990,ABCD_FH,9,9
key_688,march,29031990,ABCD_FH,9,9
key_689,march,29031990,ABCD_FH,9,9
key_693,march,29031990,ABCD_FH,9,9
key_694,march,29031990,ABCD_FH,9,9
key_683,march,29031990,ABCD_H + ABC_F,8,17
key_684,march,29031990,ABCD_H + ABC_F,8,17
key_685,march,29031990,ABCD_H + ABC_F,8,17
key_686,march,29031990,ABCD_H + ABC_F,8,17
key_687,march,29031990,ABCD_H + ABC_F,8,17
key_688,march,29031990,ABCD_H + ABC_F,8,17
key_689,march,29031990,ABCD_H + ABC_F,8,17
key_690,march,29031990,ABCD_H + ABC_F,8,17
key_691,march,29031990,ABCD_H + ABC_F,8,17
key_692,march,29031990,ABCD_H + ABC_F,8,17
key_693,march,29031990,ABCD_H + ABC_F,8,17
key_694,march,29031990,ABCD_H + ABC_F,8,17
key_695,march,29031990,ABCD_H + ABC_F,8,17
key_696,march,29031990,ABCD_H + ABC_F,8,17
key_697,march,29031990,ABCD_H + ABC_F,8,17
key_698,march,29031990,ABCD_H + ABC_F,8,17
key_699,march,29031990,ABCD_H + ABC_F,8,17
key_683,march,29031990,ABC_H + AB_F,2,19
key_684,march,29031990,ABC_H + AB_F,2,19
key_685,march,29031990,ABC_H + AB_F,2,19
key_686,march,29031990,ABC_H + AB_F,2,19
key_687,march,29031990,ABC_H + AB_F,2,19
key_688,march,29031990,ABC_H + AB_F,2,19
key_689,march,29031990,ABC_H + AB_F,2,19
key_690,march,29031990,ABC_H + AB_F,2,19
key_691,march,29031990,ABC_H + AB_F,2,19
key_692,march,29031990,ABC_H + AB_F,2,19
key_693,march,29031990,ABC_H + AB_F,2,19
key_694,march,29031990,ABC_H + AB_F,2,19
key_695,march,29031990,ABC_H + AB_F,2,19
key_696,march,29031990,ABC_H + AB_F,2,19
key_697,march,29031990,ABC_H + AB_F,2,19
key_698,march,29031990,ABC_H + AB_F,2,19
key_699,march,29031990,ABC_H + AB_F,2,19
key_700,march,29031990,ABC_H + AB_F,2,19
key_701,march,29031990,ABC_H + AB_F,2,19
The column 0 there is a unique key and I have groups of keys.
I need to print the difference between them.
For example my group 1 is ABCD_FH.
So I want to print 9 positions of group ABCD_FH and reserved them. key_683 up to key_694.
key_683,march,ABCD_FH,9,9
key_684,march,ABCD_FH,9,9
key_685,march,ABCD_FH,9,9
key_686,march,ABCD_FH,9,9
key_687,march,ABCD_FH,9,9
key_688,march,ABCD_FH,9,9
key_689,march,ABCD_FH,9,9
key_693,march,ABCD_FH,9,9
key_694,march,ABCD_FH,9,9
Now is the magic, the group 1 I will use the keys I have reserved and print the leftovers from group 2 - ABCD_H + ABC_F and printing
key_690,march,ABCD_H + ABC_F,8,17
key_691,march,ABCD_H + ABC_F,8,17
key_692,march,ABCD_H + ABC_F,8,17
key_695,march,ABCD_H + ABC_F,8,17
key_696,march,ABCD_H + ABC_F,8,17
key_697,march,ABCD_H + ABC_F,8,17
key_698,march,ABCD_H + ABC_F,8,17
key_699,march,ABCD_H + ABC_F,8,17
And the group 3 the leftovers from group 1 and group 2,
key_700,march,ABC_H + AB_F,2,19
key_701,march,ABC_H + AB_F,2,19
So my final file will be.
key_683,march,ABCD_FH,9,9
key_684,march,ABCD_FH,9,9
key_685,march,ABCD_FH,9,9
key_686,march,ABCD_FH,9,9
key_687,march,ABCD_FH,9,9
key_688,march,ABCD_FH,9,9
key_689,march,ABCD_FH,9,9
key_693,march,ABCD_FH,9,9
key_694,march,ABCD_FH,9,9
key_690,march,ABCD_H + ABC_F,8,17
key_691,march,ABCD_H + ABC_F,8,17
key_692,march,ABCD_H + ABC_F,8,17
key_695,march,ABCD_H + ABC_F,8,17
key_696,march,ABCD_H + ABC_F,8,17
key_697,march,ABCD_H + ABC_F,8,17
key_698,march,ABCD_H + ABC_F,8,17
key_699,march,ABCD_H + ABC_F,8,17
key_700,march,ABC_H + AB_F,2,19
key_701,march,ABC_H + AB_F,2,19
Re: I want to remove duplicate based on specific column and value.
by hippo (Bishop) on May 25, 2018 at 09:02 UTC
|
With the same principle as brother Veltro, here is a take with some tests (whose number you can happily expand to your satisfaction). It uses Text::CSV_XS and thus allows for less strictly-formed inputs, should they arise. The final output of the normalised dataset is left as a simple exercise.
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More tests => 3;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new;
my %out;
while (my $colref = $csv->getline (*DATA)) {
next if exists $out{$colref->[0]};
$out{$colref->[0]} = [@$colref[1,3,4,5]];
}
is (scalar (keys %out), 19, 'Have correct number of output lines');
is_deeply ($out{key_683}, ['march', 'ABCD_FH', 9, 9], 'First line is g
+ood');
is_deeply ($out{key_701}, ['march', 'ABC_H + AB_F', 2, 19], 'Last line
+ is good');
__DATA__
key_683,march,29031990,ABCD_FH,9,9
key_684,march,29031990,ABCD_FH,9,9
key_685,march,29031990,ABCD_FH,9,9
key_686,march,29031990,ABCD_FH,9,9
key_687,march,29031990,ABCD_FH,9,9
key_688,march,29031990,ABCD_FH,9,9
key_689,march,29031990,ABCD_FH,9,9
key_693,march,29031990,ABCD_FH,9,9
key_694,march,29031990,ABCD_FH,9,9
key_683,march,29031990,ABCD_H + ABC_F,8,17
key_684,march,29031990,ABCD_H + ABC_F,8,17
key_685,march,29031990,ABCD_H + ABC_F,8,17
key_686,march,29031990,ABCD_H + ABC_F,8,17
key_687,march,29031990,ABCD_H + ABC_F,8,17
key_688,march,29031990,ABCD_H + ABC_F,8,17
key_689,march,29031990,ABCD_H + ABC_F,8,17
key_690,march,29031990,ABCD_H + ABC_F,8,17
key_691,march,29031990,ABCD_H + ABC_F,8,17
key_692,march,29031990,ABCD_H + ABC_F,8,17
key_693,march,29031990,ABCD_H + ABC_F,8,17
key_694,march,29031990,ABCD_H + ABC_F,8,17
key_695,march,29031990,ABCD_H + ABC_F,8,17
key_696,march,29031990,ABCD_H + ABC_F,8,17
key_697,march,29031990,ABCD_H + ABC_F,8,17
key_698,march,29031990,ABCD_H + ABC_F,8,17
key_699,march,29031990,ABCD_H + ABC_F,8,17
key_683,march,29031990,ABC_H + AB_F,2,19
key_684,march,29031990,ABC_H + AB_F,2,19
key_685,march,29031990,ABC_H + AB_F,2,19
key_686,march,29031990,ABC_H + AB_F,2,19
key_687,march,29031990,ABC_H + AB_F,2,19
key_688,march,29031990,ABC_H + AB_F,2,19
key_689,march,29031990,ABC_H + AB_F,2,19
key_690,march,29031990,ABC_H + AB_F,2,19
key_691,march,29031990,ABC_H + AB_F,2,19
key_692,march,29031990,ABC_H + AB_F,2,19
key_693,march,29031990,ABC_H + AB_F,2,19
key_694,march,29031990,ABC_H + AB_F,2,19
key_695,march,29031990,ABC_H + AB_F,2,19
key_696,march,29031990,ABC_H + AB_F,2,19
key_697,march,29031990,ABC_H + AB_F,2,19
key_698,march,29031990,ABC_H + AB_F,2,19
key_699,march,29031990,ABC_H + AB_F,2,19
key_700,march,29031990,ABC_H + AB_F,2,19
key_701,march,29031990,ABC_H + AB_F,2,19
| [reply] [d/l] |
Re: I want to remove duplicate based on specific column and value.
by Veltro (Hermit) on May 25, 2018 at 07:32 UTC
|
I created this for you so you have something you can start playing with. I think it already does what you want. It works by creating a hash with keys and it depends on the fact that a hash key has to be unique and you can use 'exists' to see if the key already exists.
use strict ;
use warnings ;
while(<DATA>) {
chomp ;
my @values = split(',', $_) ;
# print "values = @values\n" ;
processKey( @values ) ;
}
my $foundKeys = {} ;
sub processKey {
my $key = $_[0] ;
my $col2 = $_[1] ;
my $col3 = $_[2] ;
my $col4 = $_[3] ;
my $col5 = $_[4] ;
my $col6 = $_[5] ;
if ( exists $foundKeys->{ $key } ) {
# skip
} else {
$foundKeys->{ $key } = 1 ;
print "$key $col2 $col3 $col4 $col5 $col6\n" ;
}
}
__DATA__
key_683,march,29031990,ABCD_FH,9,9
key_684,march,29031990,ABCD_FH,9,9
key_685,march,29031990,ABCD_FH,9,9
key_686,march,29031990,ABCD_FH,9,9
key_687,march,29031990,ABCD_FH,9,9
key_688,march,29031990,ABCD_FH,9,9
key_689,march,29031990,ABCD_FH,9,9
key_693,march,29031990,ABCD_FH,9,9
key_694,march,29031990,ABCD_FH,9,9
key_683,march,29031990,ABCD_H + ABC_F,8,17
key_684,march,29031990,ABCD_H + ABC_F,8,17
key_685,march,29031990,ABCD_H + ABC_F,8,17
key_686,march,29031990,ABCD_H + ABC_F,8,17
key_687,march,29031990,ABCD_H + ABC_F,8,17
key_688,march,29031990,ABCD_H + ABC_F,8,17
key_689,march,29031990,ABCD_H + ABC_F,8,17
key_690,march,29031990,ABCD_H + ABC_F,8,17
key_691,march,29031990,ABCD_H + ABC_F,8,17
key_692,march,29031990,ABCD_H + ABC_F,8,17
key_693,march,29031990,ABCD_H + ABC_F,8,17
key_694,march,29031990,ABCD_H + ABC_F,8,17
key_695,march,29031990,ABCD_H + ABC_F,8,17
key_696,march,29031990,ABCD_H + ABC_F,8,17
key_697,march,29031990,ABCD_H + ABC_F,8,17
key_698,march,29031990,ABCD_H + ABC_F,8,17
key_699,march,29031990,ABCD_H + ABC_F,8,17
key_683,march,29031990,ABC_H + AB_F,2,19
key_684,march,29031990,ABC_H + AB_F,2,19
key_685,march,29031990,ABC_H + AB_F,2,19
key_686,march,29031990,ABC_H + AB_F,2,19
key_687,march,29031990,ABC_H + AB_F,2,19
key_688,march,29031990,ABC_H + AB_F,2,19
key_689,march,29031990,ABC_H + AB_F,2,19
key_690,march,29031990,ABC_H + AB_F,2,19
key_691,march,29031990,ABC_H + AB_F,2,19
key_692,march,29031990,ABC_H + AB_F,2,19
key_693,march,29031990,ABC_H + AB_F,2,19
key_694,march,29031990,ABC_H + AB_F,2,19
key_695,march,29031990,ABC_H + AB_F,2,19
key_696,march,29031990,ABC_H + AB_F,2,19
key_697,march,29031990,ABC_H + AB_F,2,19
key_698,march,29031990,ABC_H + AB_F,2,19
key_699,march,29031990,ABC_H + AB_F,2,19
key_700,march,29031990,ABC_H + AB_F,2,19
key_701,march,29031990,ABC_H + AB_F,2,19
| [reply] [d/l] |
Re: I want to remove duplicate based on specific column and value.
by mr_ron (Chaplain) on May 25, 2018 at 14:42 UTC
|
perl -M5.010 -Mstrict -F, -wanle '
BEGIN{ $" = "," }
our ($rpt, $non_key);
if ( ($non_key // "") eq "@F[1..5]"){
print; # this line not right - exercise left to reader
}
else {
$rpt = $F[4] -1;
$non_key = "@F[1..5]";
print
}
'
| [reply] [d/l] [select] |
|
|