PerlMonks
I want to remove duplicate based on specific column and value.

by EBK (Sexton)
on May 25, 2018 at 06:14 UTC [id://1215189]

EBK has asked for the wisdom of the Perl Monks concerning the following question:

Hello Guys, I need your help; I have no idea where to start. I want to remove duplicates based on a specific column and value, from the file below. There are a lot of rows.
key_683,march,29031990,ABCD_FH,9,9
key_684,march,29031990,ABCD_FH,9,9
key_685,march,29031990,ABCD_FH,9,9
key_686,march,29031990,ABCD_FH,9,9
key_687,march,29031990,ABCD_FH,9,9
key_688,march,29031990,ABCD_FH,9,9
key_689,march,29031990,ABCD_FH,9,9
key_693,march,29031990,ABCD_FH,9,9
key_694,march,29031990,ABCD_FH,9,9
key_683,march,29031990,ABCD_H + ABC_F,8,17
key_684,march,29031990,ABCD_H + ABC_F,8,17
key_685,march,29031990,ABCD_H + ABC_F,8,17
key_686,march,29031990,ABCD_H + ABC_F,8,17
key_687,march,29031990,ABCD_H + ABC_F,8,17
key_688,march,29031990,ABCD_H + ABC_F,8,17
key_689,march,29031990,ABCD_H + ABC_F,8,17
key_690,march,29031990,ABCD_H + ABC_F,8,17
key_691,march,29031990,ABCD_H + ABC_F,8,17
key_692,march,29031990,ABCD_H + ABC_F,8,17
key_693,march,29031990,ABCD_H + ABC_F,8,17
key_694,march,29031990,ABCD_H + ABC_F,8,17
key_695,march,29031990,ABCD_H + ABC_F,8,17
key_696,march,29031990,ABCD_H + ABC_F,8,17
key_697,march,29031990,ABCD_H + ABC_F,8,17
key_698,march,29031990,ABCD_H + ABC_F,8,17
key_699,march,29031990,ABCD_H + ABC_F,8,17
key_683,march,29031990,ABC_H + AB_F,2,19
key_684,march,29031990,ABC_H + AB_F,2,19
key_685,march,29031990,ABC_H + AB_F,2,19
key_686,march,29031990,ABC_H + AB_F,2,19
key_687,march,29031990,ABC_H + AB_F,2,19
key_688,march,29031990,ABC_H + AB_F,2,19
key_689,march,29031990,ABC_H + AB_F,2,19
key_690,march,29031990,ABC_H + AB_F,2,19
key_691,march,29031990,ABC_H + AB_F,2,19
key_692,march,29031990,ABC_H + AB_F,2,19
key_693,march,29031990,ABC_H + AB_F,2,19
key_694,march,29031990,ABC_H + AB_F,2,19
key_695,march,29031990,ABC_H + AB_F,2,19
key_696,march,29031990,ABC_H + AB_F,2,19
key_697,march,29031990,ABC_H + AB_F,2,19
key_698,march,29031990,ABC_H + AB_F,2,19
key_699,march,29031990,ABC_H + AB_F,2,19
key_700,march,29031990,ABC_H + AB_F,2,19
key_701,march,29031990,ABC_H + AB_F,2,19
Column 0 is a unique key, and I have groups of keys. I need to print the difference between them. For example, my group 1 is ABCD_FH, so I want to print the 9 positions of group ABCD_FH and reserve those keys, key_683 up to key_694:
key_683,march,ABCD_FH,9,9
key_684,march,ABCD_FH,9,9
key_685,march,ABCD_FH,9,9
key_686,march,ABCD_FH,9,9
key_687,march,ABCD_FH,9,9
key_688,march,ABCD_FH,9,9
key_689,march,ABCD_FH,9,9
key_693,march,ABCD_FH,9,9
key_694,march,ABCD_FH,9,9
Now for the magic: using the keys already reserved by group 1, I print only the leftovers from group 2 (ABCD_H + ABC_F):
key_690,march,ABCD_H + ABC_F,8,17
key_691,march,ABCD_H + ABC_F,8,17
key_692,march,ABCD_H + ABC_F,8,17
key_695,march,ABCD_H + ABC_F,8,17
key_696,march,ABCD_H + ABC_F,8,17
key_697,march,ABCD_H + ABC_F,8,17
key_698,march,ABCD_H + ABC_F,8,17
key_699,march,ABCD_H + ABC_F,8,17
And for group 3, the leftovers from groups 1 and 2:
key_700,march,ABC_H + AB_F,2,19
key_701,march,ABC_H + AB_F,2,19
So my final file will be:
key_683,march,ABCD_FH,9,9
key_684,march,ABCD_FH,9,9
key_685,march,ABCD_FH,9,9
key_686,march,ABCD_FH,9,9
key_687,march,ABCD_FH,9,9
key_688,march,ABCD_FH,9,9
key_689,march,ABCD_FH,9,9
key_693,march,ABCD_FH,9,9
key_694,march,ABCD_FH,9,9
key_690,march,ABCD_H + ABC_F,8,17
key_691,march,ABCD_H + ABC_F,8,17
key_692,march,ABCD_H + ABC_F,8,17
key_695,march,ABCD_H + ABC_F,8,17
key_696,march,ABCD_H + ABC_F,8,17
key_697,march,ABCD_H + ABC_F,8,17
key_698,march,ABCD_H + ABC_F,8,17
key_699,march,ABC_H + ABC_F,8,17
key_700,march,ABC_H + AB_F,2,19
key_701,march,ABC_H + AB_F,2,19
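Editor's note: because the groups appear in the file in priority order, the whole transformation above reduces to keeping only the first line seen for each key (column 0) and dropping the date column. A minimal sketch of that idea, using an abbreviated stand-in for the real input data:

```perl
#!/usr/bin/env perl
# Sketch only: groups appear in priority order, so the first line seen
# for each key is the one to keep; the date column (index 2) is dropped.
# The sample records below are an abbreviated stand-in for the real file.
use strict;
use warnings;

my $data = <<'CSV';
key_683,march,29031990,ABCD_FH,9,9
key_684,march,29031990,ABCD_FH,9,9
key_683,march,29031990,ABCD_H + ABC_F,8,17
key_690,march,29031990,ABCD_H + ABC_F,8,17
key_690,march,29031990,ABC_H + AB_F,2,19
key_700,march,29031990,ABC_H + AB_F,2,19
CSV

my (%seen, @out);
for my $line (split /\n/, $data) {
    my @f = split /,/, $line;                 # fields contain no commas here
    next if $seen{ $f[0] }++;                 # key already claimed by an earlier group
    push @out, join ',', @f[0, 1, 3, 4, 5];   # keep every field except the date
}
print "$_\n" for @out;
```

Run against the full file (e.g. via `while (<>)` instead of the heredoc), this prints the reserved group 1 keys first, then each later group's leftovers, exactly as described.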

Replies are listed 'Best First'.
Re: I want to remove duplicate based on specific column and value.
by hippo (Bishop) on May 25, 2018 at 09:02 UTC

    With the same principle as brother Veltro, here is a take with some tests (whose number you can happily expand to your satisfaction). It uses Text::CSV_XS and thus allows for less strictly-formed inputs, should they arise. The final output of the normalised dataset is left as a simple exercise.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Test::More tests => 3;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new;
    my %out;
    while (my $colref = $csv->getline (*DATA)) {
        next if exists $out{$colref->[0]};
        $out{$colref->[0]} = [@$colref[1,3,4,5]];
    }
    is (scalar (keys %out), 19, 'Have correct number of output lines');
    is_deeply ($out{key_683}, ['march', 'ABCD_FH', 9, 9], 'First line is good');
    is_deeply ($out{key_701}, ['march', 'ABC_H + AB_F', 2, 19], 'Last line is good');
    __DATA__
    key_683,march,29031990,ABCD_FH,9,9
    key_684,march,29031990,ABCD_FH,9,9
    key_685,march,29031990,ABCD_FH,9,9
    key_686,march,29031990,ABCD_FH,9,9
    key_687,march,29031990,ABCD_FH,9,9
    key_688,march,29031990,ABCD_FH,9,9
    key_689,march,29031990,ABCD_FH,9,9
    key_693,march,29031990,ABCD_FH,9,9
    key_694,march,29031990,ABCD_FH,9,9
    key_683,march,29031990,ABCD_H + ABC_F,8,17
    key_684,march,29031990,ABCD_H + ABC_F,8,17
    key_685,march,29031990,ABCD_H + ABC_F,8,17
    key_686,march,29031990,ABCD_H + ABC_F,8,17
    key_687,march,29031990,ABCD_H + ABC_F,8,17
    key_688,march,29031990,ABCD_H + ABC_F,8,17
    key_689,march,29031990,ABCD_H + ABC_F,8,17
    key_690,march,29031990,ABCD_H + ABC_F,8,17
    key_691,march,29031990,ABCD_H + ABC_F,8,17
    key_692,march,29031990,ABCD_H + ABC_F,8,17
    key_693,march,29031990,ABCD_H + ABC_F,8,17
    key_694,march,29031990,ABCD_H + ABC_F,8,17
    key_695,march,29031990,ABCD_H + ABC_F,8,17
    key_696,march,29031990,ABCD_H + ABC_F,8,17
    key_697,march,29031990,ABCD_H + ABC_F,8,17
    key_698,march,29031990,ABCD_H + ABC_F,8,17
    key_699,march,29031990,ABCD_H + ABC_F,8,17
    key_683,march,29031990,ABC_H + AB_F,2,19
    key_684,march,29031990,ABC_H + AB_F,2,19
    key_685,march,29031990,ABC_H + AB_F,2,19
    key_686,march,29031990,ABC_H + AB_F,2,19
    key_687,march,29031990,ABC_H + AB_F,2,19
    key_688,march,29031990,ABC_H + AB_F,2,19
    key_689,march,29031990,ABC_H + AB_F,2,19
    key_690,march,29031990,ABC_H + AB_F,2,19
    key_691,march,29031990,ABC_H + AB_F,2,19
    key_692,march,29031990,ABC_H + AB_F,2,19
    key_693,march,29031990,ABC_H + AB_F,2,19
    key_694,march,29031990,ABC_H + AB_F,2,19
    key_695,march,29031990,ABC_H + AB_F,2,19
    key_696,march,29031990,ABC_H + AB_F,2,19
    key_697,march,29031990,ABC_H + AB_F,2,19
    key_698,march,29031990,ABC_H + AB_F,2,19
    key_699,march,29031990,ABC_H + AB_F,2,19
    key_700,march,29031990,ABC_H + AB_F,2,19
    key_701,march,29031990,ABC_H + AB_F,2,19
Re: I want to remove duplicate based on specific column and value.
by Veltro (Hermit) on May 25, 2018 at 07:32 UTC

    I created this for you so you have something to start playing with. I think it already does what you want. It works by building a hash of keys: it relies on the fact that hash keys are unique, and uses 'exists' to check whether a key has already been seen.

    use strict ;
    use warnings ;

    my $foundKeys = {} ;   # declared before the loop so it is initialised before processKey runs

    while(<DATA>) {
        chomp ;
        my @values = split(',', $_) ;
        # print "values = @values\n" ;
        processKey( @values ) ;
    }

    sub processKey {
        my $key  = $_[0] ;
        my $col2 = $_[1] ;
        my $col3 = $_[2] ;
        my $col4 = $_[3] ;
        my $col5 = $_[4] ;
        my $col6 = $_[5] ;
        if ( exists $foundKeys->{ $key } ) {
            # skip
        }
        else {
            $foundKeys->{ $key } = 1 ;
            print "$key $col2 $col3 $col4 $col5 $col6\n" ;
        }
    }
    __DATA__
    key_683,march,29031990,ABCD_FH,9,9
    key_684,march,29031990,ABCD_FH,9,9
    key_685,march,29031990,ABCD_FH,9,9
    key_686,march,29031990,ABCD_FH,9,9
    key_687,march,29031990,ABCD_FH,9,9
    key_688,march,29031990,ABCD_FH,9,9
    key_689,march,29031990,ABCD_FH,9,9
    key_693,march,29031990,ABCD_FH,9,9
    key_694,march,29031990,ABCD_FH,9,9
    key_683,march,29031990,ABCD_H + ABC_F,8,17
    key_684,march,29031990,ABCD_H + ABC_F,8,17
    key_685,march,29031990,ABCD_H + ABC_F,8,17
    key_686,march,29031990,ABCD_H + ABC_F,8,17
    key_687,march,29031990,ABCD_H + ABC_F,8,17
    key_688,march,29031990,ABCD_H + ABC_F,8,17
    key_689,march,29031990,ABCD_H + ABC_F,8,17
    key_690,march,29031990,ABCD_H + ABC_F,8,17
    key_691,march,29031990,ABCD_H + ABC_F,8,17
    key_692,march,29031990,ABCD_H + ABC_F,8,17
    key_693,march,29031990,ABCD_H + ABC_F,8,17
    key_694,march,29031990,ABCD_H + ABC_F,8,17
    key_695,march,29031990,ABCD_H + ABC_F,8,17
    key_696,march,29031990,ABCD_H + ABC_F,8,17
    key_697,march,29031990,ABCD_H + ABC_F,8,17
    key_698,march,29031990,ABCD_H + ABC_F,8,17
    key_699,march,29031990,ABCD_H + ABC_F,8,17
    key_683,march,29031990,ABC_H + AB_F,2,19
    key_684,march,29031990,ABC_H + AB_F,2,19
    key_685,march,29031990,ABC_H + AB_F,2,19
    key_686,march,29031990,ABC_H + AB_F,2,19
    key_687,march,29031990,ABC_H + AB_F,2,19
    key_688,march,29031990,ABC_H + AB_F,2,19
    key_689,march,29031990,ABC_H + AB_F,2,19
    key_690,march,29031990,ABC_H + AB_F,2,19
    key_691,march,29031990,ABC_H + AB_F,2,19
    key_692,march,29031990,ABC_H + AB_F,2,19
    key_693,march,29031990,ABC_H + AB_F,2,19
    key_694,march,29031990,ABC_H + AB_F,2,19
    key_695,march,29031990,ABC_H + AB_F,2,19
    key_696,march,29031990,ABC_H + AB_F,2,19
    key_697,march,29031990,ABC_H + AB_F,2,19
    key_698,march,29031990,ABC_H + AB_F,2,19
    key_699,march,29031990,ABC_H + AB_F,2,19
    key_700,march,29031990,ABC_H + AB_F,2,19
    key_701,march,29031990,ABC_H + AB_F,2,19
Re: I want to remove duplicate based on specific column and value.
by mr_ron (Chaplain) on May 25, 2018 at 14:42 UTC

    This terse approach may hint at future possibilities for text file hackery:

    perl -M5.010 -Mstrict -F, -wanle '
        BEGIN{ $" = "," }
        our ($rpt, $non_key);
        if ( ($non_key // "") eq "@F[1..5]" ) {
            print;   # this line not right - exercise left to reader
        }
        else {
            $rpt = $F[4] - 1;
            $non_key = "@F[1..5]";
            print
        }
    '
    Ron

Approved by Corion