john.tm has asked for the wisdom of the Perl Monks concerning the following question:

I have a comma-separated file and wish to remove duplicate lines based on columns b, c and d, keeping the one with the lower-valued letter (in my case F, not S) in column e.
input
1,ken,james,smith,s
11,ken,james,smith,f
0,ken,james,smith,s

output
11,ken,james,smith,f
seek $ifh, 0, 0;
my @file = <$ifh>;
my @array;
my %hash;
foreach my $_ (reverse @file) {
    chomp;
    next if ! m/^\s+\d/;
    s/^\s+//g;
    s/\s+$//g;
    s/\s+/,/g;
    my $key = join ',', ( split /,/ )[ 1, 2, 3 ];    # remove duplicates column b,c,d
    #push @array, $_
    print $_, "\n" if ! $hash{$key}++;
}

Replies are listed 'Best First'.
Re: perl to remove duplacate based on columnb,d &
by blindluke (Hermit) on Jan 01, 2015 at 00:08 UTC

    You could try something like this:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my %uniq;

    while (<DATA>) {
        next unless /^\s*\d/;
        chomp;
        my $line = $_;
        my @f = split /,/, $line;
        my $key = $f[1].$f[2].$f[3];
        if ( exists $uniq{$key} ) {
            my $stored = ( split /,/, $uniq{$key})[4];
            my $new = $f[4];
            if ($new lt $stored) {
                $uniq{$key} = $line;
            }
        }
        else {
            $uniq{$key} = $line;
        }
    }

    print $_."\n" for (values %uniq);

    __DATA__
    1,ken,james,smith,s
    11,ken,james,smith,f
    0,ken,james,smith,s
    5,ken,arthur,wesson,g
    7,ken,arthur,wesson,a

    For the provided DATA section, it produces the following output:

    11,ken,james,smith,f
    7,ken,arthur,wesson,a

    Which should be the behavior you want.
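
    One caveat: values %uniq returns the lines in no guaranteed order, so the two output lines may come out swapped. Below is a minimal sketch that keeps the keys in first-seen order. It assumes the lines are already chomped, and the helper name dedupe_keep_lowest is made up for illustration. It also joins the key fields with a comma so that adjacent fields cannot run together.

```perl
use strict;
use warnings;

# Hypothetical helper: keep one line per (b,c,d) key, preferring the line
# whose column e compares lowest with 'lt'. Ties keep the first line seen.
sub dedupe_keep_lowest {
    my @lines = @_;
    my ( %best, @order );
    for my $line (@lines) {
        next unless $line =~ /^\s*\d/;    # skip lines not starting with a number
        my @f   = split /,/, $line;
        my $key = join ',', @f[ 1 .. 3 ];
        if ( !exists $best{$key} ) {
            push @order, $key;            # remember first appearance of the key
            $best{$key} = $line;
        }
        elsif ( $f[4] lt ( split /,/, $best{$key} )[4] ) {
            $best{$key} = $line;          # strictly lower letter wins
        }
    }
    return map { $best{$_} } @order;
}

print "$_\n" for dedupe_keep_lowest(
    '1,ken,james,smith,s',
    '11,ken,james,smith,f',
    '0,ken,james,smith,s',
    '5,ken,arthur,wesson,g',
    '7,ken,arthur,wesson,a',
);
```

    The extra @order array is the only real change: the hash still holds the best line per key, and the array records the order in which keys first appeared.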

    Consider looking at dedicated CSV modules, like Text::CSV_XS.
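
    A rough sketch of the same idea on top of Text::CSV_XS, assuming the module is installed from CPAN. The helper name lowest_per_key is hypothetical. The main gain over a plain split is that quoted fields containing commas are parsed correctly.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper: takes chomped CSV lines and returns one array ref of
# fields per (b,c,d) key, keeping the row whose column e sorts lowest.
# The rows come back in no particular order.
sub lowest_per_key {
    my @lines = @_;
    my $csv = Text::CSV_XS->new( { binary => 1 } );
    my %best;
    for my $line (@lines) {
        $csv->parse($line) or next;          # skip lines that are not valid CSV
        my @f = $csv->fields;
        next unless @f >= 5;                 # need columns a..e
        my $key = join "\0", @f[ 1 .. 3 ];   # NUL separator, assumed absent from the data
        if ( !exists $best{$key} || $f[4] lt $best{$key}[4] ) {
            $best{$key} = \@f;
        }
    }
    return values %best;
}

my $out = Text::CSV_XS->new( { binary => 1 } );
for my $row ( lowest_per_key( '1,"smith, ken",james,x,s', '11,"smith, ken",james,x,f' ) ) {
    $out->combine(@$row);                    # re-quote fields where needed
    print $out->string, "\n";
}
```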

    It's already 2015 in my time zone, and so I wish you all the best in 2015. May your code produce the output you desire, and your input be as you think it is.

    - Luke

Re: perl to remove duplacate based on columnb,d &
by Anonymous Monk on Jan 01, 2015 at 05:39 UTC

    Hi,

    What should happen if you don't get an 'f' line, or if you get 2?

    You may be told that this will never happen. This generally means that it will happen within a couple of weeks, at most.
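
    One way to find out before it bites is to audit the file first. Here is a minimal sketch, assuming 'f' is the preferred letter and the lines are already chomped. The helper name audit_groups is made up for illustration. It reports the keys that have no 'f' line at all and the keys that have more than one.

```perl
use strict;
use warnings;

# Hypothetical helper: count 'f' lines per (b,c,d) key and return two array
# refs: keys with no 'f' line, and keys with more than one 'f' line.
sub audit_groups {
    my @lines = @_;
    my %f_count;
    for my $line (@lines) {
        my @f = split /,/, $line;
        next unless @f >= 5;                  # need columns a..e
        my $key = join ',', @f[ 1 .. 3 ];
        $f_count{$key} //= 0;                 # register the key even without an 'f'
        $f_count{$key}++ if lc( $f[4] ) eq 'f';
    }
    my @missing = sort grep { $f_count{$_} == 0 } keys %f_count;
    my @dupes   = sort grep { $f_count{$_} > 1 } keys %f_count;
    return ( \@missing, \@dupes );
}

my ( $missing, $dupes ) = audit_groups(
    '1,ken,james,smith,s',
    '11,ken,james,smith,f',
    '12,ken,james,smith,f',
    '5,ken,arthur,wesson,s',
);
print "no 'f' line: $_\n"       for @$missing;
print "more than one 'f': $_\n" for @$dupes;
```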

    J.C.