tariqahsan has asked for the wisdom of the Perl Monks concerning the following question:

I have a data file with the following data -

123|abc
123|cde
234|efg
456|hij

I want to eliminate one of the duplicate values based on
the first column value

So, the output file should have the following lines -
123|abc
234|efg
456|hij

Any suggestions?


Replies are listed 'Best First'.
Re: how to remove duplicate based on the first column value
by arthas (Hermit) on Jun 10, 2003 at 15:03 UTC

    Try the following code. The output goes to STDOUT, but you should have no problem modifying it to write to a file if you need to. The script uses a %seen hash as a lookup table to avoid printing duplicates.

    #!/usr/bin/perl -Tw
    use strict;

    open(my $myfile, '<', './prova.txt') or die "Cannot open ./prova.txt: $!";
    my %seen;
    while (<$myfile>) {
        chomp;
        my ($c1, $c2) = split(/\|/);
        # Print only the first line seen for each first-column value.
        unless (defined $seen{$c1}) {
            print "$c1|$c2\n";
            $seen{$c1} = 1;
        }
    }
    close($myfile);
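    For the "output to a file" case mentioned above, here is a minimal sketch. The sub name dedupe_to_file and the output file name are made up for illustration; the idea is the same %seen lookup, with the lines printed to a second filehandle instead of STDOUT.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy $in_path to $out_path, keeping only the first
# line seen for each value of the first '|'-separated column.
sub dedupe_to_file {
    my ($in_path, $out_path) = @_;

    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";

    my %seen;
    while (my $line = <$in>) {
        # Limit of 2: only the first field is needed as the key.
        my ($key) = split /\|/, $line, 2;
        print {$out} $line unless $seen{$key}++;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}

# Example call; both file names are placeholders.
dedupe_to_file('./prova.txt', './prova.dedup.txt') if -e './prova.txt';
```

    The line is written back unchanged, so there is no need to chomp it and re-add the newline.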

    Hope this helps!

    Michele.

      Michele,
      Thanks! Your script works.
      - Tariq
Re: how to remove duplicate based on the first column value
by Enlil (Parson) on Jun 10, 2003 at 15:08 UTC
    Here's one way.
    use strict;
    use warnings;

    my %seen;
    while ( <DATA> ) {
        print unless $seen{(split /\|/)[0]}++;
    }

    __DATA__
    123|abc
    123|cde
    234|efg
    456|hij

    -enlil

      Which could also be done as the following one liner:
      perl -e' while(<>) {print unless $s{(split /\|/)[0]}++;}' < infile > outfile
      where infile is a file with the values to parse and outfile is where you want the results.

      Indeed:

      perl -i.bak -e' while(<>) {print unless $s{(split /\|/)[0]}++;}' infile
      will edit infile in situ (putting backup in infile.bak).

      Update: I have assumed that you are using a shell with redirection, such as bash. I have been told off for this sort of assumption before so best to make it clear.

      --tidiness is the memory loss of environmental mnemonics

        Another nice perl command line switch is -n. This builds the while(<>) loop for you. Example:

        perl -i.bak -ne 'print unless $s{(split /\|/)[0]}++' infile

        Note: more of these can be found in perlrun.
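
        To make the -n shortcut concrete, here is a rough sketch of the same filter written out as a script. perlrun describes -n as wrapping your code in a "LINE: while (<>) { ... }" loop; the sub name print_first_per_key and the @ARGV guard are my own additions, not something perl generates.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper sub so the loop can be called; with -n, perl
# generates the "LINE: while (<>) { ... }" part around the -e code.
sub print_first_per_key {
    my %s;    # first-column values already printed
    LINE:
    while (<>) {
        print unless $s{ (split /\|/)[0] }++;
    }
    return;
}

# Run only when file names are given, so the sketch does not wait on STDIN.
print_first_per_key() if @ARGV;
```

        Saved under a name of your choice and run with infile as its argument, this should behave like the -ne one-liner without the -i.bak in-place editing.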

Re: how to remove duplicate based on the first column value
by cees (Curate) on Jun 10, 2003 at 15:55 UTC
    my %seen;
    my @data = grep { chomp; not $seen{(split /\|/)[0]}++ } <DATA>;

    This solution will load the entire file in at once, so if you are using large files this would not be the most memory efficient solution.

    Whenever you need to remove elements from a list of items think grep.
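
    As a small illustration of the "think grep" advice, the filter can be wrapped in a sub that works on any list of lines. The name uniq_by_first_column is hypothetical, and the sample lines are the ones from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines whose first '|'-separated field
# has not appeared earlier in the list, preserving input order.
sub uniq_by_first_column {
    my %seen;
    return grep { !$seen{ (split /\|/)[0] }++ } @_;
}

my @lines  = ("123|abc\n", "123|cde\n", "234|efg\n", "456|hij\n");
my @unique = uniq_by_first_column(@lines);
print @unique;
```

    grep keeps the elements in their original order, so the first line for each key is the one that survives.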

Re: how to remove duplicate based on the first column value
by cbro (Pilgrim) on Jun 10, 2003 at 15:05 UTC
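    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my %hash;
    open(F, '<', 'testers.txt') or die "Cannot open testers.txt: $!";
    my @array = <F>;
    close(F);

    foreach (@array) {
        chomp;    # otherwise the value keeps its newline and prints a blank line
        my ($key, $value) = split(/\|/);
        next if (exists $hash{$key});
        $hash{$key} = $value;
    }

    # use this to verify (note: each() returns keys in no particular order)
    while (my ($fkey, $fval) = each %hash) {
        print "$fkey|$fval\n";
    }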
    I hope I didn't just do somebody's homework <g>
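
    One caveat with storing the rows in a hash: Perl hashes do not remember insertion order, so the lines may come back shuffled. A possible variation, sketched below, records the keys in an array as they first appear. The sub name first_values_in_order is made up, and skipping lines that have no '|' is my assumption about how to treat malformed input.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first value for each key and remember the
# order in which keys first appeared, since a plain hash has no stable order.
sub first_values_in_order {
    my (%value_for, @order);
    for my $line (@_) {
        chomp(my $copy = $line);
        my ($key, $value) = split /\|/, $copy, 2;
        # Assumption: lines without a '|' separator are skipped.
        next unless defined $key && defined $value;
        next if exists $value_for{$key};
        $value_for{$key} = $value;
        push @order, $key;
    }
    return map { "$_|$value_for{$_}\n" } @order;
}

my @lines  = ("123|abc\n", "123|cde\n", "234|efg\n", "456|hij\n");
my @unique = first_values_in_order(@lines);
print @unique;
```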
Re: how to remove duplicate based on the first column value
by perlguy (Deacon) on Jun 10, 2003 at 18:08 UTC

    You could also do it with substr():

    my %seen;
    # Key on the text before the first '|' rather than a fixed-width prefix.
    print join '', grep { !$seen{substr($_, 0, index($_, '|'))}++ } <DATA>;