Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all!
I want to know how I can select only duplicate entries in a text. For instance, assume the following text sample:
    protein1 stomach
    protein2 head
    protein3 muscle
    protein3 heart
    protein3 brain
    protein4 leg
    protein5 toes
    protein5 mouth
    protein6 ear
I want to print the proteins that appear only once to one file and, to another file, the proteins that appear twice, three times, and so on.
Any ideas?
Thank you!

Replies are listed 'Best First'.
Re: select only duplicate entries
by ikegami (Patriarch) on Aug 24, 2006 at 08:06 UTC

    What have you tried? You haven't demonstrated any effort at solving your own problems. (I presume "combine duplicate entries" was also posted by you.)

    A hash keyed by protein would be useful. The values would be lists of organs. You can use split to separate the protein from the organ.
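
    A minimal sketch of that idea, assuming one "protein organ" pair per input line. The helper name group_by_protein and the variable %organs_for are made up for illustration and are not from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: group organs by protein from "protein organ" lines.
sub group_by_protein {
    my @lines = @_;
    my %organs_for;
    for my $line (@lines) {
        # split ' ' splits on runs of whitespace and ignores leading whitespace
        my ($protein, $organ) = split ' ', $line;
        next unless defined $organ;    # skip blank or malformed lines
        push @{ $organs_for{$protein} }, $organ;
    }
    return \%organs_for;
}

my $organs_for = group_by_protein(
    "protein1 stomach",
    "protein3 muscle",
    "protein3 heart",
);

# The number of elements in each list is the number of times the protein appeared.
for my $protein (sort keys %$organs_for) {
    my @organs = @{ $organs_for->{$protein} };
    printf "%s appears %d time(s): %s\n",
        $protein, scalar @organs, join(', ', @organs);
}
```

    Once the hash is built, deciding which file a protein belongs in only requires checking the length of its list.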

Re: select only duplicate entries
by GrandFather (Saint) on Aug 24, 2006 at 08:30 UTC

    You may find the answers to "combine duplicate entries" helpful as a starting point. While building the hash, take note of the number of elements in the largest array. Then iterate from 1 to that number. In each iteration, use grep to pull out the arrays whose element count matches, and write their data to the file for that count.
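
    The steps above might be sketched roughly as follows. This is only an illustration: the helper name proteins_by_count is hypothetical, the hash is filled in by hand rather than read from a file, and the groups are printed to STDOUT where a real script would open one file per count (for example a name like "count_$n.txt", which is an assumption, not something from the thread):

```perl
use strict;
use warnings;

# Sample data from the question, already grouped as protein => [organs].
my %organs_for = (
    protein1 => ['stomach'],
    protein2 => ['head'],
    protein3 => [qw(muscle heart brain)],
    protein4 => ['leg'],
    protein5 => [qw(toes mouth)],
    protein6 => ['ear'],
);

# Hypothetical helper: returns { count => [proteins with that many entries] }.
sub proteins_by_count {
    my ($h) = @_;

    # Size of the largest array (computed after the fact here for brevity;
    # it could equally be tracked while the hash is being built).
    my $max = 0;
    for my $list (values %$h) {
        $max = @$list if @$list > $max;
    }

    my %by_count;
    for my $n (1 .. $max) {
        # grep keeps only the proteins whose list has exactly $n elements
        my @matching = grep { @{ $h->{$_} } == $n } sort keys %$h;
        $by_count{$n} = \@matching if @matching;
    }
    return \%by_count;
}

my $by_count = proteins_by_count(\%organs_for);
for my $n (sort { $a <=> $b } keys %$by_count) {
    print "== $n occurrence(s) ==\n";
    for my $protein (@{ $by_count->{$n} }) {
        print "$protein\t$_\n" for @{ $organs_for{$protein} };
    }
}
```

    Counts with no matching proteins are simply skipped, so no empty group is produced for them.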


    DWIM is Perl's answer to Gödel
Re: select only duplicate entries
by borisz (Canon) on Aug 24, 2006 at 08:55 UTC
    my %h;
    while ( defined ( $_ = <DATA> )){
        chomp;
        my ( $k, $v) = split ' ';
        push @{$h{$k}}, $v;
    }
    open my $fh1, '>', '/tmp/1.txt' or die;
    open my $fh2, '>', '/tmp/2.txt' or die;
    for my $k ( sort keys %h ) {
        my $c = @{$h{$k}};
        for ( @{$h{$k}}){
            $c > 1 ? print $fh2 "$k\t$_\n" : print $fh1 "$k\t$_\n";
        }
    }
    __DATA__
    protein1 stomach
    protein2 head
    protein3 muscle
    protein3 heart
    protein3 brain
    protein4 leg
    protein5 toes
    protein5 mouth
    protein6 ear
    Boris

      while ( defined ( $_ = <DATA> )){
      is equivalent to
      while ( <DATA> ){

      $c > 1 ? print $fh2 "$k\t$_\n" : print $fh1 "$k\t$_\n";
      is equivalent to
      print { $c == 1 ? $fh1 : $fh2 } "$k\t$_\n";
      or define
      my $fh = $c == 1 ? $fh1 : $fh2;
      outside the inner loop and print to $fh.
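
      A small sketch of the second form, for illustration only. In-memory filehandles stand in for the two output files (an assumption made so the example is self-contained), and the variable names $singles and $duplicates are invented here:

```perl
use strict;
use warnings;

# In-memory filehandles stand in for the two output files (demo assumption).
open my $fh1, '>', \my $singles    or die "Cannot open in-memory file: $!";
open my $fh2, '>', \my $duplicates or die "Cannot open in-memory file: $!";

my %h = (
    protein1 => ['stomach'],
    protein3 => [qw(muscle heart brain)],
);

for my $k (sort keys %h) {
    my $c = @{ $h{$k} };
    # Choose the handle once per key, outside the inner loop.
    my $fh = $c == 1 ? $fh1 : $fh2;
    # The block form print {...} takes any expression that yields a filehandle.
    print {$fh} "$k\t$_\n" for @{ $h{$k} };
}

close $fh1;
close $fh2;

print "singles:\n$singles";
print "duplicates:\n$duplicates";
```

      Picking the handle once per key keeps the inner loop down to a single print statement.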

        Thanks, I know. I try to keep it simple for the newbies.
        Boris
Re: select only duplicate entries
by Mandrake (Chaplain) on Aug 24, 2006 at 09:50 UTC
    Try
    #!/usr/bin/perl -w
    use strict;
    my %hash;
    (!/^$/) && (push @{$hash{(split /\s+/,$_)[0]}}, (split /\s+/,$_)[1]) while(<DATA>);
    open TMP1, '>duplicates.txt' or die;
    open TMP2, '>distinct.txt' or die;
    for my $key (keys %hash) {
        for (@{$hash{$key}}) {
            (@{$hash{$key}} > 1) ? print TMP1 "$key\t$_\n" : print TMP2 "$key\t$_\n" ;
        }
    }
    __DATA__
    protein1 stomach
    protein2 head
    protein3 muscle
    protein3 heart
    protein3 brain
    protein4 leg
    protein5 toes
    protein5 mouth
    protein6 ear
    Please refer to "combine duplicate entries" for similar solutions.
    Thanks.