perl_paduan has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have an input file like this:
chr1 12345 34567 gene1
chr1 12345 34567 gene2
I would like to merge duplicate lines whose columns 1, 2, and 3 are identical, concatenating their column 4 values. This is the output I would like to have:
chr1 12345 34567 gene1,gene2
Can you help me? Thanks!

Replies are listed 'Best First'.
Re: Duplicate removal
by toolic (Bishop) on Apr 07, 2010 at 16:26 UTC
    You could create a Hash-of-Arrays data structure. The hash keys could be a concatenation of the first three columns, and the hash values could be arrays of all the last-column values:
    use strict;
    use warnings;

    my %data;
    while (<DATA>) {
        my @cols = split;
        my $col3 = pop @cols;       # last column (the gene name)
        my $key  = "@cols";         # first 3 columns joined by spaces
        push @{ $data{$key} }, $col3;
    }
    for (keys %data) {
        print "$_ ", join(',', @{ $data{$_} }), "\n";
    }
    __DATA__
    chr1 12345 34567 gene1
    chr1 12345 34567 gene2

    Output:

    chr1 12345 34567 gene1,gene2
      Thanks and thanks to umasuresh and toolic!
      Both codes work perfectly!!!
Re: Duplicate removal
by umasuresh (Hermit) on Apr 07, 2010 at 16:15 UTC
    A simple solution:
    use strict;
    use warnings;

    my %chr_hash;
    while (<DATA>) {
        chomp;
        # split on whitespace; use split(/\t/, $_) if the file is strictly tab-delimited
        my ($chr, $start, $end, $gene) = split ' ', $_;
        my $chr_key = $chr . "_" . $start . "_" . $end;
        push @{ $chr_hash{$chr_key} }, $gene;
    }
    foreach my $key (keys %chr_hash) {
        my ($c, $s, $e) = split /_/, $key;
        print "$c\t$s\t$e\t", join(',', @{ $chr_hash{$key} }), "\n";
    }
    __DATA__
    chr1 12345 34567 gene1
    chr1 12345 34567 gene2
Re: Duplicate removal
by JavaFan (Canon) on Apr 07, 2010 at 15:53 UTC
    I'd use a hash if the file isn't too big. I'd use a database or a DBM file if the file is big.
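
    To illustrate the DBM-file idea (this is a minimal sketch, not code from the thread): a DBM file only stores flat strings, so instead of pushing onto an array you can append gene names to a comma-separated string keyed by the first three columns. The filename `genes` here is arbitrary; SDBM_File ships with core Perl.

    ```perl
    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    # Tie the hash to an on-disk DBM file so it is not limited by memory.
    tie my %data, 'SDBM_File', 'genes', O_RDWR | O_CREAT, 0666
        or die "Cannot tie DBM file: $!";

    while (<DATA>) {
        my ($chr, $start, $end, $gene) = split;
        my $key = "$chr $start $end";
        # DBM values are plain strings, so append with a comma separator.
        $data{$key} = defined $data{$key} ? "$data{$key},$gene" : $gene;
    }

    print "$_ $data{$_}\n" for keys %data;

    untie %data;
    __DATA__
    chr1 12345 34567 gene1
    chr1 12345 34567 gene2
    ```

    The tradeoff is that the intermediate data lives on disk, so the script can handle files far larger than available RAM, at the cost of slower per-record access than an in-memory hash.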

    I wonder, what have you tried so far?