perl_paduan has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have an input file like this:
chr1 12345 34567 gene1
chr1 12345 34567 gene2
I would like to merge duplicate lines whose columns 1, 2, and 3 are identical, concatenating their column 4 values. This is the output I would like to have:
chr1 12345 34567 gene1,gene2
Can you help me? Thanks!

Replies are listed 'Best First'.
Re: Duplicate removal
by toolic (Bishop) on Apr 07, 2010 at 16:26 UTC
    You could create a Hash-of-Arrays data structure. The hash keys could be a concatenation of the first three columns, and the hash values could be arrays of all the last-column values:
    use strict;
    use warnings;

    my %data;
    while (<DATA>) {
        my @cols = split;
        my $col3 = pop @cols;       # last column (the gene name)
        my $key  = "@cols";         # first 3 columns joined by spaces
        push @{ $data{$key} }, $col3;
    }
    for (keys %data) {
        print "$_ ", join(',', @{ $data{$_} }), "\n";
    }
    __DATA__
    chr1 12345 34567 gene1
    chr1 12345 34567 gene2

    Output:

    chr1 12345 34567 gene1,gene2
      Thanks and thanks to umasuresh and toolic!
      Both codes work perfectly!!!
Re: Duplicate removal
by umasuresh (Hermit) on Apr 07, 2010 at 16:15 UTC
    A simple solution:
    use strict;
    use warnings;

    my %chr_hash;
    while (<DATA>) {
        chomp;
        # split on whitespace; use split(/\t/, $_) if the file is strictly tab-delimited
        my ($chr, $start, $end, $gene) = split ' ', $_;
        my $chr_key = $chr . "_" . $start . "_" . $end;
        push @{ $chr_hash{$chr_key} }, $gene;
    }
    foreach my $key (keys %chr_hash) {
        my ($c, $s, $e) = split /_/, $key;
        print "$c\t$s\t$e\t", join(',', @{ $chr_hash{$key} }), "\n";
    }
    __DATA__
    chr1 12345 34567 gene1
    chr1 12345 34567 gene2
Re: Duplicate removal
by JavaFan (Canon) on Apr 07, 2010 at 15:53 UTC
    I'd use a hash if the file isn't too big. I'd use a database or a DBM file if the file is big.
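
    To illustrate the DBM-file idea (this is a minimal sketch, not code from the thread): a DBM file only stores flat strings, so instead of pushing onto an array you can append gene names to a comma-separated string keyed by the first three columns. The filename `genes` here is arbitrary; SDBM_File ships with core Perl.

    ```perl
    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    # Tie the hash to an on-disk DBM file so it is not limited by memory.
    tie my %data, 'SDBM_File', 'genes', O_RDWR | O_CREAT, 0666
        or die "Cannot tie DBM file: $!";

    while (<DATA>) {
        my ($chr, $start, $end, $gene) = split;
        my $key = "$chr $start $end";
        # DBM values are plain strings, so append with a comma separator.
        $data{$key} = defined $data{$key} ? "$data{$key},$gene" : $gene;
    }

    print "$_ $data{$_}\n" for keys %data;

    untie %data;
    __DATA__
    chr1 12345 34567 gene1
    chr1 12345 34567 gene2
    ```

    The tradeoff is that the intermediate data lives on disk, so the script can handle files far larger than available RAM, at the cost of slower per-record access than an in-memory hash.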

    I wonder, what have you tried so far?