How to group by a column and calculate max/min on another

perl_paduan has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a file like this:

chrX    2680092    2744539    XG    1
chrX    2680090    2744529    XG    2
chrX    2680080    2744519    XG    3
chrX    2680070    2744509    XG    4
chrX    2680070    2744509    DT    1
chrX    2680090    2744519    DT    2
[download]

I want to obtain as a result a file like this:

chrX    2680070    2744539    XG    4
chrX    2680070    2744519    DT    2
[download]

So basically I need to group by column 1 and 4, and obtain min value for column 2, max value for column 3 and max value for column 5. I've tried with this code:

#!/usr/bin/perl -w
use strict;
use List::Util qw(max);
use List::Util qw(min);

my $input0 = $ARGV[0];
open (DATA,$input0) || die "cannot open input0";

my %gene_hash;
while(<DATA>)
{
    chomp;
    my ($chr, $start, $end, $gene, $ex) = split(/\t/, $_);
    my $gene_key = $chr.":".$gene;
    
    push( @{ $gene_hash{$gene_key} }, $start ); 
    
    push( @{ $gene_hash{$gene_key} }, $end ); 
   
    push( @{ $gene_hash{$gene_key} }, $ex ); 

}



foreach my $key (keys %gene_hash)
{
    my ($c, $g) = split(/\:/, $key ); 
    print "$c\t$g\t";


   my $Low=min( @ {$gene_hash{$key} } );   
  
   my $High=max( @ {$gene_hash{$key} } );
  
   my $High_ex=max( @ {$gene_hash{$key} } );

   

    
    {
    
        print "$Low\t$High\t$High_ex";
    }
    
    print "\n";
}

__DATA__
[download]

but I don't know how to create different arrays for the hash so basically I push in the same array all the values... Obviously the result is a mess:

chrX    XG    1    2744539    2744539
chrX    DT    1    2744519    2744519
[download]

Can you help me?

Many thanks!

Comment on How to group by a column and calculate max/min on another Select or Download Code

Replies are listed 'Best First'.
Re: How to group by a column and calculate max/min on another by roboticus (Chancellor) on Aug 03, 2010 at 13:48 UTC
perl_paduan It appears that you want to have three lists for each hash key, but your code is intermingling the data into a single list. You could split the lists out by name like this: `push( @{ $gene_hash{$gene_key}{min_start} }, $start ); push( @{ $gene_hash{$gene_key}{max_end} }, $end ); push( @{ $gene_hash{$gene_key}{max_ex} }, $ex );` [download] Then, of course, you'd have to change your report loop a little: `my $Low=min( @ {$gene_hash{$key}{min_start} } ); my $High=max( @ {$gene_hash{$key}{max_end} } ); my $High_ex=max( @ {$gene_hash{$key}{max_ex} } );` [download] However, since your report appears to give only one line per CHR:GENE, I'd suggest doing your data aggregation while reading, rather than trying to do it in your output loop. Also, your data structure uses hardcoded array positions for different data elements, which will be confusing when you change your program. So I'd suggest using a HoH, where the second level of your hash contains the name of the item you're aggregating, something like this (untested): `while (<DATA>) { chomp; my ($chr, $start, $end, $gene, $ex) = split(/\t/, $_); my $gene_key = $chr.":".$gene; my $cur_gene = $gene_hash{$gene_key}; $$cur_gene{min_start} = $start unless $$cur_gene{min_start}<$start +; $$cur_gene{max_end} = $end unless $$cur_gene{max_end}>$end; $$cur_gene{max_ex} = $ex unless $$cur_gene{max_ex}>$ex; $gene_hash{$gene_key} = $cur_gene; }` [download] This code, as it reads each CHR:GENE combination will update the min_start, max_end and max_ex values. This way, at the end of your input loop, you already have the values you want, and you can simply retrieve your values in the output loop like this: `my $Low=$gene_hash{$key}{min_start}; my $High=$gene_hash{$key}{max_end}; my $High_ex=$gene_hash{$key}{max_ex};` [download] ...roboticus	[reply] [d/l] [select]
Re^2: How to group by a column and calculate max/min on another by perl_paduan (Initiate) on Aug 03, 2010 at 14:02 UTC
Many thanks roboticus your answer is very exhaustive! really helpful!	[reply]
Re: How to group by a column and calculate max/min on another by toolic (Bishop) on Aug 03, 2010 at 13:47 UTC
If you structure your data a little differently, it will be easier to extract what you need: use strict; use warnings; use List::Util qw(max min); my %gene_hash; while (<DATA>) { chomp; my ($chr, $start, $end, $gene, $ex) = split /\t/; push @{ $gene_hash{$chr}{$gene}{start} }, $start; push @{ $gene_hash{$chr}{$gene}{end } }, $end; push @{ $gene_hash{$chr}{$gene}{ex } }, $ex; } for my $chr (keys %gene_hash) { for my $gene (keys %{ $gene_hash{$chr} }) { my $Low = min( @ { $gene_hash{$chr}{$gene}{start} } ); my $High = max( @ { $gene_hash{$chr}{$gene}{end } } ); my $High_ex = max( @ { $gene_hash{$chr}{$gene}{ex } } ); print "$chr $Low $High $gene $High_ex\n"; } } __DATA__ chrX 2680092 2744539 XG 1 chrX 2680090 2744529 XG 2 chrX 2680080 2744519 XG 3 chrX 2680070 2744509 XG 4 chrX 2680070 2744509 DT 1 chrX 2680090 2744519 DT 2 [download] prints out: `chrX 2680070 2744539 XG 4 chrX 2680070 2744519 DT 2` [download]	[reply] [d/l] [select]
Re^2: How to group by a column and calculate max/min on another by perl_paduan (Initiate) on Aug 03, 2010 at 13:57 UTC
Thank you toolic! It works perfectly! Cheers, perl_paduan	[reply]
Re: How to group by a column and calculate max/min on another by jethro (Monsignor) on Aug 03, 2010 at 13:55 UTC
Thanks to autovivication you could just add a hash or array inbetween. Here for example a hash: `push( @{ $gene_hash{$gene_key}{'start'} }, $start ); push( @{ $gene_hash{$gene_key}{'end'} }, $end ); .... my $Lowstart=min( @{$gene_hash{$key}{'start'} } );` [download] By the way, if the data files are really big, it might be beneficial to find the minimums and maximums directly instead of first storing arrays of all the values. I.e. instead of pushing the value into an array while reading the file compare it to the last value stored there and if lower replace it. That way you store only three values per key instead of three arrays.	[reply] [d/l]
Re^2: How to group by a column and calculate max/min on another by perl_paduan (Initiate) on Aug 03, 2010 at 14:08 UTC
Thank you jethro!	[reply]
Re: How to group by a column and calculate max/min on another by kennethk (Abbot) on Aug 03, 2010 at 14:00 UTC
Your logic is not too far off. One way to deal with your issue with minor changes to code would be to add a layer of depth to hash of arrays: #!/usr/bin/perl -w use strict; use List::Util qw(max); use List::Util qw(min); #my $input0 = $ARGV[0]; #open (DATA,$input0) \|\| die "cannot open input0"; my %gene_hash; while(<DATA>) { chomp; my ($chr, $start, $end, $gene, $ex) = split(/\t/, $_); my $gene_key = $chr.":".$gene; push( @{ $gene_hash{$gene_key}{start} }, $start ); push( @{ $gene_hash{$gene_key}{end} }, $end ); push( @{ $gene_hash{$gene_key}{ex} }, $ex ); } foreach my $key (keys %gene_hash) { my ($c, $g) = split(/\:/, $key ); print "$c\t$g\t"; my $Low=min( @ {$gene_hash{$key}{start} } ); my $High=max( @ {$gene_hash{$key}{end} } ); my $High_ex=max( @ {$gene_hash{$key}{ex} } ); { print "$Low\t$High\t$High_ex"; } print "\n"; } __DATA__ chrX 2680092 2744539 XG 1 chrX 2680090 2744529 XG 2 chrX 2680080 2744519 XG 3 chrX 2680070 2744509 XG 4 chrX 2680070 2744509 DT 1 chrX 2680090 2744519 DT 2 [download] If the modification is unclear, you can use Data::Dumper to output the resultant structure by adding the following to the end of your script: `use Data::Dumper; print Dumper \%gene_hash;` [download] A couple of minor things you may consider in addition: You should probably get into the habit of using 3-argument open instead of 2-argument open; the difference is explained in perlopentut. You might also consider swapping to Indirect Filehandles. This can become important in larger projects. split acts on `$_` if no argument is given, so you could change that call on line 13 to `= split(/\t/);` You delimit your keys with `':'`; if you are going to create an amalgam key, you should use a character that is guaranteed not to appear in your file - might I suggest `"\t"`? That way you don't have to split it again for output.	[reply] [d/l] [select]
Re^2: How to group by a column and calculate max/min on another by perl_paduan (Initiate) on Aug 03, 2010 at 14:06 UTC
Thanks and thanks kennethk! I will try to apply all your suggestions to my scripting	[reply]
Re: How to group by a column and calculate max/min on another by suhailck (Friar) on Aug 03, 2010 at 14:51 UTC
This is my attempt, use strict; use List::Util qw(min max); my %hash=(); while(<DATA>) { my @fields=split /\s+/; my $i=0; push @{${$hash{$fields[0]}{$fields[3]}}->[$i++]},$_ for @fields[1,2,4] +; } foreach my $hkey1 (keys %hash) { foreach my $hkey2 (keys %{$hash{$hkey1}}) { print join "\t",$hkey1,$hkey2,min(@{${$hash{$hkey1}{$hkey2}}-> +[0]}),max(@{${$hash{$hkey1}{$hkey2}}->[1]}),max(@{${$hash{$hkey1}{$hk +ey2}}->[2]}),"\n";; } } __END__ chrX 2680092 2744539 XG 1 chrX 2680090 2744529 XG 2 chrX 2680080 2744519 XG 3 chrX 2680070 2744509 XG 4 chrX 2680070 2744509 DT 1 chrX 2680090 2744519 DT 2 [download] ~suhail	[reply] [d/l]