perl_paduan has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a file like this:
chrX 2680092 2744539 XG 1 chrX 2680090 2744529 XG 2 chrX 2680080 2744519 XG 3 chrX 2680070 2744509 XG 4 chrX 2680070 2744509 DT 1 chrX 2680090 2744519 DT 2
I want to obtain as a result a file like this:
chrX 2680070 2744539 XG 4 chrX 2680070 2744519 DT 2
So basically I need to group by column 1 and 4, and obtain min value for column 2, max value for column 3 and max value for column 5. I've tried with this code:
#!/usr/bin/perl -w use strict; use List::Util qw(max); use List::Util qw(min); my $input0 = $ARGV[0]; open (DATA,$input0) || die "cannot open input0"; my %gene_hash; while(<DATA>) { chomp; my ($chr, $start, $end, $gene, $ex) = split(/\t/, $_); my $gene_key = $chr.":".$gene; push( @{ $gene_hash{$gene_key} }, $start ); push( @{ $gene_hash{$gene_key} }, $end ); push( @{ $gene_hash{$gene_key} }, $ex ); } foreach my $key (keys %gene_hash) { my ($c, $g) = split(/\:/, $key ); print "$c\t$g\t"; my $Low=min( @ {$gene_hash{$key} } ); my $High=max( @ {$gene_hash{$key} } ); my $High_ex=max( @ {$gene_hash{$key} } ); { print "$Low\t$High\t$High_ex"; } print "\n"; } __DATA__
but I don't know how to create different arrays for the hash so basically I push in the same array all the values... Obviously the result is a mess:
chrX XG 1 2744539 2744539 chrX DT 1 2744519 2744519
Can you help me?

Many thanks!

Replies are listed 'Best First'.
Re: How to group by a column and calculate max/min on another
by roboticus (Chancellor) on Aug 03, 2010 at 13:48 UTC

    perl_paduan

    It appears that you want to have three lists for each hash key, but your code is intermingling the data into a single list. You could split the lists out by name like this:

    push( @{ $gene_hash{$gene_key}{min_start} }, $start ); push( @{ $gene_hash{$gene_key}{max_end} }, $end ); push( @{ $gene_hash{$gene_key}{max_ex} }, $ex );

    Then, of course, you'd have to change your report loop a little:

    my $Low=min( @ {$gene_hash{$key}{min_start} } ); my $High=max( @ {$gene_hash{$key}{max_end} } ); my $High_ex=max( @ {$gene_hash{$key}{max_ex} } );

    However, since your report appears to give only one line per CHR:GENE, I'd suggest doing your data aggregation while reading, rather than trying to do it in your output loop. Also, your data structure uses hardcoded array positions for different data elements, which will be confusing when you change your program. So I'd suggest using a HoH, where the second level of your hash contains the name of the item you're aggregating, something like this (untested):

    while (<DATA>) { chomp; my ($chr, $start, $end, $gene, $ex) = split(/\t/, $_); my $gene_key = $chr.":".$gene; my $cur_gene = $gene_hash{$gene_key}; $$cur_gene{min_start} = $start unless $$cur_gene{min_start}<$start +; $$cur_gene{max_end} = $end unless $$cur_gene{max_end}>$end; $$cur_gene{max_ex} = $ex unless $$cur_gene{max_ex}>$ex; $gene_hash{$gene_key} = $cur_gene; }

    This code, as it reads each CHR:GENE combination will update the min_start, max_end and max_ex values. This way, at the end of your input loop, you already have the values you want, and you can simply retrieve your values in the output loop like this:

    my $Low=$gene_hash{$key}{min_start}; my $High=$gene_hash{$key}{max_end}; my $High_ex=$gene_hash{$key}{max_ex};

    ...roboticus

      Many thanks roboticus

      your answer is very exhaustive!

      really helpful!
Re: How to group by a column and calculate max/min on another
by toolic (Bishop) on Aug 03, 2010 at 13:47 UTC
    If you structure your data a little differently, it will be easier to extract what you need:
    use strict; use warnings; use List::Util qw(max min); my %gene_hash; while (<DATA>) { chomp; my ($chr, $start, $end, $gene, $ex) = split /\t/; push @{ $gene_hash{$chr}{$gene}{start} }, $start; push @{ $gene_hash{$chr}{$gene}{end } }, $end; push @{ $gene_hash{$chr}{$gene}{ex } }, $ex; } for my $chr (keys %gene_hash) { for my $gene (keys %{ $gene_hash{$chr} }) { my $Low = min( @ { $gene_hash{$chr}{$gene}{start} } ); my $High = max( @ { $gene_hash{$chr}{$gene}{end } } ); my $High_ex = max( @ { $gene_hash{$chr}{$gene}{ex } } ); print "$chr $Low $High $gene $High_ex\n"; } } __DATA__ chrX 2680092 2744539 XG 1 chrX 2680090 2744529 XG 2 chrX 2680080 2744519 XG 3 chrX 2680070 2744509 XG 4 chrX 2680070 2744509 DT 1 chrX 2680090 2744519 DT 2

    prints out:

    chrX 2680070 2744539 XG 4 chrX 2680070 2744519 DT 2
      Thank you toolic! It works perfectly!

      Cheers,

      perl_paduan

Re: How to group by a column and calculate max/min on another
by jethro (Monsignor) on Aug 03, 2010 at 13:55 UTC

    Thanks to autovivication you could just add a hash or array inbetween. Here for example a hash:

    push( @{ $gene_hash{$gene_key}{'start'} }, $start ); push( @{ $gene_hash{$gene_key}{'end'} }, $end ); .... my $Lowstart=min( @{$gene_hash{$key}{'start'} } );

    By the way, if the data files are really big, it might be beneficial to find the minimums and maximums directly instead of first storing arrays of all the values. I.e. instead of pushing the value into an array while reading the file compare it to the last value stored there and if lower replace it. That way you store only three values per key instead of three arrays.

      Thank you jethro!
Re: How to group by a column and calculate max/min on another
by kennethk (Abbot) on Aug 03, 2010 at 14:00 UTC
    Your logic is not too far off. One way to deal with your issue with minor changes to code would be to add a layer of depth to hash of arrays:

    #!/usr/bin/perl -w use strict; use List::Util qw(max); use List::Util qw(min); #my $input0 = $ARGV[0]; #open (DATA,$input0) || die "cannot open input0"; my %gene_hash; while(<DATA>) { chomp; my ($chr, $start, $end, $gene, $ex) = split(/\t/, $_); my $gene_key = $chr.":".$gene; push( @{ $gene_hash{$gene_key}{start} }, $start ); push( @{ $gene_hash{$gene_key}{end} }, $end ); push( @{ $gene_hash{$gene_key}{ex} }, $ex ); } foreach my $key (keys %gene_hash) { my ($c, $g) = split(/\:/, $key ); print "$c\t$g\t"; my $Low=min( @ {$gene_hash{$key}{start} } ); my $High=max( @ {$gene_hash{$key}{end} } ); my $High_ex=max( @ {$gene_hash{$key}{ex} } ); { print "$Low\t$High\t$High_ex"; } print "\n"; } __DATA__ chrX 2680092 2744539 XG 1 chrX 2680090 2744529 XG 2 chrX 2680080 2744519 XG 3 chrX 2680070 2744509 XG 4 chrX 2680070 2744509 DT 1 chrX 2680090 2744519 DT 2
    If the modification is unclear, you can use Data::Dumper to output the resultant structure by adding the following to the end of your script:

    use Data::Dumper; print Dumper \%gene_hash;

    A couple of minor things you may consider in addition:

    1. You should probably get into the habit of using 3-argument open instead of 2-argument open; the difference is explained in perlopentut.
    2. You might also consider swapping to Indirect Filehandles. This can become important in larger projects.
    3. split acts on $_ if no argument is given, so you could change that call on line 13 to = split(/\t/);
    4. You delimit your keys with ':'; if you are going to create an amalgam key, you should use a character that is guaranteed not to appear in your file - might I suggest "\t"? That way you don't have to split it again for output.
      Thanks and thanks kennethk!

      I will try to apply all your suggestions to my scripting

Re: How to group by a column and calculate max/min on another
by suhailck (Friar) on Aug 03, 2010 at 14:51 UTC
    This is my attempt,

    use strict; use List::Util qw(min max); my %hash=(); while(<DATA>) { my @fields=split /\s+/; my $i=0; push @{${$hash{$fields[0]}{$fields[3]}}->[$i++]},$_ for @fields[1,2,4] +; } foreach my $hkey1 (keys %hash) { foreach my $hkey2 (keys %{$hash{$hkey1}}) { print join "\t",$hkey1,$hkey2,min(@{${$hash{$hkey1}{$hkey2}}-> +[0]}),max(@{${$hash{$hkey1}{$hkey2}}->[1]}),max(@{${$hash{$hkey1}{$hk +ey2}}->[2]}),"\n";; } } __END__ chrX 2680092 2744539 XG 1 chrX 2680090 2744529 XG 2 chrX 2680080 2744519 XG 3 chrX 2680070 2744509 XG 4 chrX 2680070 2744509 DT 1 chrX 2680090 2744519 DT 2


    ~suhail