umasuresh has asked for the wisdom of the Perl Monks concerning the following question:


Hi Monks,
In a recent data munging challenge, I had to calculate the average score for each unique key with multiple scores, and output a tab-delimited file with the unique key, the number of occurrences, and the average score. The input file is similar to the one listed below, with the unique key in column 1 and the scores in column 2.
-------inp.txt--------
1	196
1	190
1	200
2	20
3	25
3	19
3	39
4	40
4	41
4	45
Below is the code that does this. My questions are as follows:
1. Is there a simpler solution to this task? Will there be a memory problem with this code if the input file contains close to 300,000 lines?

2. Is there a one-liner that can achieve the same?
use strict;
use warnings;

my $in_file = 'inp.txt';
open(IN, $in_file) || die "couldn't open $in_file: $!\n";

my %r;
while (<IN>) {
    chomp;
    my $line = $_;
    my @f = split(/\t/, $line);
    push(@{ $r{$f[0]} }, $f[1]);
}
close IN;

foreach my $key (sort keys %r) {
    my @scores = @{ $r{$key} };
    my $count  = scalar @scores;
    my $sum    = 0;
    for my $i (0 .. $#scores) {
        $sum += $scores[$i];
    }
    my $avg = $sum / $count;
    print "$key\t$count\t$avg\n";
}
-US

Replies are listed 'Best First'.
Re: Data munging
by ikegami (Patriarch) on Jan 22, 2010 at 00:41 UTC

    The general idea is right, but you could simplify your average calculation.

    use List::Util qw( sum );

    my %data;
    while (<>) {
        chomp;
        my ($k, $v) = split /\t/;
        push @{ $data{$k} }, $v;
    }

    local $, = "\t";
    local $\ = "\n";

    for my $k (keys %data) {
        my $data = $data{$k};
        print $k, 0+@$data, sum(@$data)/@$data;
    }
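
    (A usage sketch, assuming the script above is saved as, say, avg.pl — the file name is illustrative. With the sample inp.txt from the original post it should print the following, though the line order may vary because keys %data is unsorted.)

    $ perl avg.pl inp.txt
    1	3	195.333333333333
    2	1	20
    3	3	27.6666666666667
    4	3	42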

    Memory usage shouldn't be a problem with 300,000 lines, but you could reduce it further by summing and counting the elements as you go along.

    my %data;
    while (<>) {
        chomp;
        my ($k, $v) = split /\t/;
        $data{$k}[0]++;
        $data{$k}[1] += $v;
    }

    local $, = "\t";
    local $\ = "\n";

    for my $k (keys %data) {
        my $data = $data{$k};
        print $k, $data->[0], $data->[1]/$data->[0];
    }

    If the keys are sorted (or at least grouped) in the input, you could reduce memory usage to something constant.

    my $last;
    my $sum   = 0;
    my $count = 0;

    local $, = "\t";
    local $\ = "\n";

    while (<>) {
        chomp;
        my ($k, $v) = split /\t/;
        if (defined($last) && $k ne $last) {
            print $last, $count, $sum/$count;
            ($count, $sum) = (0, 0);
        }
        $last = $k;
        $count++;
        $sum += $v;
    }

    if (defined($last)) {
        print $last, $count, $sum/$count;
    }

    As a one-liner, how about

    perl -lane'$d{$F[0]}[0]++; $d{$F[0]}[1] += $F[1]; }{ $, = "\t"; print $_, $d{$_}[0], $d{$_}[1]/$d{$_}[0] for keys %d;'

    It can be shortened, but making it any simpler would hurt readability.
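
    (An aside, in case the `}{` idiom is opaque: under -n, perl wraps the program in a while (<>) loop, so `}{` closes that loop body and opens a bare block that runs once after the last line of input. A rough sketch of what the one-liner above expands to — an illustration, not a byte-for-byte deparse:)

    $\ = "\n";                  # from -l: print appends a newline
    while (<>) {
        chomp;                  # from -l: strip the input newline
        our @F = split ' ';     # from -a: autosplit each line into @F
        $d{$F[0]}[0]++;         # per-key line count
        $d{$F[0]}[1] += $F[1];  # per-key running sum
    }
    {
        # the "}{" trick: this bare block runs once, after the loop
        $, = "\t";
        print $_, $d{$_}[0], $d{$_}[1]/$d{$_}[0] for keys %d;
    }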

    Update: I wasn't printing out the count. Fixed.


      Thanks much, ikegami. The first two options are awesome; I am still trying to wrap my brain around the third!
        It counts lines and maintains a sum of the values seen to date. When the key changes, it prints the average, then resets the line count and the sum.
        1 196  ->  count = 1  sum = 196
        1 190  ->  count = 2  sum = 196+190
        1 200  ->  count = 3  sum = 196+190+200
        key changed, so print average, and reset count and sum
        2 20   ->  count = 1  sum = 20
        key changed, so print average, and reset count and sum
        3 25   ->  count = 1  sum = 25
        3 19   ->  count = 2  sum = 25+19
        3 39   ->  count = 3  sum = 25+19+39
        key changed, so print average, and reset count and sum
        4 40   ->  count = 1  sum = 40
        4 41   ->  count = 2  sum = 40+41
        4 45   ->  count = 3  sum = 40+41+45
        eof, so print average
Re: Data munging
by toolic (Bishop) on Jan 22, 2010 at 00:49 UTC
    Accumulate the data as you read in the file:
    use warnings;
    use strict;

    my %r;
    while (<DATA>) {
        chomp;
        my ($k, $s) = split;
        $r{$k}{count}++;
        $r{$k}{sum} += $s;
    }

    for my $key (sort keys %r) {
        print "$key\t$r{$key}{count}\t", ($r{$key}{sum}/$r{$key}{count}), "\n";
    }

    __DATA__
    1 196
    1 190
    1 200
    2 20
    3 25
    3 19
    3 39
    4 40
    4 41
    4 45
Re: Data munging
by BrowserUk (Patriarch) on Jan 22, 2010 at 00:51 UTC

    Code wrapped for clarity:

    perl -anle"++$h{$F[0]}[0];$h{$F[0]}[1]+=$F[1]} {
      print qq[$_\t$h{$_}[0]\t],$h{$_}[1]/$h{$_}[0] for keys %h" munge.txt

    4	3	42
    1	3	195.333333333333
    3	3	27.6666666666667
    2	1	20

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi Monks, thanks much for all the valuable suggestions. I have another challenge at hand and would greatly appreciate any input. I have to compare a query text file against a reference text file and do the following:

      1. match the common keys, whose multiple values represent fragments of each key.
      2. for each key, compare the start and end columns of each fragment (column1 & column2).
      3. for each key, if the fragments overlap between the query and reference, keep count of the overlaps (a sketch of the overlap test follows this list).
      4. report the query sequence and insert the overlap count in column3, retaining all the other columns.
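
      (As a point of reference for step 3: two intervals overlap exactly when each one starts no later than the other ends, so the pairwise test can be a single comparison rather than four clauses. A minimal sketch, with illustrative names:)

      # Two closed intervals [$q_start, $q_end] and [$r_start, $r_end]
      # overlap if and only if each starts no later than the other ends.
      sub overlaps {
          my ($q_start, $q_end, $r_start, $r_end) = @_;
          return $q_start <= $r_end && $r_start <= $q_end;
      }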

      Note: The reference and query files are really large, running to 300,000 lines or more.
      This post is really long, so please bear with me! My questions are below:
      1. I feel that the code can be much simpler than what I have!
      2. Again, memory usage is a concern for large files!
      Here are the reference and query text files:
      ------ref.txt---------
      c1	120	134	-	AG
      c2	120	134	+	TC
      c1	130	300	-	AA
      c1	12	13000	-	AU
      c9	14000	14008	-	GN
      c9	900	1200	-	GX
      c10	10040	10050	+	GG
      c1	19992	20005	-	GG
      c1	12000	14000	+	TT
      -----query.txt--------
      c1	100	12000	+	AT
      c1	19800	20000	-	AG
      c1	20049	20800	-	GC
      c9	10078	14008	-	AG
      c11	10078	14008	-	TG
      c15	10078	14008	-	TC
      c9	1078	10008	-	TA
      c10	10080	10000	-	TT
      ------code-------------
      use strict;
      use warnings;
      use Data::Dumper;

      # if (@ARGV < 1)
      # {
      #     print "Usage: $0 inputDir \n"; #e.g ./
      #     exit;
      # }

      my $file1 = "ref.txt";
      my $file2 = "query.txt";
      # my($key1, $key2, %hash1, %hash2);

      open (IN1, '<'.$file1) || die "***can't open the file $!\n";
      my @lines1 = <IN1>;
      close IN1;

      #$i=0;
      for (@lines1) {
          chomp;
          my @a1 = split(/\t/, $_);
          my $key1 = $a1[0];
          my $rs = $a1[1];
          my $re = $a1[2];
          #push(@{ $hash1->{$key1} }, "$rs\t$re");
          push(@{ $hash1->{$key1} }, $_);
      }

      open (IN2, '<'.$file2) || die "***can't open the file $!\n";
      my @lines2 = <IN2>;
      close IN1;

      for (@lines2) {
          chomp;
          my @a2 = split(/\t/, $_);
          my $qs = $a2[1];
          my $qe = $a2[2];
          my $key2 = $a2[0];
          #push(@{ $hash2->{$key2} }, "$qs\t$qe");
          push(@{ $hash2->{$key2} }, $_);
      }
      #print Dumper(\%$hash2);

      @common_keys = grep { exists $hash1->{$_} } sort keys %$hash2;

      my %seen;
      for (sort @common_keys) {
          for my $r (0..$#{ $hash1->{$_} }) {
              for my $q (0..$#{ $hash2->{$_} }) {
                  my ($query_key, $query_start, $query_end, @qtail) = split(/\t/, $hash2->{$_}[$q]);
                  my ($ref_key, $ref_start, $ref_end, @rtail) = split(/\t/, $hash1->{$_}[$r]);
                  if (   ($query_start >= $ref_start   && $query_start <= $ref_end)
                      || ($query_end   >= $ref_start   && $query_end   <= $ref_end)
                      || ($ref_start   >= $query_start && $ref_start   <= $query_end)
                      || ($ref_end     >= $query_start && $ref_end     <= $query_end) )
                  {
                      $seen{$_}{$query_start}++;
                  }
              }
          }
      }
      #print Dumper(\%seen);

      my $overlap_count;
      for my $key (sort keys %$hash2) {
          for my $i (0..$#{ $hash2->{$key} }) {
              my @s = split(/\t/, $hash2->{$key}[$i]);
              my @head = @s[0..2];
              my @tail = @s[3..$#s];
              #print "***$tail[0]\n";
              my $start = $s[1];
              #print "***$start\n";
              if (exists $seen{$key}{$start}) {
                  $overlap_count = $seen{$key}{$start};
                  #print "$hash2->{$key}[$i]\t$overlap_count\n";
                  print map {"$_\t"} insert_field(\@head, \$overlap_count, \@tail);
                  print "\n";
              }
              else {
                  $overlap_count = 0;
                  #print "$hash2->{$key}[$i]\t$overlap_count\n";
                  print map {"$_\t"} insert_field(\@head, \$overlap_count, \@tail);
                  print "\n";
              }
              #print "\n";
          }
      }

        Why is the code you've posted so completely broken?

        • Where is the subroutine insert_field()?
        • Why are %hash1 & %hash2 declared as hashes, but used throughout as references: $hash1->{ ... }?
        • Why is @common_keys never declared?

        Did you think that you could fool us by sticking use strict; use warnings; at the top and declaring some of the variables with my, and so get us to solve the problems that you've failed to solve yourself?

        Tip: If you use strict and warnings from the get-go, you'll find it far easier to get your code right as you go, rather than painting yourself into a corner of undeclared and uninitialised variables.

        I have a 25-line solution to what I interpret your requirements to be. It produces this output from your samples:

        C:\test>819005
        c1 100 12000 11901 + AT
        c1 100 12000 11901 + AT
        c1 100 12000 12989 + AT
        c1 100 12000 13901 + AT
        c1 19800 20000 206 - AG
        c9 10078 14008 3931 - AG
        c9 1078 10008 9109 - TA

        But I cannot check whether that is correct because your code doesn't produce anything for me to compare it against :(


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.