in reply to Data munging

The general idea is right, but you could simplify your average calculation.

    use List::Util qw( sum );

    my %data;
    while (<>) {
        chomp;
        my ($k, $v) = split /\t/;
        push @{ $data{$k} }, $v;
    }

    local $, = "\t";
    local $\ = "\n";
    for my $k (keys %data) {
        my $data = $data{$k};
        print $k, 0+@$data, sum(@$data)/@$data;
    }

Memory usage shouldn't be a problem with 300,000 lines, but you could reduce it by keeping a running count and sum for each key as you go along, instead of storing every value.

    my %data;
    while (<>) {
        chomp;
        my ($k, $v) = split /\t/;
        $data{$k}[0]++;          # count for this key
        $data{$k}[1] += $v;      # running sum for this key
    }

    local $, = "\t";
    local $\ = "\n";
    for my $k (keys %data) {
        my $data = $data{$k};
        print $k, $data->[0], $data->[1]/$data->[0];
    }

If the keys are sorted (or at least grouped) in the input, you could reduce memory usage to something constant.

    my $last;
    my $sum;
    my $count;

    local $, = "\t";
    local $\ = "\n";
    while (<>) {
        chomp;
        my ($k, $v) = split /\t/;
        if (defined($last) && $k ne $last) {
            # Key changed: emit the finished group, then reset.
            print $last, $count, $sum/$count;
            ($count, $sum) = (0, 0);
        }
        $last = $k;              # remember the key of the current group
        $count++;
        $sum += $v;
    }

    # Emit the final group at end of input.
    if (defined($last)) {
        print $last, $count, $sum/$count;
    }

As a one-liner, how about

    perl -lane'
       $d{$F[0]}[0]++;
       $d{$F[0]}[1] += $F[1];
       }{
       $, = "\t";
       print $_, $d{$_}[0], $d{$_}[1]/$d{$_}[0] for keys %d;
    '

It can be shortened, but making it any terser would hurt readability.

Update: I wasn't printing out the count. Fixed.

Re^2: Data munging
by umasuresh (Hermit) on Jan 22, 2010 at 00:56 UTC

    Thanks much ikegami. The first two options are awesome, I am still trying to wrap my brain around the third!
      It counts lines and maintains a sum of the values seen to date. When the key changes, it prints the average, then resets the line count and the sum.
      1 196  ->  count = 1  sum = 196
      1 190  ->  count = 2  sum = 196+190
      1 200  ->  count = 3  sum = 196+190+200
      key changed, so print average, and reset count and sum
      2 20   ->  count = 1  sum = 20
      key changed, so print average, and reset count and sum
      3 25   ->  count = 1  sum = 25
      3 19   ->  count = 2  sum = 25+19
      3 39   ->  count = 3  sum = 25+19+39
      key changed, so print average, and reset count and sum
      4 40   ->  count = 1  sum = 40
      4 41   ->  count = 2  sum = 40+41
      4 45   ->  count = 3  sum = 40+41+45
      eof, so print average
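
If it helps to see the trace run, here is a minimal sketch of the same count-and-reset loop applied to the sample rows from the trace. It assumes the rows are already grouped by key. The `group_averages` helper and the in-memory `@rows` list are only for illustration and are not part of the scripts above, which read from STDIN.

```perl
use strict;
use warnings;

# Sample (key, value) rows taken from the trace; assumed grouped by key.
my @rows = (
    [1, 196], [1, 190], [1, 200],
    [2, 20],
    [3, 25],  [3, 19],  [3, 39],
    [4, 40],  [4, 41],  [4, 45],
);

# Hypothetical helper: returns one "key\tcount\taverage" string per group.
# Only the current key, count and sum are kept, so memory use is constant.
sub group_averages {
    my @rows = @_;
    my @out;
    my ($last, $count, $sum);
    for my $row (@rows) {
        my ($k, $v) = @$row;
        if (defined($last) && $k ne $last) {
            # Key changed: record the finished group, then reset.
            push @out, join("\t", $last, $count, $sum / $count);
            ($count, $sum) = (0, 0);
        }
        $last = $k;
        $count++;
        $sum += $v;
    }
    # End of input: record the last group, if there was any input at all.
    push @out, join("\t", $last, $count, $sum / $count) if defined $last;
    return @out;
}

print "$_\n" for group_averages(@rows);
```

For key 2 this gives a count of 1 and an average of 20, and for key 4 a count of 3 and an average of 42 (126/3). If the keys were not grouped, the same key would simply show up as several separate groups, which is why this version needs sorted or grouped input.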