Re: Data munging

The general idea is right, but you could simplify your average calculation.

use List::Util qw( sum );

my %data;
while (<>) {
    chomp;
    my ($k, $v) = split /\t/;
    push @{ $data{$k} }, $v;
}

local $, = "\t";
local $\ = "\n";
for my $k (keys %data) {
    my $data = $data{$k};
    print $k, 0+@$data, sum(@$data)/@$data;
}

Memory usage shouldn't be a problem with 300,000 lines, but you could reduce it by keeping only a running count and sum per key instead of storing every value.

my %data;
while (<>) {
    chomp;
    my ($k, $v) = split /\t/;
    $data{$k}[0]++;
    $data{$k}[1]+= $v;
}

local $, = "\t";
local $\ = "\n";
for my $k (keys %data) {
    my $data = $data{$k};
    print $k, $data->[0], $data->[1]/$data->[0];
}

If the keys are sorted (or at least grouped) in the input, you could reduce memory usage to a constant amount, since only the current key's count and sum need to be kept.

my $last;
my $sum;
my $count;
local $, = "\t";
local $\ = "\n";
while (<>) {
    chomp;
    my ($k, $v) = split /\t/;
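    if (defined($last) && $k ne $last) {
        # Key changed: report the group that just ended.
        print $last, $count, $sum/$count;
    }
    if (!defined($last) || $k ne $last) {
        # First line, or first line of a new group: start fresh.
        ($last, $count, $sum) = ($k, 0, 0);
    }
    $count++;
    $sum += $v;
}

if (defined($last)) {
    # End of input: report the final group.
    print $last, $count, $sum/$count;
}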

As a one-liner, how about

perl -lane'
    $d{$F[0]}[0]++;
    $d{$F[0]}[1]+= $F[1];
}{
    $, = "\t";
    print $_, $d{$_}[0], $d{$_}[1]/$d{$_}[0] for keys %d;
'

It can be shortened, but anything shorter would hurt readability.

Update: I wasn't printing out the count. Fixed.

Re^2: Data munging by umasuresh (Hermit) on Jan 22, 2010 at 00:56 UTC
Thanks much ikegami. The first two options are awesome, I am still trying to wrap my brain around the third!
Re^3: Data munging by ikegami (Patriarch) on Jan 22, 2010 at 01:01 UTC
It counts lines and maintains a sum of the values seen to date. When the key changes, it prints the average, then resets the line count and the sum.

1  196  ->  count = 1  sum = 196
1  190  ->  count = 2  sum = 196+190
1  200  ->  count = 3  sum = 196+190+200
key changed, so print average, and reset count and sum
2  20   ->  count = 1  sum = 20
key changed, so print average, and reset count and sum
3  25   ->  count = 1  sum = 25
3  19   ->  count = 2  sum = 25+19
3  39   ->  count = 3  sum = 25+19+39
key changed, so print average, and reset count and sum
4  40   ->  count = 1  sum = 40
4  41   ->  count = 2  sum = 40+41
4  45   ->  count = 3  sum = 40+41+45
eof, so print average
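To try the walk-through above without an input file, here is a minimal sketch that feeds the same sample rows through the grouped-key logic. It assumes tab-separated lines with equal keys adjacent to each other; `grouped_averages` is a hypothetical helper name used only for illustration, not something from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns one
# [key, count, average] triple per run of adjacent equal keys.
sub grouped_averages {
    my @lines = @_;
    my ($last, $count, $sum);
    my @out;
    for my $line (@lines) {
        my ($k, $v) = split /\t/, $line;
        if (defined($last) && $k ne $last) {
            # Key changed: emit the group that just ended.
            push @out, [ $last, $count, $sum / $count ];
        }
        if (!defined($last) || $k ne $last) {
            # First line, or first line of a new group: reset the state.
            ($last, $count, $sum) = ($k, 0, 0);
        }
        $count++;
        $sum += $v;
    }
    # End of input: emit the final group, if there was any input at all.
    push @out, [ $last, $count, $sum / $count ] if defined $last;
    return @out;
}

# Sample rows taken from the trace above, joined with tabs.
my @rows = map { join("\t", @$_) } (
    [1, 196], [1, 190], [1, 200],
    [2, 20],
    [3, 25], [3, 19], [3, 39],
    [4, 40], [4, 41], [4, 45],
);

# One output line per key: key, count, average.
print join("\t", @$_), "\n" for grouped_averages(@rows);
```

With this data, key 2 should come out with a count of 1 and an average of 20, and key 4 with a count of 3 and an average of 42. Only three scalars are kept between lines, which is why memory stays constant no matter how long the input is.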