rnaeye has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
Once again I need your kind help. I am trying to calculate sum of column-3 by window size of column-2 in the following example. In this example I would like to divide positions in column-2 into 20 base pair wide windows as below:

window_of_col-2 positions(col-2) sum of col-3 (coverage) based +on window --------------- -------- --------------- 1-20 1,4,7 9 21-40 22,24,38 21 41-60 44,50,57,60 85 61-80 65 30 And, I want to print the results as below: window SUM 20 9 40 21 60 85 80 30

I spent a good amount of time on this, and I feel pretty stupid already. I tried range operator and tried to add a counter for window size, but could not make it work. Only the thing I can do is following that calculates sum of col-3 but not by window size. I would appriciate any help or pointers. Thanks.

#!/usr/bin/perl use warnings; use strict; use 5.010; my $total = 0; while (<DATA>){ chomp; my ($chr, $pos, $coverage) = split /\t/; $total += $coverage; } say $total; #------------------------- #data format: #chr positon_on_DNA coverage __DATA__ chr 1 2 chr 4 2 chr 7 5 chr 22 5 chr 24 6 chr 38 10 chr 44 10 chr 50 20 chr 57 25 chr 60 30 chr 65 30

Replies are listed 'Best First'.
Re: How to calculate sum of a column by window size based on another column
by GrandFather (Saint) on Jun 27, 2013 at 01:29 UTC

    If the rows are not ordered by pos then you need to keep a total for each bin. Consider:

    #!/usr/bin/perl use warnings; use strict; use 5.010; my $binSize = 20; my %binTotals; while (<DATA>){ chomp; my ($chr, $pos, $coverage) = split /\t/; $binTotals{int(($pos - 1) / $binSize)} += $coverage; } printf "%4d %d\n", 20 * (1 + $_), $binTotals{$_} for sort {$a <=> $b} +keys %binTotals; __DATA__ chr 1 2 chr 4 2 chr 7 5 chr 22 5 chr 24 6 chr 38 10 chr 44 10 chr 50 20 chr 57 25 chr 60 30 chr 65 30

    Prints:

    20 9 40 21 60 85 80 30

    and actually that is a lot cleaner than code that assumes the rows ore ordered so doesn't need to accumulate totals for bins:

    my $binSize = 20; my $bin = 0; my $total = 0; while ((my $line = <DATA> // '') || defined $bin){ chomp $line; my ($chr, $pos, $coverage) = split /\t/, $line; my $thisBin; $thisBin = int(($pos - 1) / $binSize) if defined $pos; if (! defined $thisBin || $bin != $thisBin) { printf "%4d %d\n", 20 * (1 + $bin), $total; last if ! defined $thisBin; $bin = $thisBin; $total = 0; } $total += $coverage; }
    True laziness is hard work
Re: How to calculate sum of a column by window size based on another column
by Cristoforo (Curate) on Jun 27, 2013 at 04:53 UTC
    #!/usr/bin/perl use warnings; use strict; my %total; my $limit = 20; while (<DATA>){ chomp; my ($chr, $pos, $coverage) = split /\t/; $limit += 20 while $pos > $limit; # found a new limit and the hash has prior data # (i.e., not the case of the first record and no data in hash yet) if (! exists $total{$limit} && %total) { # there is only 1 key in the hash print join("\t", each %total), "\n"; undef %total; } $total{$limit} += $coverage; } print join("\t", each %total), "\n"; __DATA__ chr 1 2 chr 4 2 chr 7 5 chr 22 5 chr 24 6 chr 38 10 chr 44 10 chr 50 20 chr 57 25 chr 60 30 chr 65 30 chr 105 30
    Output
    20 9 40 21 60 85 80 30 120 30

    Update: used the 'each' keyword to feed join

Re: How to calculate sum of a column by window size based on another column
by rnaeye (Friar) on Jun 27, 2013 at 02:27 UTC

    Thank you!