Average columns based on header name

jacobs.smith has asked for the wisdom of the Perl Monks concerning the following question:

Hi Friends,

I have files with columns like this. This sample input below is partial.

Please check below for main file link. Each file will have only two rows.

Gene    0.4%    0.7%    1.1%    1.4%    1.8%    2.2%    2.5%    2.9%  
+  3.3%    3.6%    4.0%    4.3%    4.7%    5.1%    5.4%    5.8%    6.2
+%    6.5%    6.9%    7.2%    7.6%    8.0%    8.3%    8.7%    9.1%    
+9.4%    9.8%    10.1%    10.5%    10.9%    11.2%    11.6%    12.0%   
+ 12.3%    12.7%    13.0%    13.4%    13.8%    14.1%    14.5%    14.9%
+    15.2%    15.6%    15.9%    16.3%    16.7%    17.0%    17.4%    17
+.8%    18.1%    18.5%    18.8%    19.2%    19.6%    19.9%    20.3%   
+ 20.7%    21.0%    21.4%    21.7%    22.1%    22.5%    22.8%    23.2%
+    23.6%    23.9%    24.3%    24.6%    25.0%    25.4%    25.7%    26
+.1%    26.4%    26.8%    27.2%    27.5%    27.9%    28.3%    28.6%   
+ 29.0%    29.3%    29.7%    30.1%    30.4%    30.8%    31.2%    31.5%
+    31.9%    32.2%    32.6%    33.0%    33.3%    33.7%    34.1%    34
+.4%    34.8%    35.1%    35.5%    35.9%    36.2%    36.6%    37.0%   
+ 37.3%    37.7%    38.0%    38.4%    38.8%    39.1%    39.5%    39.9%
+    40.2%    40.6%    40.9%    41.3%    41.7%    42.0%    42.4%    42
+.8%    43.1%    43.5%    43.8%    44.2%    44.6%    44.9%    45.3%   
+ 45.7%    46.0%    46.4%    46.7%    47.1%    47.5%    47.8%    48.2%
+    48.6%    48.9%    49.3%    49.6%    50.0%    50.4%    50.7%    51
+.1%    51.4%    51.8%    52.2%    52.5%    52.9%    53.3%    53.6%   
+ 54.0%    54.3%    54.7%    55.1%    55.4%    55.8%    56.2%    56.5%
+    56.9%    57.2%    57.6%    58.0%    58.3%    58.7%    59.1%    59
+.4%    59.8%    60.1%    60.5%    60.9%    61.2%    61.6%    62.0%   
+ 62.3%    62.7%    63.0%    63.4%    63.8%    64.1%    64.5%    64.9%
+    65.2%    65.6%    65.9%    66.3%    66.7%    67.0%    67.4%    67
+.8%    68.1%    68.5%    68.8%    69.2%    69.6%    69.9%    70.3%   
+ 70.7%    71.0%    71.4%    71.7%    72.1%    72.5%    72.8%    73.2%
+    73.6%    73.9%    74.3%    74.6%    75.0%    75.4%    75.7%    76
+.1%    76.4%    76.8%    77.2%    77.5%    77.9%    78.3%    78.6%   
+ 79.0%    79.3%    79.7%    80.1%    80.4%    80.8%    81.2%    81.5%
+    81.9%    82.2%    82.6%    83.0%    83.3%    83.7%    84.1%    84
+.4%    84.8%    85.1%    85.5%    85.9%    86.2%    86.6%    87.0%   
+ 87.3%    87.7%    88.0%    88.4%    88.8%    89.1%    89.5%    89.9%
+    90.2%    90.6%    90.9%    91.3%    91.7%    92.0%    92.4%    92
+.8%    93.1%    93.5%    93.8%    94.2%    94.6%    94.9%    95.3%   
+ 95.7%    96.0%    96.4%    96.7%    97.1%    97.5%    97.8%    98.2%
+    98.6%    98.9%    99.3%    99.6%    100.0%    0.4%    0.7%    1.1
+%    1.4%    1.8%    2.2%    2.5%    2.9%    3.3%    3.6%    4.0%    
+4.3%    4.7%    5.1%    5.4%    5.8%    6.2%    6.5%    6.9%    7.2% 
+   7.6%    8.0%    8.3%    8.7%    9.1%    9.4%    9.8%    10.1%    1
+0.5%    10.9%    11.2%    11.6%    12.0%    12.3%    12.7%    13.0%  
+  13.4%    13.8%    14.1%    14.5%    14.9%    15.2%    15.6%    15.9
+%    16.3%    16.7%    17.0%    17.4%    17.8%    18.1%    18.5%    1
+8.8%    19.2%    19.6%    19.9%    20.3%    20.7%    21.0%    21.4%  
+  21.7%    22.1%    22.5%    22.8%    23.2%    23.6%    23.9%    24.3
+%    24.6%    25.0%    25.4%    25.7%    26.1%    26.4%    26.8%    2
+7.2%    27.5%    27.9%    28.3%    28.6%    29.0%    29.3%    29.7%  
+  30.1%    30.4%    30.8%    31.2%    31.5%    31.9%    32.2%    32.6
+%    33.0%    33.3%    33.7%    34.1%    34.4%    34.8%    35.1%    3
+5.5%    35.9%    36.2%    36.6%    37.0%    37.3%    37.7%    38.0%  
+  38.4%    38.8%    39.1%    39.5%    39.9%    40.2%    40.6%    40.9
+%    41.3%    41.7%    42.0%    42.4%    42.8%    43.1%    43.5%    4
+3.8%    44.2%    44.6%    44.9%    45.3%    45.7%    46.0%    46.4%  
+  46.7%    47.1%    47.5%    47.8%    48.2%    48.6%    48.9%    49.3
+%    49.6%    50.0%    50.4%    50.7%    51.1%    51.4%    51.8%    5
+2.2%    52.5%    52.9%    53.3%    53.6%    54.0%    54.3%    54.7%  
+  55.1%    55.4%    55.8%    56.2%    56.5%    56.9%    57.2%    57.6
+%    58.0%    58.3%    58.7%    59.1%    59.4%    59.8%    60.1%    6
+0.5%    60.9%    61.2%    61.6%    62.0%    62.3%    62.7%    63.0%  
+  63.4%    63.8%    64.1%    64.5%    64.9%    65.2%    65.6%    65.9
+%    66.3%    66.7%    67.0%    67.4%    67.8%    68.1%    68.5%    6
+8.8%    69.2%    69.6%    69.9%    70.3%    70.7%    71.0%    71.4%  
+  71.7%    72.1%    72.5%    72.8%    73.2%    73.6%    73.9%    74.3
+%    74.6%    75.0%    75.4%    75.7%    76.1%    76.4%    76.8%    7
+7.2%    77.5%    77.9%    78.3%    78.6%    79.0%    79.3%    79.7%  
+  80.1%    80.4%    80.8%    81.2%    81.5%    81.9%    82.2%    82.6
+%    83.0%    83.3%    83.7%    84.1%    84.4%    84.8%    85.1%    8
+5.5%    85.9%    86.2%    86.6%    87.0%    87.3%    87.7%    88.0%  
+  88.4%    88.8%    89.1%    89.5%    89.9%    90.2%    90.6%    90.9
+%    91.3%    91.7%    92.0%    92.4%    92.8%    93.1%    93.5%    9
+3.8%    94.2%    94.6%    94.9%    95.3%    95.7%    96.0%    96.4%  
+  96.7%    97.1%    97.5%    97.8%    98.2%    98.6%    98.9%    99.3
+%    99.6%    100.0%
[download]

Basically, here is what I need to be done.

a. Start from second column which is 0.4% here. b. Go until you hit "10" in the header name. If the header name is exactly 10.0%, then include that column too. If not, only include until the column before it. In this example, since we have 10.1% (29th column), we will be including columns starting from 0.4%(second) until 9.8% which is the 28th column. If the 29th column was to be 10.0%, then it would have been included too. c. Average the values for these respective columns in the second row (data is not presented here - please click this link for total dataset - https://goo.gl/W8jND7). In this example, starting from 0.4%(second column) till 9.8%(28th column). d. In the output, print first column which is "Gene", and this average value with column header being

Gene Average_10%

e. Then start from 10.1% (29th column) and check until you hit "20" in the header name. Repeat steps b through d. And print output as

Gene Average_10% Average_20%

Repeat this until you have

Gene Average_10% Average_20% Average_30% Average_40% Average_50% Average_60% Average_70% Average_80% Average_90% Average_100%

f. After you hit 100%, it means one dataset is done.

g. If you observe my column header carefully here, there is another 0.4%-100% columns after the first 100%. I will be having 13 of these 0.4%-100%s in the input file at the above link.

i. I have multiple files, the headers can be

1% 2% 3%....100%
1.5% 2.5% 3.5%....100%
[download]

It varies from file to file. But the logic of averaging(if you hit "10", "20", etc) is always the same. And the number of samples 13 is also same which means each file will have 100%s for 13 times.

Comment on Average columns based on header name Select or Download Code

Replies are listed 'Best First'.
Re: Average columns based on header name by 1nickt (Canon) on Mar 17, 2016 at 19:19 UTC
Hi jacobs.smith, welcome to the monastery. You've described your task very well. With such a clear understanding of what you need to do, you should be able to come up with some code, or even "pseudo-code", that attempts to solve the problem. Please show it here, so you can get some feedback. Perlmonks isn't a code-writing service ... The way forward always starts with a minimal test.	[reply]
Re: Average columns based on header name by Laurent_R (Canon) on Mar 17, 2016 at 23:25 UTC
To get you started, a quick test under the debugger: DB<24> $in = "0.4% 0.7% 1.1% 1.4% 1.8% 2.2% 2.5% 2.9% 3.3% 3.6% 4.0% + 4.3% 4.7% 5.1% 5.4% 5.8% 6.2% 6.5% 6.9% 7.2% 7.6% 8.0% 8.3% 8.7% 9.1 +% 9.4% 9.8% 10.1% 10.5% 10.9% 11.2<.3% 3.6% 4.0% 4.3% 4.7% 5.1% 5.4% +5.8% 6.2% 6.5% 6.9% 7.2% 7.6% 8.0% 8.3% 8.7% 9.1% 9.4% 9.8% 10.1% 10. +5% 10.9% 11.2%"; DB<25> $in =~ s/%//g; # remove the percentage signs DB<26> p $in 0.4 0.7 1.1 1.4 1.8 2.2 2.5 2.9 3.3 3.6 4.0 4.3 4.7 5.1 5.4 5.8 6.2 6 +.5 6.9 7.2 7.6 8.0 8.3 8.7 9.1 9.4 9.8 10.1 10.5 10.9 11.2 11.6 DB<26> @val = split /\s/, $in; DB<27> while ($i = shift @val and $i <= 10) {$sum += $i; $j++} DB<28> print $sum/$j; 5.07037037037037 DB<29> unshift @val, $i if $i > 10; # put back the last value into t +he array if needed [download] This might not be the best way for the final program, but it gives you basically the idea for one group of values. Then you need to create an array with the steps (10, 20, 30...100), for example: `DB<32> @steps = map $_ * 10, 1..10; DB<33> x \@steps 0 ARRAY(0x600639618) 0 10 1 20 2 30 3 40 4 50 5 60 6 70 7 80 8 90 9 100` [download] and include the above process in another external loop going through the steps. Then you finally need yet another more external loop for going through the whole data. I hope this helps you get going. If you understand the above, you can probably do the whole thing.	[reply] [d/l] [select]
Re: Average columns based on header name by Cristoforo (Curate) on Mar 18, 2016 at 23:16 UTC
After downloading the file you gave in your post, I was able to write a program that parsed it and produced the averages. There is no error checking, so the file must be in the order you described. #!/usr/bin/perl use strict; use warnings; use List::Util 'sum'; # pm 1158142 open my $fh, '<', 'j2.txt' or die $!; my (undef, @percent) = split /[\s%]+/, <$fh>; #split on spaces and % s +ign my ($gene, @data) = split ' ', <$fh>; close $fh or die $!; my $limit = 10; my $start = 0; my @avgs; print join("\t", $gene, map {sprintf "Average_%d%%", 10 * $_} 1 .. 10) +, "\n"; for my $i (0 .. $#percent) { if ($percent[$i] <= $limit) { if ($percent[$i] == $limit) { push @avgs, sprintf "%.16f", sum(@data[$start..$i]) / ($i+1 - $start); if ($limit == 100) { print join("\t", @avgs), "\n"; @avgs = (); $limit = 10; $start = $i+1; } else { $limit += 10; $start = $i+1; } } } else { push @avgs, sprintf "%.16f", sum(@data[$start..$i-1]) / ($i - $start); $limit += 10; $start = $i; } } [download] It produced the following output. Gene1 Average_10% Average_20% Average_30% Average_40% A +verage_50% Average_60% Average_70% Average_80% Average_90 +% Average_100% 0.0085539803980398 0.0029112286228623 0.0125793829382938 0.01 +16449144914491 0.0184377812781278 0.0050317531753175 0.01067 +45049504950 0.0020127012701270 0.0106745049504950 0.02959749 +09990999 0.0136619722578318 0.0178790606333361 0.0063430585482791 0.01 +45855494640373 0.0112920382947386 0.0053672033870054 0.01976 +10670157925 0.0146378274191056 0.0108215366991245 0.02352507 +97807054 0.0164707828807572 0.0136918453750137 0.0079514124251931 0.01 +20488239300120 0.0010953476300011 0.0159028248503863 0.00821 +51072250082 0.0079514124251931 0.0071197595950071 0.01752556 +20800175 0.0201577409649362 0.0194378216447599 0.0078664354985117 0.00 +71113981627170 0.0184896352230643 0.0167161754343373 0.02133 +41944881511 0.0172078276529943 0.0066373049518692 0.03887564 +32895198 0.0104368645819806 0.0270142152808031 0.0038451606354665 0.01 +00641194183384 0.0158907148710607 0.0109861732441901 0.01059 +38099140405 0.0197751118395422 0.0143016433839546 0.02330638 +18108890 0.0068663582776188 0.0225118460673360 0.0068663582776188 0.01 +50078973782240 0.0101524011676221 0.0146482309922535 0.01633 +21236174791 0.0151059882107614 0.0163321236174790 0.04193383 +09097434 0.0157328709970234 0.0251269401749335 0.0044248699679128 0.01 +46968895362819 0.0151709827471297 0.0176994798716513 0.01469 +68895362819 0.0162245232156804 0.0194378216447599 0.03223833 +83376506 0.0095465981450263 0.0179928578674928 0.0052072353518326 0.00 +62765783258696 0.0234325590832465 0.0099805344243457 0.01046 +09638764493 0.0156217060554976 0.0133900337618551 0.01129784 +09865652 0.0101156314461855 0.0097543588945361 0.0036127255164948 0.00 +27869596841532 0.0104510988155744 0.0180636275824742 0.04250 +11351833357 0.0245665335121649 0.0195087177890721 0.02229567 +74732253 0.0117016829888117 0.0178659624204179 0.0053632713698720 0.00 +75225104928075 0.0042314121522042 0.0234033659776234 0.01457 +48640798146 0.0082886921170750 0.0155151778914155 0.03432145 +41234343 0.0051755175517552 0.0188536710813939 0.0074757475747575 0.01 +88536710813938 0.0088723158030089 0.0235773577357736 0.01219 +94342291372 0.0241524152415241 0.0199627105567700 0.03881638 +16381638 0.0256052955722923 0.0146810285424147 0.0083044201856083 0.01 +20117506256120 0.0120117506256120 0.0083044201856083 0.02202 +15428136220 0.0145327353248145 0.0073405142712073 0.02402350 +12512240 0.0163571912746830 0.0146463625954432 0.0144099065991255 0.01 +42708148365857 0.0105153372480105 0.0272619854578050 0.02516 +16998434537 0.0319354686791430 0.0255372476023112 0.02516169 +98434537 [download] Hope this helps. List::Util was added to perl core in version 5.08.	[reply] [d/l] [select]