comment on

Here is another variation for you. I didn't sort the input as it appears that what you are starting with is already in correct order to just process the file linearly?

I parsed through the input to make an AoA for each record. Then I used some grep's and map's to pull out the Methy values corresponding to the various CG conditions and then calculated the mean and median on them.

I think I got the same answer as TJPride, but my eyes may have been cross-eyed this early morning. The array processing isn't the fastest way, if that matters, but the technique used is straightforward. Enjoy.

Update: I didn't change the code, but did adjust the spacing and I expanded the comments.

#!/usr/bin/perl -w
use strict;
use List::Util qw(sum);
use Data::Dumper;

use constant { name   =>0,
               start  =>1,
               end    =>2,
               pos1   =>3,
               pos2   =>4,
               CGtype =>5,
               Cov    =>6,
               Methy  =>7,
               Strand =>8};

my @rec=();
my $line;
my $no_read=0;

my $header_line = <DATA>;   #skip the headers

print q{"CpG" "length" "CG" "CG>0" "CG>5" "CG>10" "CG>0mean" },
      q{"CG>0median" "CG>5mean" "CG>5median" "CG>10mean" },
      q{"CG>10median"},"\n"; 

while ($no_read or defined($line=<DATA>))
{
     next if $line =~ /^\s*$/;
     my @cols = split(' ',$line);
     
     if (!@rec)             #first line of record
     {
         $no_read=0;
         push @rec,[@cols];
     }
     elsif ($rec[0][name]  ne $cols[name]  or  
            $rec[0][start] != $cols[start] or
            $rec[0][end]   != $cols[end])
     {
        $no_read=1;         # keep current line for next record
        process_record();   # record is finished
        @rec=();
     }
     else                   #record continuation 
     {
         push @rec,[@cols];
     }    
}
process_record();  #for the last one

sub process_record
{
    
    my $num_CG = grep{$_->[CGtype] eq 'CG'}@rec;
    
    # collect Methy values for CGtype with Cov >0,5,10
    # The number of those values is a proxy for count of CGtypes
    # statisfying those conditions.
    
    # map{} to Methy values is used instead of scalar grep{} to 
    # count the number of lines statisfying the condition(s) because
    # the @Meth_CGgtX arrays can be used easily in the mean/median
    # calculations.
    
    # this code does iterate over @rec 4 times, but it does have a
    # regularity and simplicity to it that is worth something.
    
    # I figure that if @rec is not "that big", this won't matter
    # much. If it does, then this is an easily identifiable sub
    # that can be "improved".
    
    my @Meth_CGgt0 =  map{ ( $_->[CGtype] eq 'CG' and
                             $_->[Cov]>0  ) ? $_->[Methy] :()}@rec;
    
    my @Meth_CGgt5 =  map{ ( $_->[CGtype] eq 'CG' and
                             $_->[Cov]>5  ) ? $_->[Methy] :()}@rec;
                  
    my @Meth_CGgt10 = map{ ( $_->[CGtype] eq 'CG' and
                             $_->[Cov]>10 ) ? $_->[Methy] :()}@rec;
    
    print "$rec[0][name]:$rec[0][start]-$rec[0][end] ";
    print ''.@rec." ";
    print "$num_CG ";
    
    # these are the counts of CGgt0, CGgt5, CGgt10
    print ''.@Meth_CGgt0." ".@Meth_CGgt5." ".@Meth_CGgt10." ";
    
    # ternary operator is used to avoid divide by zero errors!
    
    print @Meth_CGgt0  ? (sum(@Meth_CGgt0)/@Meth_CGgt0)   : 0, " ";
    print median(@Meth_CGgt0)," ";
    
    print @Meth_CGgt5  ? (sum(@Meth_CGgt5)/@Meth_CGgt5)   : 0, " ";
    print median(@Meth_CGgt5)," ";
    
    print @Meth_CGgt10 ? (sum(@Meth_CGgt10)/@Meth_CGgt10) : 0, " ";
    print median(@Meth_CGgt10),"\n";
}

sub median
{
   my @array = @_;
   return 0 if @array == 0;          # for NULL arrays
   #return $array[0] if @array == 1; # update: not necessary
   @array = sort {$a<=>$b}@array;
   if (@array % 2) {
      return $array[int(@array/2)];
   } else {
      return ($array[@array/2] + $array[@array/2 - 1]) / 2;
   } 
} 

=prints
"CpG" "length" "CG" "CG>0" "CG>5" "CG>10" "CG>0mean" 
"CG>0median" "CG>5mean" "CG>5median" "CG>10mean" 
"CG>10median"
chr1:18598-19673   5 2 2 0 0 0 0 0 0 0 0
chr1:124987-125426 3 2 0 0 0 0 0 0 0 0 0
chr1:317653-318092 3 0 0 0 0 0 0 0 0 0 0
chr1:427014-428027 1 0 0 0 0 0 0 0 0 0 0
chr1:439136-440407 1 0 0 0 0 0 0 0 0 0 0
chr1:523082-523977 1 1 0 0 0 0 0 0 0 0 0
chr1:534601-536512 1 0 0 0 0 0 0 0 0 0 0
chr1:703847-704410 1 0 0 0 0 0 0 0 0 0 0
chr1:752279-753308 1 0 0 0 0 0 0 0 0 0 0
chr2:5000-100000   4 4 4 2 0 7.25 8 9 9 0 0

=cut

__DATA__
chr     start    end    pos1    pos2    CGtype  Cov. Methy   Strand  
chr1    18598   19673   18676   18676   CHH      0    0       +
chr1    18598   19673   18689   18689   CG       2    0       +
chr1    18598   19673   18997   18997   CHH      0    0       +
chr1    18598   19673   19546   19546   CG       4    0   +
chr1    18598   19673   19671   19671   CHG      7   0   +
chr1    124987  125426  125001  125001  CHH 1   0   +
chr1    124987  125426  125226  125226  CG  0   0   +
chr1    124987  125426  125426  125426  CG  0   0   +
chr1    317653  318092  317653  317653  CHG 11  0   +
chr1    317653  318092  317795  317795  CHG 0   0   +
chr1    317653  318092  318090  318090  CHH 3   0   +
chr1    427014  428027  427025  427025  CHH 0   0   +
chr1    439136  440407  439687  439687  CHH 9   0   +
chr1    523082  523977  523167  523167  CG  0   0   +
chr1    534601  536512  535789  535789  CHH 1   0   +
chr1    703847  704410  703999  703999  CHH 0   0   +
chr1    752279  753308  753330  753330  CHH 0   0   +
chr2    5000    100000    100000    100000    CG    1    9    +
chr2    5000    100000    100000    100000    CG    3    2    +
chr2    5000    100000    100000    100000    CG    6    7    +
chr2    5000    100000    100000    100000    CG    9    11    +
[download]

In reply to Re: Help with Mean/Median for data analysis by Marshall
in thread Help with Mean/Median for data analysis by a217

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.