a217 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I was hoping one of the monks could give me some guidance regarding a short script I'm working on. Below is sample data separated by tab:

chr start end pos1 pos2 CGtype Cov. Methy Stra +nd chr1 18598 19673 18676 18676 CHH 0 0 + chr1 18598 19673 18689 18689 CG 2 0 + chr1 18598 19673 18997 18997 CHH 0 0 + chr1 18598 19673 19546 19546 CG 4 0 + chr1 18598 19673 19671 19671 CHG 7 0 + chr1 124987 125426 125001 125001 CHH 1 0 + chr1 124987 125426 125226 125226 CG 0 0 + chr1 124987 125426 125426 125426 CG 0 0 + chr1 317653 318092 317653 317653 CHG 11 0 + chr1 317653 318092 317795 317795 CHG 0 0 + chr1 317653 318092 318090 318090 CHH 3 0 + chr1 427014 428027 427025 427025 CHH 0 0 + chr1 439136 440407 439687 439687 CHH 9 0 + chr1 523082 523977 523167 523167 CG 0 0 + chr1 534601 536512 535789 535789 CHH 1 0 + chr1 703847 704410 703999 703999 CHH 0 0 + chr1 752279 753308 753330 753330 CHH 0 0 +

On each line, a chromosome number is listed along with a region defined by start and end positions. The values in the position columns should correspond to the region. My goal is to create a summary file with columns as follows:

"CpG" "length" "CG" "CG>0" "CG>5" "CG>10" "CG>0mean" "CG>0median" "CG>5mean" "CG>5median" "CG>10mean" "CG>10median"

Note these are just the column names, and I've used quotations to help distinguish them. The actual data is separated by tab.

The "CpG" column will serve as the ID. For example, the first ID would be chr1:18598-19673. The second column, "length," is the total number of lines corresponding to each ID. In the above example, the length of chr1:18598-19673 is 5. The next column, "CG," represents the total number of lines in the region that are labeled CG (as opposed to CHH or CHG) and the above example would have a value of 2. The column, "CG>0", represents the CG sites with coverage (in column "Cov." in the example) greater than zero (hence, chr1:18598-19673 would have CG>0 value of 2). It follows that the columns "CG>5" and "CG>10" represents instances where the coverage is greater than 5 and greater than 10, respectively. (and their values for chr1:18598-19673 region would be zero since there are no CG sites with coverage greater than 5 or 10).

Now comes the main issue I've had in approaching this. The following "mean" and "median" columns must calculate the mean and median of the methylation count (listed in the "Methy" column in the example). In the example, all of the methy values are zero, so the mean and median for CG>0, CG>5, and CG>10 would be all zero as well. However, in the real data set, the numbers may vary. The main problem I face is that I need to calculate the mean and median for each separate sub-dataset. If you have any suggestions for me I would appreciate the assistance.

Replies are listed 'Best First'.
Re: Help with Mean/Median for data analysis
by TJPride (Pilgrim) on Dec 21, 2011 at 05:26 UTC
    Well played. You've actually managed to post something that nobody has responded to yet - a good combination of proper documentation and high level of difficulty (or work at least). Here's one potential solution - you may notice that I've added a few records on the end so you can see how mean and median come out. What you do with the end data is of course up to you.

    use Data::Dumper; use strict; use warnings; my (%data, $k, $r, $n); <DATA>; while (<DATA>) { chomp; my %d; @d{qw/chr1 start end pos1 pos2 cgtype cov methy strand/} = split / +\s+/; push @{$data{"$d{'chr1'}:$d{'start'}-$d{'end'}"}}, \%d; } for $k (sort { $data{$a}[0]{'chr1'} cmp $data{$b}[0]{'chr1'} || $data{$a}[0]{'start'} <=> $data{$b}[0]{'start'} || $data{$a}[0]{'end'} <=> $data{$b}[0]{'end'} } keys %data) { my %d; $d{'cpg'} = $k; $d{'length'} = $#{$data{$k}} + 1; for $r (@{$data{$k}}) { if ($r->{'cgtype'} eq 'CG') { $d{'cg'}++; for $n (0, 5, 10) { if ($r->{'cov'} > $n) { $d{"cg>$n"}++; push @{$d{"cg>${n}vals"}}, $r->{'methy'}; } } } } for $n (0, 5, 10) { if ($d{"cg>${n}vals"}) { $d{"cg>${n}mean"} += $_ for @{$d{"cg>${n}vals"}}; $d{"cg>${n}mean"} /= $d{"cg>${n}"}; @{$d{"cg>${n}vals"}} = sort { $a <=> $b } @{$d{"cg>${n}val +s"}}; if ($d{"cg>$n"} % 2 == 1) { $d{"cg>${n}median"} = $d{"cg>${n}vals"}[$d{"cg>$n"} / +2]; } else { $d{"cg>${n}median"} = ($d{"cg>${n}vals"}[$d{"cg>$n"} / + 2 - 1] + $d{"cg>${n}vals"}[$d{"cg>$n"} / 2]) / 2; } } else { $d{"cg>${n}mean"} = 0; $d{"cg>${n}median"} = 0; } } ### DO SOMETHING WITH DATA use Data::Dumper; print Dumper(\%d); } __DATA__ chr start end pos1 pos2 CGtype Cov. Methy Stra +nd chr1 18598 19673 18676 18676 CHH 0 0 + chr1 18598 19673 18689 18689 CG 2 0 + chr1 18598 19673 18997 18997 CHH 0 0 + chr1 18598 19673 19546 19546 CG 4 0 + chr1 18598 19673 19671 19671 CHG 7 0 + chr1 124987 125426 125001 125001 CHH 1 0 + chr1 124987 125426 125226 125226 CG 0 0 + chr1 124987 125426 125426 125426 CG 0 0 + chr1 317653 318092 317653 317653 CHG 11 0 + chr1 317653 318092 317795 317795 CHG 0 0 + chr1 317653 318092 318090 318090 CHH 3 0 + chr1 427014 428027 427025 427025 CHH 0 0 + chr1 439136 440407 439687 439687 CHH 9 0 + chr1 523082 523977 523167 523167 CG 0 0 + chr1 534601 536512 535789 535789 CHH 1 0 + chr1 703847 704410 703999 703999 CHH 0 0 + chr1 752279 753308 753330 753330 CHH 0 0 + chr2 5000 100000 100000 100000 CG 1 9 + chr2 5000 100000 100000 100000 CG 3 2 + chr2 5000 100000 100000 100000 CG 6 7 + chr2 5000 100000 100000 100000 CG 9 11 +

      Well played. You've actually managed to post something that nobody has responded to yet .

      Um, there is no time travel at perlmonks :)

        What I mean is that people usually jump all over something when it's posted. This time it sat there long enough for me to notice it and write my entire solution and nobody had replied yet.
Re: Help with Mean/Median for data analysis
by Marshall (Canon) on Dec 21, 2011 at 15:30 UTC
    Here is another variation for you. I didn't sort the input as it appears that what you are starting with is already in correct order to just process the file linearly?

    I parsed through the input to make an AoA for each record. Then I used some grep's and map's to pull out the Methy values corresponding to the various CG conditions and then calculated the mean and median on them.

    I think I got the same answer as TJPride, but my eyes may have been cross-eyed this early morning. The array processing isn't the fastest way, if that matters, but the technique used is straightforward. Enjoy.

    Update: I didn't change the code, but did adjust the spacing and I expanded the comments.

    #!/usr/bin/perl -w use strict; use List::Util qw(sum); use Data::Dumper; use constant { name =>0, start =>1, end =>2, pos1 =>3, pos2 =>4, CGtype =>5, Cov =>6, Methy =>7, Strand =>8}; my @rec=(); my $line; my $no_read=0; my $header_line = <DATA>; #skip the headers print q{"CpG" "length" "CG" "CG>0" "CG>5" "CG>10" "CG>0mean" }, q{"CG>0median" "CG>5mean" "CG>5median" "CG>10mean" }, q{"CG>10median"},"\n"; while ($no_read or defined($line=<DATA>)) { next if $line =~ /^\s*$/; my @cols = split(' ',$line); if (!@rec) #first line of record { $no_read=0; push @rec,[@cols]; } elsif ($rec[0][name] ne $cols[name] or $rec[0][start] != $cols[start] or $rec[0][end] != $cols[end]) { $no_read=1; # keep current line for next record process_record(); # record is finished @rec=(); } else #record continuation { push @rec,[@cols]; } } process_record(); #for the last one sub process_record { my $num_CG = grep{$_->[CGtype] eq 'CG'}@rec; # collect Methy values for CGtype with Cov >0,5,10 # The number of those values is a proxy for count of CGtypes # statisfying those conditions. # map{} to Methy values is used instead of scalar grep{} to # count the number of lines statisfying the condition(s) because # the @Meth_CGgtX arrays can be used easily in the mean/median # calculations. # this code does iterate over @rec 4 times, but it does have a # regularity and simplicity to it that is worth something. # I figure that if @rec is not "that big", this won't matter # much. If it does, then this is an easily identifiable sub # that can be "improved". my @Meth_CGgt0 = map{ ( $_->[CGtype] eq 'CG' and $_->[Cov]>0 ) ? $_->[Methy] :()}@rec; my @Meth_CGgt5 = map{ ( $_->[CGtype] eq 'CG' and $_->[Cov]>5 ) ? $_->[Methy] :()}@rec; my @Meth_CGgt10 = map{ ( $_->[CGtype] eq 'CG' and $_->[Cov]>10 ) ? $_->[Methy] :()}@rec; print "$rec[0][name]:$rec[0][start]-$rec[0][end] "; print ''.@rec." "; print "$num_CG "; # these are the counts of CGgt0, CGgt5, CGgt10 print ''.@Meth_CGgt0." ".@Meth_CGgt5." ".@Meth_CGgt10." "; # ternary operator is used to avoid divide by zero errors! print @Meth_CGgt0 ? (sum(@Meth_CGgt0)/@Meth_CGgt0) : 0, " "; print median(@Meth_CGgt0)," "; print @Meth_CGgt5 ? (sum(@Meth_CGgt5)/@Meth_CGgt5) : 0, " "; print median(@Meth_CGgt5)," "; print @Meth_CGgt10 ? (sum(@Meth_CGgt10)/@Meth_CGgt10) : 0, " "; print median(@Meth_CGgt10),"\n"; } sub median { my @array = @_; return 0 if @array == 0; # for NULL arrays #return $array[0] if @array == 1; # update: not necessary @array = sort {$a<=>$b}@array; if (@array % 2) { return $array[int(@array/2)]; } else { return ($array[@array/2] + $array[@array/2 - 1]) / 2; } } =prints "CpG" "length" "CG" "CG>0" "CG>5" "CG>10" "CG>0mean" "CG>0median" "CG>5mean" "CG>5median" "CG>10mean" "CG>10median" chr1:18598-19673 5 2 2 0 0 0 0 0 0 0 0 chr1:124987-125426 3 2 0 0 0 0 0 0 0 0 0 chr1:317653-318092 3 0 0 0 0 0 0 0 0 0 0 chr1:427014-428027 1 0 0 0 0 0 0 0 0 0 0 chr1:439136-440407 1 0 0 0 0 0 0 0 0 0 0 chr1:523082-523977 1 1 0 0 0 0 0 0 0 0 0 chr1:534601-536512 1 0 0 0 0 0 0 0 0 0 0 chr1:703847-704410 1 0 0 0 0 0 0 0 0 0 0 chr1:752279-753308 1 0 0 0 0 0 0 0 0 0 0 chr2:5000-100000 4 4 4 2 0 7.25 8 9 9 0 0 =cut __DATA__ chr start end pos1 pos2 CGtype Cov. Methy Strand chr1 18598 19673 18676 18676 CHH 0 0 + chr1 18598 19673 18689 18689 CG 2 0 + chr1 18598 19673 18997 18997 CHH 0 0 + chr1 18598 19673 19546 19546 CG 4 0 + chr1 18598 19673 19671 19671 CHG 7 0 + chr1 124987 125426 125001 125001 CHH 1 0 + chr1 124987 125426 125226 125226 CG 0 0 + chr1 124987 125426 125426 125426 CG 0 0 + chr1 317653 318092 317653 317653 CHG 11 0 + chr1 317653 318092 317795 317795 CHG 0 0 + chr1 317653 318092 318090 318090 CHH 3 0 + chr1 427014 428027 427025 427025 CHH 0 0 + chr1 439136 440407 439687 439687 CHH 9 0 + chr1 523082 523977 523167 523167 CG 0 0 + chr1 534601 536512 535789 535789 CHH 1 0 + chr1 703847 704410 703999 703999 CHH 0 0 + chr1 752279 753308 753330 753330 CHH 0 0 + chr2 5000 100000 100000 100000 CG 1 9 + chr2 5000 100000 100000 100000 CG 3 2 + chr2 5000 100000 100000 100000 CG 6 7 + chr2 5000 100000 100000 100000 CG 9 11 +
Re: Help with Mean/Median for data analysis
by pvaldes (Chaplain) on Dec 22, 2011 at 12:27 UTC

    This its a trivial task with GNU-R. If you plan to go further and do more complex statistics in your data, I suggest you to take a look at the Statistics::R module in cpan and call directly R from perl. Something like:

    use Statistics::R; my $R = Statistics::R->new(); $R->run(q`mydata <-read.table("mydatafile",... more options here)`); my $summary = $R->get(q`summary(mydata)`); print $summary;

    Only as a quick example and untested, but you can see that you will need to write less code to do the same. You can use any other statistic modules that you prefer of course

      I think that would be like using nuclear weapons to kill an ant!

      The real issue here appears to be: how to parse the input into a "record". Once that is done, its just a few lines to get the statistics required.

        Well, I only look over the post so I probably may be wrong, but probably the factor function of R could help here. I understand that the OP wants to apply as.factor to i.e the start column (or at the columns one to three concatenated).