Re: Statistics on Tab Delimited File

The following calculates and prints the median of columns 3-5 for each 'test' (column 2). It reads your input file only once (your code appears to loop over it twice). If this is not what you are looking for, please also show your expected output. The median function comes from Acme::Tools. See also perldsc.

use strict;
use warnings;
use Acme::Tools;

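my %data;    # $data{test}{column number} = [ values seen for that test ]
while (<DATA>) {
    # Fields 1..4 (0-based) are the test name and the three value columns
    my ($test, @vals) = (split)[1..4];
    my $col = 3;    # report columns with 1-based numbers, starting at 3
    for my $val (@vals) {
        push @{ $data{$test}{$col} }, $val;
        $col++;
    }
}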
#use Data::Dumper; print Dumper(\%data);

for my $test (sort keys %data) {
    for my $col (sort keys %{ $data{$test} }) {
        my $med = median(@{ $data{$test}{$col} });
        print "test=$test, col=$col, med=$med\n";
    }
}

__DATA__
contig1 test1 1e-28 28 55
contig1 test2 1e-10 22 54
contig2 test1 1e-10 24 78
contig3 test2 10 78 57
contig4 test3 1e-5 200 55
contig4 test2 10 100 43

Prints out:

test=test1, col=3, med=5e-11
test=test1, col=4, med=26
test=test1, col=5, med=66.5
test=test2, col=3, med=10
test=test2, col=4, med=78
test=test2, col=5, med=54
test=test3, col=3, med=1e-5
test=test3, col=4, med=200
test=test3, col=5, med=55
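
If Acme::Tools is not installed, a median can be written in a few lines of core Perl. This is only a sketch: the name median_of is made up here, and it assumes every value is numeric (strings such as 1e-28 are fine, since Perl converts them when sorting numerically). For an even count it averages the two middle values, which is consistent with the 5e-11 and 66.5 results above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Acme::Tools' median, core Perl only.
sub median_of {
    my @sorted = sort { $a <=> $b } @_;    # numeric sort, not string sort
    return unless @sorted;                 # empty list: no median
    my $mid = int(@sorted / 2);
    return @sorted % 2
        ? $sorted[$mid]                               # odd count: middle value
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;    # even count: mean of the two middle values
}

print median_of(28, 24), "\n";         # test1, column 4
print median_of(22, 78, 100), "\n";    # test2, column 4
```

In the script above, median(...) could then be swapped for median_of(...) and the use Acme::Tools line dropped.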

I know this is a mess!

Let perltidy clean it up for you. It also points out that your code has compile errors. You probably didn't intend to place the medianeval sub inside that foreach loop.


Replies are listed 'Best First'.
Re^2: Statistics on Tab Delimited File
by Paragod28 (Novice) on Dec 10, 2009 at 21:19 UTC

Thank you so much! I have been working on a solution all week. That is close to what I was looking for, but I did not explain it well. That is my fault.

__DATA__
Contig Organism Eval Length MappedLength
contig1 test1 1e-28 28 55
contig1 test2 1e-10 22 54
contig2 test1 1e-10 24 78
contig3 test2 10 78 57
contig4 test3 1e-5 200 55
contig4 test2 10 100 43

I am trying for this output (math may not be correct for median but frequency is correct):

Organism Frequency EvalMedian LengthMedian MappedMedian
test2 3 5 38 47
test1 2 1e-10 24 54
test3 1 1e-5 200 55

The "Frequency" is how many times I see the organism in the file. When an organism is hit multiple times, I take all of its values and find the median of the combined values for that particular match (test1, test2, etc.). If the "Organism" has no other match, the median values are the same as the values found.

I see that I did not get column one[0] in order, but that does not matter for the final output. "test1" will actually be long scientific names.

Thanks
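
One way to get a table like that is to keep one array per organism per column, count rows as they are read, and then sort the organisms by that count. The following is a minimal sketch under stated assumptions, not a tested solution for the real file: the names median_of and summarize are invented here, the median is computed locally instead of with Acme::Tools, and the input is split on whitespace as in the sample. Since the real organism names are long scientific names that may contain spaces, splitting on tabs after a chomp would probably be needed there.

```perl
use strict;
use warnings;

# Hypothetical helper: median of a list of numbers, core Perl only.
sub median_of {
    my @s = sort { $a <=> $b } @_;
    return unless @s;
    my $mid = int(@s / 2);
    return @s % 2 ? $s[$mid] : ($s[$mid - 1] + $s[$mid]) / 2;
}

# Hypothetical helper: takes input lines and returns one row per organism:
# [ organism, frequency, Eval median, Length median, MappedLength median ]
sub summarize {
    my @lines = @_;
    my (%count, %vals);
    for my $line (@lines) {
        # Assumption: whitespace-separated fields; the contig name is not needed
        my (undef, $org, @fields) = split ' ', $line;
        next unless defined $org and @fields == 3;    # skip blank or short lines
        next if $org eq 'Organism';                   # skip the header row
        $count{$org}++;
        push @{ $vals{$org}[$_] }, $fields[$_] for 0 .. 2;
    }
    # Most frequent organism first; ties are broken by name
    my @orgs = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    my @rows;
    for my $org (@orgs) {
        push @rows, [ $org, $count{$org}, map { median_of(@$_) } @{ $vals{$org} } ];
    }
    return @rows;
}

# Sample data copied from the post above
my @lines = split /\n/, <<'END';
Contig Organism Eval Length MappedLength
contig1 test1 1e-28 28 55
contig1 test2 1e-10 22 54
contig2 test1 1e-10 24 78
contig3 test2 10 78 57
contig4 test3 1e-5 200 55
contig4 test2 10 100 43
END

print join("\t", qw(Organism Frequency EvalMedian LengthMedian MappedMedian)), "\n";
print join("\t", @$_), "\n" for summarize(@lines);
```

On the sample data the rows come out in the requested order (test2, test1, test3) with frequencies 3, 2 and 1. The medians agree with those printed earlier in the thread (for example 10, 78, 54 for test2) rather than with the hand-written figures in the post, which were flagged there as possibly incorrect.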
