in reply to Statistics on Tab Delimited File

The following will calculate and print out the median value of each column (3-5), for each 'test' (column 2). This loops through your input file once (it looks like you loop twice). If this is not what you are looking for, please also include your expected output. The median function is from Acme::Tools. See also perldsc.
use strict;
use warnings;
use Acme::Tools;

my %data;
while (<DATA>) {
    my ($test, @vals) = (split)[1..4];
    my $col = 3;
    for my $val (@vals) {
        push @{ $data{$test}{$col} }, $val;
        $col++;
    }
}

#use Data::Dumper; print Dumper(\%data);

for my $test (sort keys %data) {
    for my $col (sort keys %{ $data{$test} }) {
        my $med = median(@{ $data{$test}{$col} });
        print "test=$test, col=$col, med=$med\n";
    }
}

__DATA__
contig1 test1 1e-28 28 55
contig1 test2 1e-10 22 54
contig2 test1 1e-10 24 78
contig3 test2 10 78 57
contig4 test3 1e-5 200 55
contig4 test2 10 100 43
Prints out:
test=test1, col=3, med=5e-11
test=test1, col=4, med=26
test=test1, col=5, med=66.5
test=test2, col=3, med=10
test=test2, col=4, med=78
test=test2, col=5, med=54
test=test3, col=3, med=1e-5
test=test3, col=4, med=200
test=test3, col=5, med=55
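If Acme::Tools is not installed, a median is easy enough to write by hand. The following is only a minimal sketch: the name median_plain is made up for illustration and is not part of any module, and it assumes every value is numeric.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Acme::Tools' median (the name is an assumption).
# Sort numerically, then take the middle element for an odd count,
# or the mean of the two middle elements for an even count.
sub median_plain {
    my @sorted = sort { $a <=> $b } @_;
    return unless @sorted;              # no values: return undef / empty list
    my $mid = int( @sorted / 2 );
    return @sorted % 2
        ? $sorted[$mid]
        : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
}

print median_plain( 28, 24 ), "\n";         # even count: prints 26
print median_plain( 22, 78, 100 ), "\n";    # odd count: prints 78
```

The two calls use the column 4 values for test1 and test2 from the data above, so the results match the med=26 and med=78 lines in the output.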
I know this is a mess!
Let perltidy clean it up for you. It also points out that your code has compile errors. You probably didn't intend to place the medianeval sub inside that foreach loop.

Re^2: Statistics on Tab Delimited File
by Paragod28 (Novice) on Dec 10, 2009 at 21:19 UTC

    Thank you so much! I have been working on a solution all week. That is near what I was looking for but I did not explain it well. That is my fault.

    __DATA__
    Contig Organism Eval Length MappedLength
    contig1 test1 1e-28 28 55
    contig1 test2 1e-10 22 54
    contig2 test1 1e-10 24 78
    contig3 test2 10 78 57
    contig4 test3 1e-5 200 55
    contig4 test2 10 100 43
    I am trying for this output (the median math may not be correct, but the frequency is):
    Organism Frequency EvalMedian LengthMedian MappedMedian
    test2 3 5 38 47
    test1 2 1e-10 24 54
    test3 1 1e-5 200 55

    The "Frequency" is how many times I see the organism in the file. When an organism appears more than once, I take all of its values and find the median of each column across those rows (test1, test2, etc.). If an organism appears only once, its median values are simply the values found.

    I see that I did not get column one[0] in order but that does not matter for the final output. "test1" will actually be long scientific names.

    Thanks
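One possible way to get the requested layout (frequency plus one median per column, most frequent organism first) is sketched below. It is not a tested solution for the real file. It assumes whitespace-separated fields as in the sample, a header line that the caller drops, and purely numeric values. The names median_plain and summarize are invented for this sketch. For real tab-delimited data with multi-word organism names, split /\t/ would be the safer choice.

```perl
use strict;
use warnings;

# Hypothetical plain-Perl median (an assumption; Acme::Tools would also do).
sub median_plain {
    my @sorted = sort { $a <=> $b } @_;
    return unless @sorted;
    my $mid = int( @sorted / 2 );
    return @sorted % 2
        ? $sorted[$mid]
        : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
}

# Returns one row per organism: [ name, frequency, median of each column ].
sub summarize {
    my @lines = @_;
    my %cols;    # organism => [ [evals], [lengths], [mapped lengths] ]
    for my $line (@lines) {
        # Column 1 (contig) is ignored; assumes whitespace-separated fields.
        my ( undef, $org, @vals ) = split ' ', $line;
        next unless @vals == 3;    # skip blank or malformed lines
        push @{ $cols{$org}[$_] }, $vals[$_] for 0 .. $#vals;
    }
    # Frequency is the number of rows seen for the organism.
    # Sort by frequency (descending), then by name.
    return map {
        my $org = $_;
        [
            $org,
            scalar @{ $cols{$org}[0] },
            map { median_plain(@$_) } @{ $cols{$org} },
        ];
    } sort { @{ $cols{$b}[0] } <=> @{ $cols{$a}[0] } or $a cmp $b }
      keys %cols;
}

my @input = (
    "Contig Organism Eval Length MappedLength",
    "contig1 test1 1e-28 28 55",
    "contig1 test2 1e-10 22 54",
    "contig2 test1 1e-10 24 78",
    "contig3 test2 10 78 57",
    "contig4 test3 1e-5 200 55",
    "contig4 test2 10 100 43",
);
shift @input;    # drop the header line

print join( "\t", qw(Organism Frequency EvalMedian LengthMedian MappedMedian) ), "\n";
print join( "\t", @$_ ), "\n" for summarize(@input);
```

The input is still read only once. The medians it prints are the true medians of the sample rows, so they will differ from the illustrative numbers in the table above, which were flagged as possibly incorrect.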