Paragod28 has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to parse a text delimited file that contains sequence data and find the median on each column per organism. I posted a few days ago but I have still been working on an appropriate solution. Toolic helped me last time and I appreciate it but I still seem to be lost. Toolic did create a working solution but I still needed frequency and the proper format via organism. Thanks again for any help. http://www.perlmonks.org/?node_id=812285

Here is an example of my input data:

contig1 AC344 organism1 1e-1 122 45
contig1 AC344 organism1 1e-2 122 45
contig1 AC346 organism2 1e-102 122 46
contig1 Ac346 organism2 1e-100 122 46
contig1 Ac346 organism2 1e-114 122 46
contig1 Ac346 organism2 1e-111 122 46
contig2 NC333 organism3 1e-2 155 90
contig3 NC444 organism4 1 188 50
contig3 NC444 organism4 12 188 50

The first two columns are correct but it breaks apart when I try to get the medians. Here is an example of my output so far:

Organism Frequency Median_Eval Median_Contig_Length Median_Mapped_Length
organism3: 1 ARRAY(0x2838664) 0
organism1: 2 ARRAY(0x2cc2e4) 0
organism4: 2 ARRAY(0x28386b4) 0
organism2: 4 ARRAY(0x27d11ec) 0

Here is my code so far:

use strict; use warnings; use Acme::Tools; my (%count, %organisms, %med, %number); my $ref_filelist = $ARGV[0]; my ($contig, $accession, $organism, $eval, $con_length, $map_length); open(FILELIST, $ref_filelist ) or die "Could not open Reference filelist...($!)"; print "Organism\tFrequency\tMedian_Eval\tMedian_Contig_Length\tMedian +_Mapped_Length\n"; while (<FILELIST>){ ( $contig, $accession, $organism, $eval, $con_length, $map_length ) = +split ( '\t',); #my $median = $eval[($#eval / 2)]; my $med = median(@{$organisms{$organism}}); $med{$organism} = $med; #$organisms{$organism} = $eval; my $number = ++$count{$organism}; $number{$organism} = $number; } foreach $organism (sort {$number{$a} <=> $number{$b}} keys %organisms) +{ print "$organism:\t$number{$organism}\t$organisms{$organism}\t$med +{$organism}\n" ; }

Replies are listed 'Best First'.
Re: Statistics via hash- NCBI BLAST Tab Delimited file
by bv (Friar) on Dec 14, 2009 at 23:20 UTC

    Your code prints out <c>$organisms{$organism}

    - Where does that get set? The only other reference is on line 19, where you autovivify it as an array ref, which is why it gets stringified like that. You probably meant to push some value onto an array within your loop, and then do the median on the array outside the loop. Try to clean up your code so that the headings make sense for the data you are printing. Your headings are for an organism, a frequency, and 3 medians, but you output the organism, frequency, and two unclear fields. What is %organisms supposed to contain, anyway?


    @_=qw; Just another Perl hacker,; ;$_=q=print "@_"= and eval;
Re: Statistics via hash- NCBI BLAST Tab Delimited file
by desemondo (Hermit) on Dec 15, 2009 at 01:45 UTC
    You've made a good start certainly. Here are a few pointers to consider:

    1. Your input file "contig1 ac346... etc" wasn't in a code block, so the tab characters don't render properly ...

    2. I'm fairly sure others would agree that declaring a scalar with the same name as a hash (or an array) is ambiguous and asking for trouble. In your particular case, Perl does seem to do what you intended, however, later in the future you may extend this script and all of a sudden it breaks.
    my $med = median(@{$organisms{$organism}}); #bad form, masks %med + declared in outer scope. Use a different name for this variable, eg. + $med_calculated. $med{$organism} = $med;
    (Also, same goes for your $number assignment section)

    3. When submitting scripts onto PM, use the <DATA> handle for your source data - it makes things a little more portable, and easier for others to reproduce what your seeing, and then offer assistance or a solution.

    4. I'm not sure why you're calculating the median of $organisms{$organism} before its been populated with anything... maybe you need another loop after you've finished reading in your source data...?

    5. Your headline appears to have 5 columns, but your data printing loop:-
    foreach $organism (sort {$number{$a} <=> $number{$b}} keys %organisms) +{ print "$organism:\t$number{$organism}\t$organisms{$organism}\t$med +{$organism}\n" ; }
    only has 4 values. something quite innocent like this is often an "oh yeh, i know thats wrong, but I'll fix it later". Make things easier for yourself. The k.i.s.s. acronym is a very wise one...

    6. As you've discovered, the median function takes a list as it's input: print median(1, 100, 101);   # 100

    since Perl's hashes and arrays can only ever hold scalars (or references), something isn't quite right there with the the median calculation... A good place to start would be to break down what you've got in 1 step, into 3 or 4... This also makes debugging easier as you can insert warn $varible_name statements to confirm they contain the values you are expecting them to.


    use strict; use warnings; use Acme::Tools; my (%count, %organisms, %med, %number); my ($contig, $accession, $organism, $eval, $con_length, $map_length); #my $ref_filelist = $ARGV[0]; #open(FILELIST, $ref_filelist ) # or die "Could not open Reference filelist...($!)"; print "Organism\tFrequency\tMedian_Eval\tMedian_Contig_Length\tMedian_ +Mapped_Length\n"; #while (<FILELIST>){ while (<DATA>){ ( $contig, $accession, $organism, $eval, $con_length, $map_length +) = split ( '\t',); #my $median = $eval[($#eval / 2)]; my $med_calculated = median(@{$organisms{$organism}}); $med{$organism} = $med_calculated; #$organisms{$organism} = $eval; my $number_calculated = ++$count{$organism}; $number{$organism} = $number_calculated; } foreach $organism (sort {$number{$a} <=> $number{$b}} keys %organisms) +{ print "$organism:\t$number{$organism}\t$organisms{$organism}\t$med +{$organism}\n" ; } __DATA__ contig1 AC344 organism1 1e-1 122 45 contig1 AC344 organism1 1e-2 122 45 contig1 AC346 organism2 1e-102 122 46 contig1 Ac346 organism2 1e-100 122 46 contig1 Ac346 organism2 1e-114 122 46 contig1 Ac346 organism2 1e-111 122 46 contig2 NC333 organism3 1e-2 155 90 contig3 NC444 organism4 1 188 50 contig3 NC444 organism4 12 188 50
Re: Statistics via hash- NCBI BLAST Tab Delimited file
by Anonymous Monk on Dec 15, 2009 at 02:12 UTC
    Using toolic's code from Statistics on Tab Delimited File, I was able to adapt to what you seem to want.
    use strict; use warnings; use Acme::Tools; use Text::Table; my %data; my %freq; while (<DATA>) { my ($organism, @vals) = (split)[2..5]; $freq{ $organism }++; my $col = 3; for my $val (@vals) { push @{ $data{$organism}{$col} }, $val; $col++; } } my @headers = qw/ Organism Frequency Median_Eval Median_Contig_Length + Median_Mapped_Length /; my $tb = Text::Table->new( map {title => $_}, @headers ); for my $test (sort {$freq{ $b } <=> $freq{ $a }} keys %freq) { my @row = ($test, $freq{ $test }); for my $col (sort keys %{ $data{$test} }) { push @row, median(@{ $data{$test}{$col} }); } $tb->load( [@row] ); } print $tb; __DATA__ contig1 AC344 organism1 1e-1 122 45 contig1 AC344 organism1 1e-2 122 45 contig1 AC346 organism2 1e-102 122 46 contig1 Ac346 organism2 1e-100 122 46 contig1 Ac346 organism2 1e-114 122 46 contig1 Ac346 organism2 1e-111 122 46 contig2 NC333 organism3 1e-2 155 90 contig3 NC444 organism4 1 188 50 contig3 NC444 organism4 12 188 50
    Output was
    C:\perlp>812293.pl Organism Frequency Median_Eval Median_Contig_Length Median_Mappe +d_Length organism2 4 5.000000005e-103 122 46 organism1 2 0.055 122 45 organism4 2 6.5 188 50 organism3 1 1e-2 155 90
    Is this what you were looking for?

    Chris

      That is exactly what I was trying to do. Now that I look over my attempt I can see some of the mistakes you all have pointed out. I appreciate the help from all of you monks. I will go through the examples so I can understand the code behind it. I may have some more questions on just exactly why certain things were done. Thank you all!

      I noticed 2 changes that should be made.
      while (<DATA>) { chomp; my ($organism, @vals) = (split /\t/)[2..5]; $freq{ $organism }++;

      The line sfter the read needs to be chomped. And, the split should occur on tabs, not on whitespace. (When I ran the test program above, I split on whitespace instead of tabs because the sample data I copied into the program was separated by spaces, not tabs).

      As long as $organism itself doesn't contain spaces you could split on whitespace. But I don't know enough about your data.

      Chris

        Yes I caught that. My data does have whitespace in the organism column. Thanks. I have been trying to work through the code and have another question. How can I add column 2(text) of the original data to the output and keep it referenced to organism?

        Yes you are correct Chris. I was having difficulty keeping column 2(accession) referenced to column 3(organism). I understand the code as the values are pushed into the array but I did not understand how to keep the above referenced to each other. I assume a hash but I was not sure how to implement it along with the other hashes. I was also having a problem with the organism names being too long. I came up with a solution for both. I would rather learn than be given a solution so let me know if this is acceptable practice. Again, thanks for your help.

        use strict; use warnings; use Acme::Tools; use Text::Table; my %data; my %freq; my $ref_filelist = $ARGV[0]; open(BLASTFILE, $ref_filelist ) or die "Could not open Reference filelist...($!)"; while (<BLASTFILE>) { chomp; my ($accession,$organism, @vals) = (split /\t/)[1..5]; my $organismcut = substr( $organism, 0,75 ); my $tot =$accession . $organismcut; #print "$tot\n"; $freq{ $tot }++; my $col = 4; for my $val (@vals) { push @{ $data{$tot} {$col} }, $val; $col++; } } my @headers = qw/ Organism Freq Median_Eval Med_Contig_Length Med_Map +ped_Length /; my $tb = Text::Table->new( map {title => $_}, @headers); for my $test (sort {$freq{ $b } <=> $freq{ $a }} keys %freq) { my @row = ($test, $freq{ $test }); for my $col (sort keys %{ $data{$test} }) { push @row, median(@{ $data{$test}{$col} }); } $tb->load( [@row] ); #print "@row\n"; } print $tb;