Statistics via hash- NCBI BLAST Tab Delimited file

Paragod28 has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to parse a text delimited file that contains sequence data and find the median on each column per organism. I posted a few days ago but I have still been working on an appropriate solution. Toolic helped me last time and I appreciate it but I still seem to be lost. Toolic did create a working solution but I still needed frequency and the proper format via organism. Thanks again for any help. http://www.perlmonks.org/?node_id=812285

Here is an example of my input data:

contig1 AC344 organism1 1e-1 122 45
contig1 AC344 organism1 1e-2 122 45
contig1 AC346 organism2 1e-102 122 46
contig1 Ac346 organism2 1e-100 122 46
contig1 Ac346 organism2 1e-114 122 46
contig1 Ac346 organism2 1e-111 122 46
contig2 NC333 organism3 1e-2 155 90
contig3 NC444 organism4 1 188 50
contig3 NC444 organism4 12 188 50

The first two columns are correct but it breaks apart when I try to get the medians. Here is an example of my output so far:

Organism Frequency Median_Eval Median_Contig_Length Median_Mapped_Length
organism3: 1 ARRAY(0x2838664) 0
organism1: 2 ARRAY(0x2cc2e4) 0
organism4: 2 ARRAY(0x28386b4) 0
organism2: 4 ARRAY(0x27d11ec) 0

Here is my code so far:

use strict;
use warnings;


use Acme::Tools;

my (%count, %organisms, %med, %number);
my $ref_filelist = $ARGV[0];
my ($contig, $accession, $organism, $eval, $con_length, $map_length);
open(FILELIST, $ref_filelist )
   or die "Could not open Reference filelist...($!)";
 
 print "Organism\tFrequency\tMedian_Eval\tMedian_Contig_Length\tMedian
+_Mapped_Length\n"; 
   while (<FILELIST>){
      
( $contig, $accession, $organism, $eval, $con_length, $map_length ) = 
+split ( '\t',);

#my $median = $eval[($#eval / 2)];
my $med = median(@{$organisms{$organism}});
$med{$organism} = $med;

#$organisms{$organism} = $eval;

my $number = ++$count{$organism};
$number{$organism} = $number;

}
foreach $organism (sort {$number{$a} <=> $number{$b}} keys %organisms)
+{
    print "$organism:\t$number{$organism}\t$organisms{$organism}\t$med
+{$organism}\n" ;
}
[download]

Comment on Statistics via hash- NCBI BLAST Tab Delimited file Download Code

Replies are listed 'Best First'.
Re: Statistics via hash- NCBI BLAST Tab Delimited file by bv (Friar) on Dec 14, 2009 at 23:20 UTC
Your code prints out <c>$organisms{$organism} - Where does that get set? The only other reference is on line 19, where you autovivify it as an array ref, which is why it gets stringified like that. You probably meant to push some value onto an array within your loop, and then do the median on the array outside the loop. Try to clean up your code so that the headings make sense for the data you are printing. Your headings are for an organism, a frequency, and 3 medians, but you output the organism, frequency, and two unclear fields. What is %organisms supposed to contain, anyway? `@_=qw; Just another Perl hacker,; ;$_=q=print "@_"= and eval;` [download]	[reply] [d/l]
Re: Statistics via hash- NCBI BLAST Tab Delimited file by desemondo (Hermit) on Dec 15, 2009 at 01:45 UTC
You've made a good start certainly. Here are a few pointers to consider: 1. Your input file "contig1 ac346... etc" wasn't in a code block, so the tab characters don't render properly ... 2. I'm fairly sure others would agree that declaring a scalar with the same name as a hash (or an array) is ambiguous and asking for trouble. In your particular case, Perl does seem to do what you intended, however, later in the future you may extend this script and all of a sudden it breaks. `my $med = median(@{$organisms{$organism}}); #bad form, masks %med + declared in outer scope. Use a different name for this variable, eg. + $med_calculated. $med{$organism} = $med;` [download] (Also, same goes for your $number assignment section) 3. When submitting scripts onto PM, use the <DATA> handle for your source data - it makes things a little more portable, and easier for others to reproduce what your seeing, and then offer assistance or a solution. 4. I'm not sure why you're calculating the median of `$organisms{$organism}` before its been populated with anything... maybe you need another loop after you've finished reading in your source data...? 5. Your headline appears to have 5 columns, but your data printing loop:- `foreach $organism (sort {$number{$a} <=> $number{$b}} keys %organisms) +{ print "$organism:\t$number{$organism}\t$organisms{$organism}\t$med +{$organism}\n" ; }` [download] only has 4 values. something quite innocent like this is often an "oh yeh, i know thats wrong, but I'll fix it later". Make things easier for yourself. The k.i.s.s. acronym is a very wise one... 6. As you've discovered, the median function takes a list as it's input: `print median(1, 100, 101); # 100` since Perl's hashes and arrays can only ever hold scalars (or references), something isn't quite right there with the the median calculation... A good place to start would be to break down what you've got in 1 step, into 3 or 4... This also makes debugging easier as you can insert `warn $varible_name` statements to confirm they contain the values you are expecting them to. use strict; use warnings; use Acme::Tools; my (%count, %organisms, %med, %number); my ($contig, $accession, $organism, $eval, $con_length, $map_length); #my $ref_filelist = $ARGV[0]; #open(FILELIST, $ref_filelist ) # or die "Could not open Reference filelist...($!)"; print "Organism\tFrequency\tMedian_Eval\tMedian_Contig_Length\tMedian_ +Mapped_Length\n"; #while (<FILELIST>){ while (<DATA>){ ( $contig, $accession, $organism, $eval, $con_length, $map_length +) = split ( '\t',); #my $median = $eval[($#eval / 2)]; my $med_calculated = median(@{$organisms{$organism}}); $med{$organism} = $med_calculated; #$organisms{$organism} = $eval; my $number_calculated = ++$count{$organism}; $number{$organism} = $number_calculated; } foreach $organism (sort {$number{$a} <=> $number{$b}} keys %organisms) +{ print "$organism:\t$number{$organism}\t$organisms{$organism}\t$med +{$organism}\n" ; } __DATA__ contig1 AC344 organism1 1e-1 122 45 contig1 AC344 organism1 1e-2 122 45 contig1 AC346 organism2 1e-102 122 46 contig1 Ac346 organism2 1e-100 122 46 contig1 Ac346 organism2 1e-114 122 46 contig1 Ac346 organism2 1e-111 122 46 contig2 NC333 organism3 1e-2 155 90 contig3 NC444 organism4 1 188 50 contig3 NC444 organism4 12 188 50 [download]	[reply] [d/l] [select]
Re: Statistics via hash- NCBI BLAST Tab Delimited file by Anonymous Monk on Dec 15, 2009 at 02:12 UTC
Using toolic's code from Statistics on Tab Delimited File, I was able to adapt to what you seem to want. use strict; use warnings; use Acme::Tools; use Text::Table; my %data; my %freq; while (<DATA>) { my ($organism, @vals) = (split)[2..5]; $freq{ $organism }++; my $col = 3; for my $val (@vals) { push @{ $data{$organism}{$col} }, $val; $col++; } } my @headers = qw/ Organism Frequency Median_Eval Median_Contig_Length + Median_Mapped_Length /; my $tb = Text::Table->new( map {title => $_}, @headers ); for my $test (sort {$freq{ $b } <=> $freq{ $a }} keys %freq) { my @row = ($test, $freq{ $test }); for my $col (sort keys %{ $data{$test} }) { push @row, median(@{ $data{$test}{$col} }); } $tb->load( [@row] ); } print $tb; __DATA__ contig1 AC344 organism1 1e-1 122 45 contig1 AC344 organism1 1e-2 122 45 contig1 AC346 organism2 1e-102 122 46 contig1 Ac346 organism2 1e-100 122 46 contig1 Ac346 organism2 1e-114 122 46 contig1 Ac346 organism2 1e-111 122 46 contig2 NC333 organism3 1e-2 155 90 contig3 NC444 organism4 1 188 50 contig3 NC444 organism4 12 188 50 [download] Output was `C:\perlp>812293.pl Organism Frequency Median_Eval Median_Contig_Length Median_Mappe +d_Length organism2 4 5.000000005e-103 122 46 organism1 2 0.055 122 45 organism4 2 6.5 188 50 organism3 1 1e-2 155 90` [download] Is this what you were looking for? Chris	[reply] [d/l] [select]
Re^2: Statistics via hash- NCBI BLAST Tab Delimited file by Paragod28 (Novice) on Dec 15, 2009 at 15:40 UTC
That is exactly what I was trying to do. Now that I look over my attempt I can see some of the mistakes you all have pointed out. I appreciate the help from all of you monks. I will go through the examples so I can understand the code behind it. I may have some more questions on just exactly why certain things were done. Thank you all!	[reply]
Re^2: Statistics via hash- NCBI BLAST Tab Delimited file by Cristoforo (Curate) on Dec 15, 2009 at 20:28 UTC
I noticed 2 changes that should be made. `while (<DATA>) { chomp; my ($organism, @vals) = (split /\t/)[2..5]; $freq{ $organism }++;` [download] The line sfter the read needs to be chomped. And, the split should occur on tabs, not on whitespace. (When I ran the test program above, I split on whitespace instead of tabs because the sample data I copied into the program was separated by spaces, not tabs). As long as `$organism` itself doesn't contain spaces you could split on whitespace. But I don't know enough about your data. Chris	[reply] [d/l] [select]
Re^3: Statistics via hash- NCBI BLAST Tab Delimited file by Paragod28 (Novice) on Dec 16, 2009 at 15:27 UTC
Yes I caught that. My data does have whitespace in the organism column. Thanks. I have been trying to work through the code and have another question. How can I add column 2(text) of the original data to the output and keep it referenced to organism?	[reply]
Re^3: Statistics via hash- NCBI BLAST Tab Delimited file by Paragod28 (Novice) on Dec 16, 2009 at 20:37 UTC
Yes you are correct Chris. I was having difficulty keeping column 2(accession) referenced to column 3(organism). I understand the code as the values are pushed into the array but I did not understand how to keep the above referenced to each other. I assume a hash but I was not sure how to implement it along with the other hashes. I was also having a problem with the organism names being too long. I came up with a solution for both. I would rather learn than be given a solution so let me know if this is acceptable practice. Again, thanks for your help. use strict; use warnings; use Acme::Tools; use Text::Table; my %data; my %freq; my $ref_filelist = $ARGV[0]; open(BLASTFILE, $ref_filelist ) or die "Could not open Reference filelist...($!)"; while (<BLASTFILE>) { chomp; my ($accession,$organism, @vals) = (split /\t/)[1..5]; my $organismcut = substr( $organism, 0,75 ); my $tot =$accession . $organismcut; #print "$tot\n"; $freq{ $tot }++; my $col = 4; for my $val (@vals) { push @{ $data{$tot} {$col} }, $val; $col++; } } my @headers = qw/ Organism Freq Median_Eval Med_Contig_Length Med_Map +ped_Length /; my $tb = Text::Table->new( map {title => $_}, @headers); for my $test (sort {$freq{ $b } <=> $freq{ $a }} keys %freq) { my @row = ($test, $freq{ $test }); for my $col (sort keys %{ $data{$test} }) { push @row, median(@{ $data{$test}{$col} }); } $tb->load( [@row] ); #print "@row\n"; } print $tb; [download]	[reply] [d/l]
Re^4: Statistics via hash- NCBI BLAST Tab Delimited file by Cristoforo (Curate) on Dec 17, 2009 at 01:33 UTC