Re: Statistics via hash- NCBI BLAST Tab Delimited file

Using toolic's code from Statistics on Tab Delimited File, I was able to adapt to what you seem to want.

use strict;
use warnings;
use Acme::Tools;
use Text::Table;

my %data;
my %freq;

while (<DATA>) {
    my ($organism, @vals) = (split)[2..5];
    $freq{ $organism }++;
    my $col = 3;
    for my $val (@vals) {
        push @{ $data{$organism}{$col} }, $val;
        $col++;
    }
}

my @headers = qw/ Organism Frequency Median_Eval Median_Contig_Length Median_Mapped_Length /;

my $tb = Text::Table->new( map {title => $_}, @headers );

for my $test (sort {$freq{ $b } <=> $freq{ $a }} keys %freq) {
    my @row = ($test, $freq{ $test });
    for my $col (sort keys %{ $data{$test} }) {
        push @row, median(@{ $data{$test}{$col} });
    }
    $tb->load( [@row] );
}
print $tb;
__DATA__
contig1 AC344 organism1 1e-1 122 45
contig1 AC344 organism1 1e-2 122 45
contig1 AC346 organism2 1e-102 122 46
contig1 Ac346 organism2 1e-100 122 46
contig1 Ac346 organism2 1e-114 122 46
contig1 Ac346 organism2 1e-111 122 46
contig2 NC333 organism3 1e-2 155 90
contig3 NC444 organism4 1 188 50
contig3 NC444 organism4 12 188 50

Output was

C:\perlp>812293.pl
Organism  Frequency Median_Eval      Median_Contig_Length Median_Mapped_Length
organism2 4         5.000000005e-103 122                  46
organism1 2         0.055            122                  45
organism4 2         6.5              188                  50
organism3 1         1e-2             155                  90
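
The median values in the table (for example 0.055 for the two organism1 e-values 1e-1 and 1e-2) come from averaging the two middle values when a group has an even number of entries. A minimal sketch of that calculation follows; `median_of` is a hypothetical helper written for illustration and is not the Acme::Tools implementation.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not Acme::Tools' code):
# sort numerically, then take the middle value, or the mean of the two
# middle values when the count is even.
sub median_of {
    my @sorted = sort { $a <=> $b } @_;
    return unless @sorted;
    my $mid = int(@sorted / 2);
    return @sorted % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Values taken from the sample data for organism1 and organism4
print median_of(qw/1e-1 1e-2/), "\n";
print median_of(qw/1 12/), "\n";
```

Perl converts strings such as `1e-102` to numbers on demand, which is why the e-values can be pushed into the hash as plain text and still be sorted and averaged.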

Is this what you were looking for?

Chris

Re^2: Statistics via hash- NCBI BLAST Tab Delimited file by Paragod28 (Novice) on Dec 15, 2009 at 15:40 UTC
That is exactly what I was trying to do. Now that I look over my attempt I can see some of the mistakes you all have pointed out. I appreciate the help from all of you monks. I will go through the examples so I can understand the code behind it. I may have some more questions on just exactly why certain things were done. Thank you all!
Re^2: Statistics via hash- NCBI BLAST Tab Delimited file by Cristoforo (Curate) on Dec 15, 2009 at 20:28 UTC
I noticed 2 changes that should be made.

while (<DATA>) {
    chomp;
    my ($organism, @vals) = (split /\t/)[2..5];
    $freq{ $organism }++;

The line after the read needs to be chomped, and the split should occur on tabs, not on whitespace. (When I ran the test program above, I split on whitespace instead of tabs because the sample data I copied into the program was separated by spaces, not tabs.) As long as `$organism` itself doesn't contain spaces you could split on whitespace, but I don't know enough about your data.

Chris
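
To illustrate why the delimiter matters, here is a small sketch using a made-up record whose organism field contains a space. The record contents and variable names are assumptions chosen for the example; they are not taken from the real BLAST file.

```perl
use strict;
use warnings;

# Hypothetical tab-delimited record; the organism field contains a space.
my $line = "contig1\tAC344\tHomo sapiens\t1e-2\t122\t45\n";

# Splitting on whitespace cuts the organism name in two, so every
# later field shifts one position to the right.
my @by_space = (split ' ', $line)[2..5];

# chomp first, then split on tabs only: each field stays intact and
# the last field does not carry the trailing newline.
chomp(my $clean = $line);
my ($organism, @vals) = (split /\t/, $clean)[2..5];

print "whitespace split: @by_space\n";
print "tab split: $organism | @vals\n";
```

With the whitespace split, the slice picks up `Homo`, `sapiens`, the e-value, and the contig length, and the mapped length falls outside the slice.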
Re^3: Statistics via hash- NCBI BLAST Tab Delimited file by Paragod28 (Novice) on Dec 16, 2009 at 15:27 UTC
Yes I caught that. My data does have whitespace in the organism column. Thanks. I have been trying to work through the code and have another question. How can I add column 2 (text) of the original data to the output and keep it referenced to organism?
Re^3: Statistics via hash- NCBI BLAST Tab Delimited file by Paragod28 (Novice) on Dec 16, 2009 at 20:37 UTC
Yes you are correct Chris. I was having difficulty keeping column 2 (accession) referenced to column 3 (organism). I understand the code as the values are pushed into the array but I did not understand how to keep the above referenced to each other. I assume a hash but I was not sure how to implement it along with the other hashes. I was also having a problem with the organism names being too long. I came up with a solution for both. I would rather learn than be given a solution so let me know if this is acceptable practice. Again, thanks for your help.

use strict;
use warnings;
use Acme::Tools;
use Text::Table;

my %data;
my %freq;

my $ref_filelist = $ARGV[0];
open(BLASTFILE, $ref_filelist ) or die "Could not open Reference filelist...($!)";

while (<BLASTFILE>) {
    chomp;
    my ($accession,$organism, @vals) = (split /\t/)[1..5];
    my $organismcut = substr( $organism, 0,75 );
    my $tot =$accession . $organismcut;
    #print "$tot\n";
    $freq{ $tot }++;
    my $col = 4;
    for my $val (@vals) {
        push @{ $data{$tot} {$col} }, $val;
        $col++;
    }
}

my @headers = qw/ Organism Freq Median_Eval Med_Contig_Length Med_Mapped_Length /;

my $tb = Text::Table->new( map {title => $_}, @headers);

for my $test (sort {$freq{ $b } <=> $freq{ $a }} keys %freq) {
    my @row = ($test, $freq{ $test });
    for my $col (sort keys %{ $data{$test} }) {
        push @row, median(@{ $data{$test}{$col} });
    }
    $tb->load( [@row] );
    #print "@row\n";
}
print $tb;
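
One possible variation on the combined key, offered only as a sketch: joining accession and organism with a tab (which cannot appear inside a field of a tab-delimited file) lets the key be split back into separate columns at output time. The sample rows and the array-based layout of `%data` below are assumptions made for the example, not part of the posted script.

```perl
use strict;
use warnings;

# Hypothetical rows: contig, accession, organism, e-value,
# contig length, mapped length (tab-delimited).
my @lines = (
    "contig1\tAC344\tHomo sapiens\t1e-1\t122\t45",
    "contig1\tAC344\tHomo sapiens\t1e-2\t122\t45",
    "contig2\tNC333\tMus musculus\t1e-2\t155\t90",
);

my (%freq, %data);
for my $line (@lines) {
    my ($accession, $organism, @vals) = (split /\t/, $line)[1..5];

    # Composite key: a tab separator keeps the two parts recoverable.
    my $key = join "\t", $accession, $organism;
    $freq{$key}++;

    # One array of collected values per numeric column.
    push @{ $data{$key}[$_] }, $vals[$_] for 0 .. $#vals;
}

# Most frequent first; ties broken alphabetically for stable output.
for my $key (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
    my ($accession, $organism) = split /\t/, $key;
    print join(', ', $accession, $organism, $freq{$key}), "\n";
}
```

Because the accession and organism come back as separate values, each can be given its own table column or truncated independently.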
Re^4: Statistics via hash- NCBI BLAST Tab Delimited file by Cristoforo (Curate) on Dec 17, 2009 at 01:33 UTC
I don't think Text::Table is the right tool for this application. It is only really useful for about a page worth of reading. Your first column alone will be 75 characters plus the length of the accession. How do you plan to view the data?

A couple of ideas that occurred to me would be to create a comma separated values file or use Perl6::Form (like I did in Re: Formatting text, eg long lines). (You could wrap that long first column so it wouldn't run across the page.)

With a comma separated values file, you could open the file in Excel (if you are on a Windows machine) and, if the dataset is large (more than a couple of pages), you can freeze the headers in Excel while still scrolling your data. If your results will have many lines, with Perl6::Form you could arrange for headers to print after so many lines.

~~I say that about Text::Table because I'm guessing your results will be more than a couple of pages. If not, Text::Table would be OK.~~

If there are a lot of rows to print, you might want to print your header every 50 lines or so to aid your readers. One way to do it: instead of `print $tb` at the end of your script, the following would repeat the header every 50 lines.

my $rows = $tb->body_height();
my $pagelines = 50;

for my $i (0 .. $rows-1) {
    print $tb->title() if $i % $pagelines == 0;
    print $tb->body($i);
    print "\n" if $i % $pagelines == $pagelines-1;
}

Chris

Update: set `$rows` to the number of rows in the table. I was getting the number of items from `keys %freq`.
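
The comma separated values idea could be sketched roughly as follows, using only core Perl. The helpers `csv_field` and `csv_line` and the sample rows are hypothetical, written for this illustration; for real data a dedicated module such as Text::CSV from CPAN is probably the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a field only when it contains a comma,
# a double quote, or a line break; embedded quotes are doubled.
sub csv_field {
    my ($field) = @_;
    if ($field =~ /[",\r\n]/) {
        $field =~ s/"/""/g;
        return qq{"$field"};
    }
    return $field;
}

# Hypothetical helper: build one CSV line from a list of fields.
sub csv_line {
    return join(',', map { csv_field($_) } @_) . "\n";
}

my @headers = qw/ Organism Frequency Median_Eval Median_Contig_Length Median_Mapped_Length /;

# Made-up rows for illustration only.
my @rows = (
    [ 'Homo sapiens, isolate X', 4, '5e-103', 122, 46 ],
    [ 'organism1',               2, 0.055,    122, 45 ],
);

print csv_line(@headers);
print csv_line(@$_) for @rows;
```

Redirecting the output to a file with a `.csv` extension would let a spreadsheet program open it, with the long organism names confined to their own column.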