in reply to Statistics via hash- NCBI BLAST Tab Delimited file

Using toolic's code from Statistics on Tab Delimited File, I was able to adapt it to what you seem to want.
use strict;
use warnings;
use Acme::Tools;
use Text::Table;

my %data;
my %freq;

while (<DATA>) {
    my ($organism, @vals) = (split)[2..5];
    $freq{ $organism }++;
    my $col = 3;
    for my $val (@vals) {
        push @{ $data{$organism}{$col} }, $val;
        $col++;
    }
}

my @headers = qw/ Organism Frequency Median_Eval Median_Contig_Length Median_Mapped_Length /;
my $tb = Text::Table->new( map {title => $_}, @headers );

for my $test (sort {$freq{ $b } <=> $freq{ $a }} keys %freq) {
    my @row = ($test, $freq{ $test });
    for my $col (sort keys %{ $data{$test} }) {
        push @row, median(@{ $data{$test}{$col} });
    }
    $tb->load( [@row] );
}
print $tb;

__DATA__
contig1 AC344 organism1 1e-1 122 45
contig1 AC344 organism1 1e-2 122 45
contig1 AC346 organism2 1e-102 122 46
contig1 Ac346 organism2 1e-100 122 46
contig1 Ac346 organism2 1e-114 122 46
contig1 Ac346 organism2 1e-111 122 46
contig2 NC333 organism3 1e-2 155 90
contig3 NC444 organism4 1 188 50
contig3 NC444 organism4 12 188 50
Output was
C:\perlp>812293.pl
Organism  Frequency Median_Eval      Median_Contig_Length Median_Mapped_Length
organism2 4         5.000000005e-103 122                  46
organism1 2         0.055            122                  45
organism4 2         6.5              188                  50
organism3 1         1e-2             155                  90
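For anyone without Acme::Tools installed, here is a minimal sketch of how a median can be computed in plain Perl. The name `median_of` is my own, and this is not necessarily how Acme::Tools implements `median`; it just uses the usual definition (middle value for an odd count, mean of the two middle values for an even count).

```perl
use strict;
use warnings;

# Hypothetical helper (not the Acme::Tools implementation):
# returns the median of a list of numbers, or nothing for an empty list.
sub median_of {
    my @sorted = sort { $a <=> $b } @_;
    return unless @sorted;
    my $mid = int( @sorted / 2 );
    # Odd count: the middle value. Even count: mean of the two middle values.
    return @sorted % 2
        ? $sorted[$mid]
        : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
}

# E-values taken from the sample data for organism2 and organism1.
my @organism2_evalues = qw/ 1e-102 1e-100 1e-114 1e-111 /;
my @organism1_evalues = qw/ 1e-1 1e-2 /;

print median_of(@organism2_evalues), "\n";
print median_of(@organism1_evalues), "\n";
```

With the sample data, the two printed values should correspond to the Median_Eval column for organism2 and organism1.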
Is this what you were looking for?

Chris

Re^2: Statistics via hash- NCBI BLAST Tab Delimited file
by Paragod28 (Novice) on Dec 15, 2009 at 15:40 UTC

    That is exactly what I was trying to do. Now that I look over my attempt I can see some of the mistakes you all have pointed out. I appreciate the help from all of you monks. I will go through the examples so I can understand the code behind it. I may have some more questions on just exactly why certain things were done. Thank you all!

Re^2: Statistics via hash- NCBI BLAST Tab Delimited file
by Cristoforo (Curate) on Dec 15, 2009 at 20:28 UTC
    I noticed 2 changes that should be made.
    while (<DATA>) {
        chomp;
        my ($organism, @vals) = (split /\t/)[2..5];
        $freq{ $organism }++;

    Each line needs to be chomped after it is read. And the split should occur on tabs, not on whitespace. (When I ran the test program above, I split on whitespace instead of tabs because the sample data I copied into the program was separated by spaces, not tabs.)

    As long as $organism itself doesn't contain spaces you could split on whitespace. But I don't know enough about your data.
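To make the difference concrete, here is a small sketch. The record is made up (I am assuming an organism name with a space in it, such as 'Homo sapiens'); it shows what the tab split, the whitespace split, and a missing chomp each do to the fields.

```perl
use strict;
use warnings;

# Hypothetical tab-separated record with a space inside the organism name.
my $line = join( "\t", 'contig1', 'AC344', 'Homo sapiens', '1e-1', 122, 45 ) . "\n";

# Without chomp, the newline stays attached to the last field.
my @unchomped = split /\t/, $line;

chomp( my $record = $line );

# Splitting on tabs keeps the organism name in one field.
my @by_tab = split /\t/, $record;

# Splitting on whitespace breaks the organism name into two fields.
my @by_whitespace = split ' ', $record;

print scalar(@by_tab),        " fields, organism = '$by_tab[2]'\n";
print scalar(@by_whitespace), " fields, organism = '$by_whitespace[2]'\n";
print "last field still ends in a newline without chomp\n"
    if $unchomped[-1] =~ /\n\z/;
```

The tab split gives 6 fields with the organism intact, while the whitespace split gives 7 and shifts every later column by one.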

    Chris

      Yes I caught that. My data does have whitespace in the organism column. Thanks. I have been trying to work through the code and have another question. How can I add column 2(text) of the original data to the output and keep it referenced to organism?

      Yes you are correct Chris. I was having difficulty keeping column 2(accession) referenced to column 3(organism). I understand the code as the values are pushed into the array but I did not understand how to keep the above referenced to each other. I assume a hash but I was not sure how to implement it along with the other hashes. I was also having a problem with the organism names being too long. I came up with a solution for both. I would rather learn than be given a solution so let me know if this is acceptable practice. Again, thanks for your help.

      use strict;
      use warnings;
      use Acme::Tools;
      use Text::Table;

      my %data;
      my %freq;
      my $ref_filelist = $ARGV[0];
      open(BLASTFILE, $ref_filelist ) or die "Could not open Reference filelist...($!)";

      while (<BLASTFILE>) {
          chomp;
          my ($accession,$organism, @vals) = (split /\t/)[1..5];
          my $organismcut = substr( $organism, 0,75 );
          my $tot =$accession . $organismcut;
          #print "$tot\n";
          $freq{ $tot }++;
          my $col = 4;
          for my $val (@vals) {
              push @{ $data{$tot} {$col} }, $val;
              $col++;
          }
      }

      my @headers = qw/ Organism Freq Median_Eval Med_Contig_Length Med_Mapped_Length /;
      my $tb = Text::Table->new( map {title => $_}, @headers);

      for my $test (sort {$freq{ $b } <=> $freq{ $a }} keys %freq) {
          my @row = ($test, $freq{ $test });
          for my $col (sort keys %{ $data{$test} }) {
              push @row, median(@{ $data{$test}{$col} });
          }
          $tb->load( [@row] );
          #print "@row\n";
      }
      print $tb;
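For comparison, here is a sketch of another way to keep the accession tied to the organism without gluing the two strings into one key: a second hash, keyed by organism, that records which accessions were seen. The hash name `%accessions` and the sample records are my own assumptions, and the column order is taken from the data earlier in the thread. Concatenating the columns, as in the code above, also works; this variant just keeps the two values separate for printing.

```perl
use strict;
use warnings;

my ( %freq, %accessions );

# Hypothetical tab-separated records: contig, accession, organism, e-value, lengths.
my @lines = (
    "contig1\tAC344\tHomo sapiens\t1e-1\t122\t45",
    "contig1\tAC344\tHomo sapiens\t1e-2\t122\t45",
    "contig2\tNC333\tMus musculus\t1e-2\t155\t90",
);

for my $line (@lines) {
    my ( $accession, $organism ) = ( split /\t/, $line )[ 1, 2 ];
    $freq{$organism}++;
    # Count each accession seen for this organism.
    $accessions{$organism}{$accession}++;
}

# Most frequent organism first; ties broken alphabetically.
for my $organism ( sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq ) {
    my $seen = join ',', sort keys %{ $accessions{$organism} };
    print join( "\t", $organism, $seen, $freq{$organism} ), "\n";
}
```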
        I don't think Text::Table is the right tool for this application. It is only really useful for about a page's worth of reading. Your first column alone will be up to 75 characters plus the length of the accession. How do you plan to view the data? A couple of ideas that occurred to me would be to create a comma-separated values file or to use Perl6::Form (like I did in Re: Formatting text, eg long lines). (You could wrap that long first column so it wouldn't run across the page.)

        With a comma separated values file format, you could open the files in Excel, (if you are on a Windows machine), and, if there is a large dataset, (more than a couple of pages), you can freeze the headers in Excel while still scrolling your data.
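A rough sketch of the CSV idea follows. The helpers `csv_field` and `csv_line` are names I made up for illustration, and the quoting is only the common convention (wrap a field in double quotes if it contains a comma, quote, or line break, and double any embedded quotes). The rows are the first two lines of the output shown earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: quote one field using the common CSV convention.
sub csv_field {
    my ($field) = @_;
    if ( $field =~ /[",\r\n]/ ) {
        $field =~ s/"/""/g;          # double any embedded quotes
        return qq{"$field"};
    }
    return $field;
}

# Hypothetical helper: join a list of fields into one CSV line.
sub csv_line {
    return join( ',', map { csv_field($_) } @_ ) . "\n";
}

my @headers = qw/ Organism Frequency Median_Eval Median_Contig_Length Median_Mapped_Length /;
my @rows = (
    [ 'organism2', 4, '5.000000005e-103', 122, 46 ],
    [ 'organism1', 2, 0.055,              122, 45 ],
);

print csv_line(@headers);
print csv_line(@$_) for @rows;
```

For real data, the Text::CSV module on CPAN handles the edge cases more thoroughly than a hand-rolled helper like this.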

        If your output will run to many lines, with Perl6::Form you could arrange for the headers to print again after a set number of lines.

        I say that about Text::Table because I'm guessing your results will be more than a couple of pages. If not, Text::Table would be OK.

        If there are a lot of rows to print, you might want to print your header every 50 lines or so to aid your readers. A way to do it is below:

        Instead of print $tb at the end of your script, the following would repeat the header every 50 lines.

        my $rows = $tb->body_height();
        my $pagelines = 50;
        for my $i (0 .. $rows-1) {
            print $tb->title() if $i % $pagelines == 0;
            print $tb->body($i);
            print "\n" if $i % $pagelines == $pagelines-1;
        }

        Chris

        Update: set rows to the number of rows in table.
        I was getting the number of items from keys %freq.