in reply to Re: Statistics via hash- NCBI BLAST Tab Delimited file
in thread Statistics via hash- NCBI BLAST Tab Delimited file

I noticed two changes that should be made:

    while (<DATA>) {
        chomp;
        my ($organism, @vals) = (split /\t/)[2..5];
        $freq{ $organism }++;

The line after the read needs to be chomped. And the split should occur on tabs, not on whitespace. (When I ran the test program above, I split on whitespace instead of tabs because the sample data I copied into the program was separated by spaces, not tabs.)

As long as $organism itself doesn't contain spaces you could split on whitespace. But I don't know enough about your data.

Chris

Re^3: Statistics via hash- NCBI BLAST Tab Delimited file
by Paragod28 (Novice) on Dec 16, 2009 at 15:27 UTC

    Yes, I caught that. My data does have whitespace in the organism column. Thanks. I have been working through the code and have another question: how can I add column 2 (text) of the original data to the output and keep it referenced to the organism?

Re^3: Statistics via hash- NCBI BLAST Tab Delimited file
by Paragod28 (Novice) on Dec 16, 2009 at 20:37 UTC

    Yes, you are correct, Chris. I was having difficulty keeping column 2 (accession) referenced to column 3 (organism). I understand the code as the values are pushed into the array, but I did not understand how to keep the two columns referenced to each other. I assumed a hash, but I was not sure how to implement it along with the other hashes. I was also having a problem with the organism names being too long. I came up with a solution for both. I would rather learn than be given a solution, so let me know if this is acceptable practice. Again, thanks for your help.

    use strict;
    use warnings;
    use Acme::Tools;
    use Text::Table;

    my %data;
    my %freq;
    my $ref_filelist = $ARGV[0];
    open(BLASTFILE, $ref_filelist) or die "Could not open Reference filelist...($!)";
    while (<BLASTFILE>) {
        chomp;
        my ($accession, $organism, @vals) = (split /\t/)[1..5];
        my $organismcut = substr( $organism, 0, 75 );
        my $tot = $accession . $organismcut;
        #print "$tot\n";
        $freq{ $tot }++;
        my $col = 4;
        for my $val (@vals) {
            push @{ $data{$tot}{$col} }, $val;
            $col++;
        }
    }
    my @headers = qw/ Organism Freq Median_Eval Med_Contig_Length Med_Mapped_Length /;
    my $tb = Text::Table->new( map {title => $_}, @headers );
    for my $test ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
        my @row = ( $test, $freq{$test} );
        for my $col ( sort keys %{ $data{$test} } ) {
            push @row, median( @{ $data{$test}{$col} } );
        }
        $tb->load( [@row] );
        #print "@row\n";
    }
    print $tb;
      I don't think Text::Table is the right tool for this application. It is only really useful for about a page's worth of reading, and your first column alone will be 75-plus spaces wide. How do you plan to view the data? A couple of ideas that occurred to me: create a comma-separated values file, or use Perl6::Form (as I did in Re: Formatting text, eg long lines). (You could wrap that long first column so it wouldn't run across the page.)

      With a comma-separated values file, you could open it in Excel (if you are on a Windows machine), and if there is a large dataset (more than a couple of pages), you can freeze the headers in Excel while still scrolling your data.
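      For the CSV route, here is a minimal, self-contained sketch. The sample %freq/%data values and the stand-in median() are made up for illustration; your real script would reuse the hashes built above and the median() from Acme::Tools:

          use strict;
          use warnings;

          # Stand-in median, so the sketch runs without Acme::Tools (illustrative only).
          sub median {
              my @s = sort { $a <=> $b } @_;
              my $mid = int( @s / 2 );
              return @s % 2 ? $s[$mid] : ( $s[ $mid - 1 ] + $s[$mid] ) / 2;
          }

          # Made-up sample data in the same shape as %freq and %data above.
          my %freq = ( 'ABC123 Homo sapiens' => 2 );
          my %data = ( 'ABC123 Homo sapiens' => { 4 => [ 1e-5, 1e-7 ] } );

          print join( ',', qw/ Organism Freq Median_Eval / ), "\n";
          for my $key ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
              my @row = ( $key, $freq{$key} );
              push @row, median( @{ $data{$key}{$_} } ) for sort keys %{ $data{$key} };
              # Quote any field containing a comma so Excel parses it correctly.
              print join( ',', map { /,/ ? qq{"$_"} : $_ } @row ), "\n";
          }

      For anything beyond a quick-and-dirty dump, a module such as Text::CSV would handle the quoting edge cases for you.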

      If your results run to many lines, Perl6::Form would let you arrange for headers to print after every so many lines.

      I say that about Text::Table because I'm guessing your results will be more than a couple of pages. If not, Text::Table would be OK.

      If there are a lot of rows to print, you might want to repeat your header every 50 or so lines to aid your readers. One way to do it is below.

      Instead of print $tb at the end of your script, the following would repeat the header every 50 lines.

      my $rows = $tb->body_height();
      my $pagelines = 50;
      for my $i (0 .. $rows-1) {
          print $tb->title() if $i % $pagelines == 0;
          print $tb->body($i);
          print "\n" if $i % $pagelines == $pagelines-1;
      }

      Chris

      Update: set rows to the number of rows in table.
      I was getting the number of items from keys %freq.