AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I am trying to write a script to parse some tabular BLAST output. Each Query has hits to many different subjects. The 'goodness' of the hit is measured by the bits score. While parsing the BLAST output firstly, I assign a 'Flag' to each Query/Subject/bit_score.

It then gets complicated as I want to compare the Flags assigned to each Query/Subject/bitscore for those with scores within 10% of the top bitscore for that Query. The script is working up until ##### For each Query_id find top bit score, and I am obviously doing something wrong trying to access the hash as the bit_score is now undefined.

Here is my current 'working' code. You can see that I'm not quite sure what to do after the easy part. Any suggestions? Thanks

sample input

158256496-stool1_revised_C989779_1_gene1513 strand:+ 203907 +.Bfl015 63.04 46 17 0 1 46 1 46 + 1e-07 59.3 + 158256496-stool1_revised_C989779_1_gene1513 strand:+ 326427 +.Cagg_1194 65.22 46 16 0 2 47 3 + 48 1e-07 59.3 + 158256496-stool1_revised_C989779_1_gene1513 strand:+ 334413 +.FMG_1629 70.45 44 13 0 1 44 4 47 + 1e-07 59.3 + 158256496-stool1_revised_C989779_1_gene1513 strand:+ 224915 +.bbp013 63.04 46 17 0 1 46 1 46 + 1e-07 59.3 + 158256496-stool1_revised_C989779_1_gene1513 strand:+ 543302 +.AaLAA1DRAFT_2702 72.09 43 12 0 1 43 29 + 71 1e-07 59.3 + 158256496-stool1_revised_C989779_1_gene1517 strand:+ 484018 +.BACPLE_00501 77.43 505 114 0 1 505 1 + 505 0.0 848 + 158256496-stool1_revised_C989779_1_gene1517 strand:+ 547042 +.BACCOPRO_00289 77.43 505 114 0 1 505 5 + 509 0.0 848 + 158256496-stool1_revised_C989779_1_gene1517 strand:+ 449673 +.BACSTE_02782 77.56 508 111 1 1 505 1 + 508 0.0 847 + 158256496-stool1_revised_C989779_1_gene1517 strand:+ 665953 +.HMPREF1016_02618 77.76 508 110 1 1 505 1 + 508 0.0 845 + 158256496-stool1_revised_C989779_1_gene1517 strand:+ 762984 +.HMPREF9445_02115 77.17 508 113 1 1 505 1 + 508 0.0 845 + 158256496-stool1_revised_C989779_1_gene1517 strand:+ 702446 +.CUU_1632 77.03 505 116 0 1 505 1 505 + 0.0 845 + 158256496-stool1_revised_C989779_1_gene1517 strand:+ 469593 +.HMPREF9011_00038 77.03 505 116 0 1 505 1 + 505 0.0 845 + 158256496-stool1_revised_C989779_1_gene1517 strand:+ 457394 +.BSFG_02153 77.03 505 116 0 1 505 1 + 505 0.0 845 + 158256496-stool1_revised_C989779_1_gene1517 strand:+ 435590 +.BVU_1475 77.03 505 116 0 1 505 1 505 + 0.0 845

work in progress code

############### read in blast table and parse ####### my $in_blast_tab=$ARGV[0]; open(IN,$in_blast_tab) or die "cannot open $in_blast_tab\n"; my %HoFlg1; my %HoFlg2; #my $Flag1; my $Flag2; while (my $line=<IN>) { chomp $line; my ($Query_id,$strand,$Subj_id,$Perc_iden,$align_len,$num_mm,$ +gap,$q_start,$q_end,$s_start,$s_end,$e_value,$bit_score)=split("\t",$ +line); # extra field of strand next if ($bit_score <60); my ($Flag1, $Flag2 ) = &Flag( $Subj_id, \%proph_prots, \%euk_p +rots, \%vir_prots ); $HoFlg1{$Query_id}{$Subj_id}{$bit_score}=$Flag1; $HoFlg2{$Query_id}{$Subj_id}{$bit_score}=$Flag2; print join("\t",$Query_id,$Subj_id,$bit_score,$Flag1,$Flag2)." +\n"; } sub Flag { my $Subj_id=$_[0]; my $proph_prots=$_[1]; my $euk_prots=$_[2]; my $vir_prots=$_[3]; my $Flag1; my $Flag2; if (exists $proph_prots{$Subj_id}){ $Flag1="Proph"; $Flag2="Phage"; } elsif (exists $euk_prots{$Subj_id}){ $Flag1="Euk"; $Flag2="Euk" } elsif(exists $vir_prots{$Subj_id}){ $Flag1="Vir"; $Flag2="Phage"; } else { $Flag1="Bact"; $Flag2="Bact"; } return($Flag1,$Flag2); } ##### For each Query_id find top bit score my %top_bit; foreach my $key1 (sort keys %HoFlg2){ foreach my $key2 (keys %{$HoFlg2{$key1}}){ print "key1\t$key1\tkey2$key2\tbitscore\t $HoF +lg2{$key1}{$key2}\n" $top_bit{$key1}=$HoFlg2{$key1}{$key2}; #??prob +ably wrong } print "for query\t$key1 top bit score is\t$top_bit{$key1}\n"; }

Replies are listed 'Best First'.
Re: help with hash or arrays or hash references
by Anonymous Monk on May 08, 2012 at 12:53 UTC

    That big if/elsif/elsif/else block is broken, here is why/how

    $ perl -e " use strict; use warnings; if( 0 ){ my $foo = 1; } else { $ +foo = 2; } " Global symbol "$foo" requires explicit package name at -e line 1. Execution of -e aborted due to compilation errors. $ perl -e " use strict; use warnings; if( 1 ){ my $foo = 1; } else { $ +foo = 2; } " Global symbol "$foo" requires explicit package name at -e line 1. Execution of -e aborted due to compilation errors.

    The  my $foo only exists inside that first if-block, all the other $foo refer to the undeclared global variable $foo

    somebody told you to use strict, and you gave it a shot, but since you did not understand the benefits, you eventually turned it off , and now your program is broken

    The shortest fix to satisfy strict is to move  my $foo outside the if/else/block, as in

    $ perl -e " use strict; use warnings; my $foo; if( 0 ){ $foo = 1; } el +se { $foo = 2; } "

    More on scope/strict in Lexical scoping like a fox, Read this if you want to cut your development time in half! and understand that  strict itself confers no benefits; The benefits come from avoidance of the bad practices forbidden by  strict :)

    My second piece of advice, is turn that if/else block into a function, maybe like so

    my( $Flag1, $Flag2 ) = ProtsFlags( $Subj_id, \%proph_prots, \%euk_ +prots, \%vir_prots ); $HoFlg1{$Query_id}{$Subj_id}{$bits} = $Flag1; $HoFlg2{$Query_id}{$Subj_id}{$bits} = $Flag2;

      Great Thanks. I had just realized the that I had to move the declaration to outside of the if else loop. I also like the idea of a subroutine. Subroutine is now working. Here working code for any following this thread

      new code

      ############### read in blast table and parse ####### my $in_blast_tab=$ARGV[0]; open(IN,$in_blast_tab) or die "cannot open $in_blast_tab\n"; my %HoFlg1; my %HoFlg2; #my $Flag1; my $Flag2; while (my $line=<IN>) { chomp $line; my ($Query_id,$strand,$Subj_id,$Perc_iden,$align_len,$num_mm,$ +gap,$q_start,$q_end,$s_start,$s_end,$e_value,$bit_score)=split("\t",$ +line); # extra field of strand next if ($bit_score <60); my ($Flag1, $Flag2 ) = &Flag( $Subj_id, \%proph_prots, \%euk_p +rots, \%vir_prots ); $HoFlg1{$Query_id}{$Subj_id}{$bit_score}=$Flag1; $HoFlg2{$Query_id}{$Subj_id}{$bit_score}=$Flag2; print join("\t",$Query_id,$Subj_id,$bit_score,$Flag1,$Flag2)." +\n"; } sub Flag { my $Subj_id=$_[0]; my $proph_prots=$_[1]; my $euk_prots=$_[2]; my $vir_prots=$_[3]; my $Flag1; my $Flag2; if (exists $proph_prots{$Subj_id}){ $Flag1="Proph"; $Flag2="Phage"; } elsif (exists $euk_prots{$Subj_id}){ $Flag1="Euk"; $Flag2="Euk" } elsif(exists $vir_prots{$Subj_id}){ $Flag1="Vir"; $Flag2="Phage"; } else { $Flag1="Bact"; $Flag2="Bact"; } return($Flag1,$Flag2); }
        The hashes, %proph_prots, %euk_prots, %vir_prots are not initialized with any values.

        If the only parameters you are using from the file are:
        $Query_id, $Subj_id, $bit_score,
        you could make your initialization line as:
        my ($Query_id, $Subj_id, $bit_score) = (split /\t/, $line)[0, 2, -1];

        Update: Those are hash refs in the Flag() function but they are being used with regular hash notation/syntax. (You have not declared use strict; at the beginning of the script. Thats how I found the uninitialized hashes mentioned above.)

        By chance, are ALL your resulting Flags "Bact"?

        You are passing in hash references, but not dereferencing them; thus it seems your IFs will always fail.

        Seems to me you should have this:

        #if (exists $proph_prots{$Subj_id}){ if (exists $$proph_prots{$Subj_id}){

        And so on.

        Here is another way to write that, less typing, fixed to dereference properly ( references quick reference ) , and filtered through perltidy -csc -otr -opr -ce -nibc -i=4

        sub Flag { my ( $Subj_id, $proph_prots, $euk_prots, $vir_prots ) = @_; return "Proph", "Phage" if exists $$proph_prots{$Subj_id}; return "Euk", "Euk" if exists $$euk_prots{$Subj_id}; return "Vir", "Phage" if exists $$vir_prots{$Subj_id}; return "Bact", "Bact"; } ## end sub Flag