lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hi!
I am using bioperl to parse a blast report. I am having
difficulty parsing only the best hsp(high scoring pair)info
as well as the alignment. I currently can get
the alignment to print(clustalW format) but not the
other info concerning the hsp to an out file.

#!/usr/bin/perl use warnings; use strict; use Bio::SearchIO; use Bio::AlignIO; use strict; my $out = "C:/Documents and Settings/mydir/Desktop/current/bl2seq/1311 +7_aln.out"; my $dir = 'C:/Documents and Settings/mydir/Desktop/current/bl2seq/'; my $in = new Bio::SearchIO(-format => 'blast', -file => $dir.'13117_2.out'); while( my $result = $in->next_result ) { while( my $hit = $result->next_hit ) { while( my $hsp = $hit->next_hsp ) { if( $hsp->length('total') > 1000 ) { if ( $hsp->percent_identity >= 90 ) { if( $hsp->score ){ if( $hsp->gaps ){ if( $hsp->expect ){ print "Hit= ", $hit->name, " Length=", $hsp->length('total'), " Score=", $hsp->score,"\n"; print "Gaps=", $hsp->gaps, " Expect=", $hsp->expect, " Percent_id=", $hsp->percent_identity, "\n"; my $aln = $hsp->get_aln; my $alnIO = Bio::AlignIO->new(-format =>"clustalw", -file +=> $out); $alnIO->write_aln($aln); } } } } } } } } __DATA__ Query= 13117 Mus musculus genomic with Cassette inserted (10,084 letters) >13117/13117_fasta.Contig1 Length = 1453 Score = 2046 bits (1032), Expect = 0.0 Identities = 1088/1100 (98%), Gaps = 5/1100 (0%) Strand = Plus / Minus + Query: 1001 taggaaccattttttttttttct-gtgtttccttca-tcctgtgactctgccaacc-a +ac 1057 ||||||||||||||||||||| | |||||||||||| |||||||||||| |||||| | +|| Sbjct: 1113 taggaaccatttttttttttttttgtgtttccttcaatcctgtgactctaccaaccca +ac 1054 + Query: 1058 tttctaatccctctacctgcaaactacaaactctggtctgccagcaggttatgaattg +ta 1117 |||||||||| ||||||||||||||||||||||||||||||||||||||||| |||| +|| Sbjct: 1053 tttctaatccttctacctgcaaactacaaactctggtctgccagcaggttataaattt +ta 994 + Query: 1118 aagaaatttcattatgcttttgtctgcaaaatcccccagttctcagaaacaggg-tta +aa 1176 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||| +|| Sbjct: 993 aagaaatttcattatgcttttgtctgcaaaatcccccagttctcagaaacaggggtta +aa 934 + Query: 1177 gccaatgtgtagagaagaatgaacaaaa-tgcagaaatgcagggaatcccagtgcctc +at 1235 ||||||| |||||||||||||||||||| |||||||| |||||||||||||||||||| +|| Sbjct: 933 gccaatgggtagagaagaatgaacaaaaatgcagaaaagcagggaatcccagtgcctc +at 874 + Query: 1236 ggggtgggggtggggtgggagtgcagggtcagcagggaccatccagatgctgggacac +cc 1295 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 873 ggggtgggggtggggtgggagtgcagggtcagcagggaccatccagatgctgggacac +cc 814 + Query: 1296 ctgtgaaatatcacaccctctctgccccgtgctgtcctggattcaggtcaaactggat +tt 1355 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 813 ctgtgaaatatcacaccctctctgccccgtgctgtcctggattcaggtcaaactggat +tt 754 + Query: 1356 caatctgcataatttgtcccatttagaagggttgggagagggtgagaacaggagggtg +aa 1415 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 753 caatctgcataatttgtcccatttagaagggttgggagagggtgagaacaggagggtg +aa 694 + Query: 1416 gactgaggaggggggatagatactccccccatcccaaaggcaggcatgaagtagtggc +ca 1475 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 693 gactgaggaggggggatagatactccccccatcccaaaggcaggcatgaagtagtggc +ca 634 + Query: 1476 gaggaaaccagagtgaacaggctggagtgaaggctggtagctgctagagagaggaggt +gg 1535 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 633 gaggaaaccagagtgaacaggctggagtgaaggctggtagctgctagagagaggaggt +gg 574 + Query: 1536 gaccgggaggagcagcaggcttgaaaagcacagggtaattgtcagcccagcactgtag +gg 1595 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 573 gaccgggaggagcagcaggcttgaaaagcacagggtaattgtcagcccagcactgtag +gg 514 + Query: 1596 ggcaatttagtctacttacccagtggcaagaagtaaaaaaaaagtgggctgtgactta +aa 1655 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 513 ggcaatttagtctacttacccagtggcaagaagtaaaaaaaaagtgggctgtgactta +aa 454 + Query: 1656 gatcttaagactcactagagagggaggggttagcctgggagtgggaggagggagtgca +ga 1715 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 453 gatcttaagactcactagagagggaggggttagcctgggagtgggaggagggagtgca +ga 394 + Query: 1716 gtttaggaggagacagagactctggagggtgcaaggaagtgtctttagctgagctggg +ct 1775 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 393 gtttaggaggagacagagactctggagggtgcaaggaagtgtctttagctgagctggg +ct 334 + Query: 1776 gggctgggacgggaaggtagactgaagctgcctgaactgctcatcccgctgggtctga +gc 1835 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 333 gggctgggacgggaaggtagactgaagctgcctgaactgctcatcccgctgggtctga +gc 274 + Query: 1836 tcccagtggggcggttcgcatgcctggaaggactctaagcaccctcagacctcttact +ct 1895 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 273 tcccagtggggcggttcgcatgcctggaaggactctaagcaccctcagacctcttact +ct 214 + Query: 1896 gaggttcaagccaatccttcagggtgcttgctgagttggtgacccatctctggaactg +gt 1955 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 213 gaggttcaagccaatccttcagggtgcttgctgagttggtgacccatctctggaactg +gt 154 + Query: 1956 gtccctgggactcactgcccagcttcacttcattcagcaccatgggtaccgatttaaa +tg 2015 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 153 gtccctgggactcactgcccagcttcacttcattcagcaccatgggtaccgatttaaa +tg 94 + Query: 2016 atccagtggtcctgcagaggagagattgggagaatcccggtgtgacacagctgaacag +ac 2075 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||| +|| Sbjct: 93 atccagtggtcctgcagaggagagattgggagaatcccggtgtgacacagctgaacag +ac 34 Query: 2076 tagccgcccaccctcccttt 2095 |||||||||||||||||||| Sbjct: 33 tagccgcccaccctcccttt 14 Score = 40.1 bits (20), Expect = 1e-005 Identities = 20/20 (100%) Strand = Plus / Minus Query: 7698 ggggtgggggtggggtggga 7717 |||||||||||||||||||| Sbjct: 873 ggggtgggggtggggtggga 854 Score = 28.2 bits (14), Expect = 0.046 Identities = 14/14 (100%) Strand = Plus / Minus Query: 8817 catttttttttttt 8830 |||||||||||||| Sbjct: 1106 catttttttttttt 1093 Score = 26.3 bits (13), Expect = 0.18 Identities = 13/13 (100%) Strand = Plus / Plus Query: 363 cccctccctctct 375 ||||||||||||| Sbjct: 425 cccctccctctct 437 Score = 26.3 bits (13), Expect = 0.18 Identities = 13/13 (100%) Strand = Plus / Minus Query: 7649 agggtgagaacag 7661 ||||||||||||| Sbjct: 715 agggtgagaacag 703 Score = 24.3 bits (12), Expect = 0.71 Identities = 12/12 (100%) Strand = Plus / Plus Query: 5507 cccccctcctca 5518 |||||||||||| Sbjct: 679 cccccctcctca 690 Score = 24.3 bits (12), Expect = 0.71 Identities = 12/12 (100%) Strand = Plus / Minus Query: 8819 tttttttttttt 8830 |||||||||||| Sbjct: 1101 tttttttttttt 1090 Score = 24.3 bits (12), Expect = 0.71 Identities = 12/12 (100%) Strand = Plus / Minus Query: 8819 tttttttttttt 8830 |||||||||||| Sbjct: 1102 tttttttttttt 1091 Score = 24.3 bits (12), Expect = 0.71 Identities = 12/12 (100%) Strand = Plus / Minus Query: 8819 tttttttttttt 8830 |||||||||||| Sbjct: 1103 tttttttttttt 1092 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Minus Query: 598 tcctgcagagg 608 ||||||||||| Sbjct: 84 tcctgcagagg 74 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Minus Query: 4644 gcctgaactgc 4654 ||||||||||| Sbjct: 304 gcctgaactgc 294 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Minus Query: 4040 ctgcctgaact 4050 ||||||||||| Sbjct: 306 ctgcctgaact 296 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Minus Query: 9957 ttagctgagct 9967 ||||||||||| Sbjct: 349 ttagctgagct 339 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Plus Query: 9036 ctgtctcctcc 9046 ||||||||||| Sbjct: 378 ctgtctcctcc 388 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Plus Query: 7830 tcctcccactc 7840 ||||||||||| Sbjct: 405 tcctcccactc 415 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Minus Query: 368 ccctctctgcc 378 ||||||||||| Sbjct: 798 ccctctctgcc 788 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Plus Query: 9047 cccaccccacc 9057 ||||||||||| Sbjct: 855 cccaccccacc 865 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Plus Query: 9045 cccccacccca 9055 ||||||||||| Sbjct: 864 cccccacccca 874 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Plus Query: 8608 tttctgagaac 8618 ||||||||||| Sbjct: 945 tttctgagaac 955 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Plus Query: 564 aaaaaaaaaat 574 ||||||||||| Sbjct: 1095 aaaaaaaaaat 1105 Score = 22.3 bits (11), Expect = 2.8 Identities = 11/11 (100%) Strand = Plus / Minus Query: 7639 tttgttaagaa 7649 ||||||||||| Sbjct: 1436 tttgttaagaa 1426 Lambda K H 1.37 0.711 1.31 Gapped Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Gap Penalties: Existence: 5, Extension: 2 Number of Sequences: 1 Number of Hits to DB: 562 Number of extensions: 37 Number of successful extensions: 37 Number of sequences better than 10.0: 1 Number of HSP's gapped: 24 Number of HSP's successfully gapped: 21 Length of query: 10084 Length of database: 1453 Length adjustment: 12 Effective length of query: 10072 Effective length of database: 1441 Effective search space: 14513752 Effective search space used: 14513752 X1: 11 (21.8 bits) X2: 15 (29.7 bits) X3: 50 (99.1 bits) S1: 11 (22.3 bits) S2: 11 (22.3 bits)

I am only interested in parsing out the first hit, which is
the best hit with greater than 98% identity. Any ideas?

Replies are listed 'Best First'.
Re: parsing blast output
by roboticus (Chancellor) on May 01, 2009 at 22:40 UTC
    IomSpace:

    Perhaps you should try putting the label OUTER: before the outer while statement, and last OUTER; after the $alnIO->write_aln($aln); statement.

    ...roboticus
Re: parsing blast output
by mikeraz (Friar) on May 02, 2009 at 00:19 UTC

    <Dicken's Character Voice>
    Please sir, might I have a little <readmore> tag?
    Just one sir? Please?
    You could find one here sir.
    <Dicken's Character Voice/>


    Be Appropriate && Follow Your Curiosity
Re: parsing blast output
by Marshall (Canon) on May 04, 2009 at 02:09 UTC
    First I would suggest simplifying the code. You have a bunch of "if" statements that are essentially "and" conditions. Your code will work, but this introduces more levels of {} which is not good.
    if( $hsp->length('total') > 1000 ) { if ( $hsp->percent_identity >= 90 ) { if( $hsp->score ){ if( $hsp->gaps ){ if( $hsp->expect ){ print "Hit= ", $hit->name, " Length=", $hsp->length('total'), " Score=", $hsp->score,"\n"; print "Gaps=", $hsp->gaps, " Expect=", $hsp->expect, " Percent_id=", $hsp->percent_identity, "\n"; my $aln = $hsp->get_aln; my $alnIO = Bio::AlignIO->new(-format =>"clustalw", -file +=> $out); $alnIO->write_aln($aln); } } } } } ## better is to reduce the levels of "if" by using "and" or && if ( ($hsp->length('total') > 1000 ) && ($hsp->percent_identity >= 90 ) && ($hsp->score ) && ($hsp->gaps) && ($hsp->expect) ) { #do your print stuff }
    In my opinion when I see eight consecutive "}"'s in a row, I figure that something is wrong with the code - too many levels of indentation!

    I was looking at your while loop and it is hard for me to say exactly how to make that better as there can be a lot of cases here and I am unable to run your current code without installing some modules.

    Can you simplify your question?

      #!/usr/bin/perl use warnings; use strict; use Bio::SearchIO; use Bio::AlignIO; my $out = ">C:/Documents and Settings/mgavi.brathwaite/Desktop/current +/bl2seq/13117_aln1.out"; my $dir = 'C:/Documents and Settings/mgavi.brathwaite/Desktop/current/ +bl2seq/'; my $in = new Bio::SearchIO(-format => 'blast', -file => $dir.'13117_2.out', -report_type => 'bl2seq'); while( my $result = $in->next_result ) { while( my $hit = $result->next_hit ) { while( my $hsp = $hit->next_hsp ) { if( $hsp->length('total') > 1000 ) { if ( $hsp->percent_identity >= 90 ) { if( $hsp->score ){ if( $hsp->gaps ){ if( $hsp->expect ){ print "Hit= ", $hit->name, " Length=", $hsp->length('total'), " Score=", $hsp->score,"\n"; print " Gaps=", $hsp->gaps, " Expect=", $hsp->expect, " Percent_id=", $hsp->percent_identity, "\n"; =cut # $hit->name, $hit->hsps, $hit->name, $hsp->length('total'), $hsp->score, $hsp->gaps, $hsp->expect, $hsp->percent_identity; =cut my $aln = $hsp->get_aln; my $alnIO = Bio::AlignIO->new(-format =>"clustalw" +, -file => $out); $alnIO->write_aln($aln); } } } } } } } }
      I can get the alignment, but I also need the print data
      to be prepended at the top of the $out file.
        So, if I understand the question, you've got what you want except that instead of print going to screen, you want that print to go to the file that the Bio module is writing to?

        First one thing that might be helpful here is to do a $|=1; at the top of the program, this turns Autoflush on. This slows the output down a bit, but until you get this working, this might be a good idea to do.

        I'm not familiar with the module that you are using, but one suggestion is to see if you can give this Bio thing a filehandle instead of a text filepath. If that's true, then you open the output file, use the handle for your printing and give the handle to Bio for its printing. This is the easiest thing if this module allows it.(eg. open say $outhandle to the $out filepath (open ($outhandle, ">", "path) or die .....) In Perl you can use a scalar as a filehandle instead of a bareword like OUT)..

        Anyway before speculating about more complex scenarios, try the simple approach first.