cumurph has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a sequence and db I made myself. I have to run blast on subsequences of length 20 and identify the ones that are unique to my sequence. Any suggestion on how to go about doing this? I can get blast to "blast" my sequence, at length 20, but I'm having lots of trouble with parsing. There are so many parameters to set up. Any suggstions on which ones I need to accomplish my goal. The current code I'm using gives me 1000's of OUT files. There has to be a better way than having to read get of the OUT files. Below is the code I'm using. Thank you in advance.
#!/usr/bin/perl; use strict; # installed bioperl locally, use local bioperl lib #use lib="/home/murphyr/lib/"; use Bio::Seq; use Bio::SeqIO::fasta; use Bio::Tools::Blast; use Bio::Tools::Run::StandAloneBlast; use Bio::Tools::SeqStats; use Data::Dumper; print "Program to blast subsequences \n"; print "Enter File name of Sequence in FASTA Format \n"; my $inputfile = <STDIN>; print "$inputfile \n"; my $inseq1 = Bio::SeqIO ->new('-file' => "$inputfile", '-format'=> 'fa +sta'); my $outseq1 = new Bio::SeqIO (-fh => \*STDOUT, -format => 'fasta'); my $seq1 = $inseq1 -> next_seq(); my $seq_stats1 = Bio::Tools::SeqStats->new(-seq=>$seq1); my $monomer_ref = $seq_stats1 -> count_monomers(); my @k = keys %$monomer_ref; my @v = values %$monomer_ref; print "@k\n"; print "@v\n"; my $mytotal = 0; foreach my $value (@v) { $mytotal = $mytotal + $value; } # print "$seq1 -> seq()\n"; my $did = $seq1 -> display_id(); print "$did\n"; print "Here \n"; # print $outseq1 -> write_seq($seq1); my $start_pos = 0; my $end_pos = 24; while ($end_pos != $mytotal) { $start_pos = $start_pos + 1; $end_pos = $end_pos + 1; my $subseq = $seq1 -> subseq($start_pos, $end_pos); print "$subseq \n"; print "Now doing blast on $subseq \n"; my @params = ('program' => 'blastn', 'database' => 'virus-db', 'outfile' => "$subseq.out"); my $factory = Bio::Tools::Run::StandAloneBlast->new(@params); my $subinput = Bio::Seq->new('-id'=>"sub test query", '-seq'=>"$subseq"); my $blast_report = $factory->blastall($subinput); # Get the report my $searchio = new Bio::SearchIO ('-format' => 'blast', '-file' => "$subseq.out"); my $result = $searchio->next_result; # Get info about the entire report $result->database_name; my $algorithm_type = $result->algorithm; # get info about the first hit my $hit = $result->next_hit; my $hit_name = $hit->name ; # get info about the first hsp of the first hit my $hsp = $hit->next_hsp; my $hsp_start = $hsp->query->start; print "Now parsing the blast reports using SearchIO \n"; # # Not being selective here ... print the whole she-bang! ... # print "Algorithm Type \n"; print "$algorithm_type \n"; while ( (my $khit, my $vhit) = each %{$hit}) { print "$khit => $vhit \n"; } print Dumper(\%$hit); print "Hit Name \n"; print "$hit_name\n"; while ( (my $khit,my $vhit) = each %{$hsp}) { if ($khit eq "_evalue") { print "$khit => $vhit \n"; } }

Replies are listed 'Best First'.
Re: Parsing BLAST
by bowei_99 (Friar) on Apr 24, 2006 at 03:43 UTC
    Any suggestion on how to go about doing this?
    Well, you've got a lot of code that's not broken up into subroutines, so it makes it a lot less readable and maintainable, e.g. harder to figure out exactly what you're doing. I assume that you're looking to see how well it performs. If so, you can use Devel::DProf. Or, if you have two ways of doing something, you can compare them using Benchmark.


    ... but I'm having lots of trouble with parsing.
    What kind of trouble? Is it slow? Giving errors? Taking up too much resources? Unexpected results?


    There are so many parameters to set up. Any suggstions on which ones I need to accomplish my goal.

    A few things:

    • Is your code working? Seems to me like you're using all your variables. In terms of whether you need to go through all those steps, I don't know... this is definitely a case where breaking it up into subroutines could not just encapsulate your variables to a given subroutine, but help to let you know whether you really need it. I assume you've read the docs on cpan, like Bio::Seq?

    • Your definitions
      my @k = keys %$monomer_ref; my @v = values %$monomer_ref;

      are redundant... if you use keys %$monomer_ref; or values %$monomer_ref instead, it would be one less thing to set up.

    • I'd suggest using Config::Std, or any of the other modules out there that lets you set configuration parameters in an initialization file. You could also use GetOpt::Long, if you like to set things from the command line.


    The current code I'm using gives me 1000's of OUT files. There has to be a better way than having to read get of the OUT files.
    Well, is there a reason you have to use so many files in the first place? I've found Log::Std to be very useful for logging information.

    -- Burvil

Re: Parsing BLAST
by educated_foo (Vicar) on Apr 24, 2006 at 05:52 UTC
    Yeah: don't use Bioperl. IMHO it's overengineered and painful. In the past I've used Boulder::Blast with some success, but in the end I found it easier to just hand-roll a parser to extract the parts I cared about. Test it well -- BLAST format is ugly!
Re: Parsing BLAST
by MadraghRua (Vicar) on Apr 24, 2006 at 19:19 UTC
    Depends on what you're trying to do. OK, you appear to be searching for 20mers against a virus library. Is the purpose to identify how often a given 20mer comes up in the library? To see if the 20mer is unique in a given sequence? To see if the 20mer is represented at all in the library? To map the position of 20mer hits to features within the library? Answers depend upon the purpose.

    I suggest that you go to the Pasteur web site that provides BioPerl training and have a look at the examples they give around doing Blast - it has several very good examples on how to do this with variations on the parameters. I agree that BioPerl can be a bit of a beast at the beginning but I happen to like it. The alternative is try out a copy of the Tisdall book

    Another question - given that you're looking for 20mers, is BLAST even the best tool to be using for this exercise? You're going to end up with many hits (they're 20mers after all and they're everywhere) and many HSPs based upon each individual hit and your e-values are going to be crap.

    Given this would you maybe be better in taking your library of sequences, walking down it 20 bases at a time and scoring each 20mer pattern as you observe it? This is more by way of a regex approach to the problem. Each 20mer is represented as a key in a hash and you simply increment by one every time you get a new pattern occurring.

    If you're trying something more ambitious, you'll need to provide more information.

    MadraghRua
    yet another biologist hacking perl....

      I am trying to find which 20mer's are unique to my sequence. I've read the stuff at pasteur and its doesn't really seem to help me for my particular problem. I also have to do this search using FASTA, and have no clue even where to start with that., but that's another bird to kill. thanks!
        I always parse blast in its -m 8 or -m 9 tabular output format. Much easier to parse.
        Unless your sequence is quite large (and so you have many thousands of unique 20mers), I would go the hash route. It will be VERY fast if memory isn't limiting. If that isn't feasible, break your sequence into fasta sequences of size 20 base pairs and give each a unique ID. Then, blast away using tabular output. Then, you can parse to your heart's content using simple perl.

        Sean
        Is this homework?
A reply falls below the community's threshold of quality. You may see it by logging in.