Tjuh has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone :D

For a school project, I'm trying to create an iterative blastn progam. The idea is that a fasta file is n-blasted against a database created from that same fasta file.

So in the first round if there are any hits found, those hit id's with their respective sequences are used in the next round of blast. This will go on until the roundlimit is reached or there are no more hits found. The thing is that there are several iterations of hits. Many of those hits will be repeated in more than one iteration.

The end result would be a hash containing all the uniq hits found in the blast rounds. Sequences would be retrieved by the hit id's and finally a fasta file of highly similar sequences would have been made.

So far I have a working example although I'm not sure if does exactly what I want it to. Especially the section "#get header and sequence..." I'm not too sure if I'm doing that right. I could use some feed back on this or the whole progam in general.

Thank u very much.

sub iterblast{ # settings my $myblast = "blastall -p blastn -m 8"; my $starterseq = "/share/home/tjuh/Thema12-2011/Project/testblast/ +matchedhits.fasta"; my $blastdb = "/share/home/tjuh/Thema12-2011/Project/databases/ +termite_hindgut/hindgut_db"; my $outputroot = "/share/home/tjuh/Thema12-2011/Project/testblast/ +test1"; my $evalue = 0.01; my $wordsize = 11; my $roundlimit = 4; # output files: my $seq_db_fasta = $outputroot . ".tmp"; my $seq_db_all = $outputroot . ".all"; my $logfile = $outputroot . ".log"; # remove existing files system("rm -rf $seq_db_fasta"); system("rm -rf $seq_db_all"); system("rm -rf $logfile"); open(LOG,">$logfile"); print LOG "Configurations:\n"; print LOG "Input file : $starterseq\n"; print LOG "Output files : $outputroot\-\n"; print LOG "Blast Engine : $myblast\n"; print LOG "Target database : $blastdb\n"; print LOG "E-value cut-off : $evalue\n"; print LOG "Blast word size : $wordsize\n"; print LOG "Blast round limit: $roundlimit\n\n"; print LOG "Runs:\n"; # hits hash my %hits = (); my %queryids = (); # Start the iterative blastn rounds for (my $round = 1; $round <= $roundlimit; $round++) { my $hitcount = 0; my $infile = ""; # the initial fasta file is only run once if ($round > 1) { $infile = $seq_db_fasta; } else { $infile = $starterseq; } my $saveblst = $outputroot . ".txt"; system("rm -rf $saveblst"); # print LOG info my $curtime = `date`; chomp($curtime); print LOG "$curtime - Blast seq: $infile\n"; print LOG "$curtime - Blast db : $blastdb\n"; # run blast system("$myblast -i $infile -d $blastdb -o $saveblst -e $evalu +e -W $wordsize"); open(PARSE, "<$saveblst"); open(FASTA, ">$seq_db_fasta"); # parse blast output (tab form) and add matched id's in hi +ts hash my @lines = (); while (my $line = <PARSE>){ chomp $line; @lines = split(/\t/, $line); $queryids{$lines[0]}++; $hits{$lines[1]}++; } #print "Hash size queryids: ", scalar keys %queryids, "\n" +; #print "Hash size hits: ", scalar keys %hits, "\n"; # get header and sequence with 'fastacmd' and write to new + fasta file while (my $key = each %hits){ #print $key . "\n"; if (exists($queryids{$key})){ # only writes uniq ids t +o file #print $key . "\n"; $queryids{$key} = 1; } else { system("fastacmd -d $blastdb -s 'lcl|$key' >> $seq +_db_fasta"); $hitcount++; } } print LOG "Round $round has $hitcount unique hits\n\n"; # evaluate if to run for next blast round if ($hitcount == 0) { print LOG "Hitcount: $hitcount through Round $round == +> STOP\n"; $round = $roundlimit; } close(FASTA); close(PARSE); } print "Total uniq hits found in all rounds: ", scalar keys %hits, +"\n"; close(LOG); }

Replies are listed 'Best First'.
Re: Iterative BLAST(N) program
by toolic (Bishop) on Jun 07, 2011 at 20:48 UTC
    So far I have a working example although I'm not sure if does exactly what I want it to.
    That's what tests are for.
    I could use some feed back on this or the whole progam in general.
    • autodie to check your open/close/system.
    • You open FASTA, but never write to it. Update: tricky... you do write to the file via system. So, there is no need to open the FASTA filehandle.
    • perlcritic for other style issues.
Re: Iterative BLAST(N) program
by GrandFather (Saint) on Jun 08, 2011 at 10:33 UTC

    Use lexical file handles, three parameter open and test the result of open:

    open my $log, '>', $logfile or die "Failed to create $logfile: $!\n";

    Hashes and arrays are born empty so there is no need to explicitly flush them with my %hits = (); (for example). A simple my %hits; gets the job done and results in less clutter.

    Use Perl loops. They are easier to understand and are less error prone than C style for loops. Instead of:

    for (my $round = 1; $round <= $roundlimit; $round++) {

    use:

    for my $round (1 .. $roundlimit) {

    and instead of monkeying around with the loop counter to exit the loop simply use last:

    if ($hitcount == 0) { print $log "Hitcount: $hitcount through Round $round ==> S +TOP\n"; last; }

    Don't worry about the files getting closed. The lexical file handles going out of scope will take care of that for you.

    True laziness is hard work

      Thanks for the reply.

      So the part at "# get header and sequence with 'fastacmd' and write to new fasta file" is handled correctly? Cuz I'm trying to get only new hits to use in the next rounds and so on.

        I can't tell because I don't know what the lines you are dealing with look like. I can only help with overall structure and technique. On the face of it the handling looks OK, but it does depend very much on the content of the first two fields on each line. Can the letter case of the contents be different or can numeric representation be different between the fields for example?

        I've applied the changes I suggested in my reply along with a few others to give:

        #!/usr/bin/perl use strict; use warnings; sub iterblast { # settings my $myblast = "blastall -p blastn -m 8"; my $starterseq = "/share/home/tjuh/Thema12-2011/Project/testblast/matchedhits.f +asta"; my $blastdb = "/share/home/tjuh/Thema12-2011/Project/databases/termite_hindg +ut/hindgut_db"; my $outputroot = "/share/home/tjuh/Thema12-2011/Project/testblast/ +test1"; my $evalue = 0.01; my $wordsize = 11; my $roundlimit = 4; # output files: my $seq_db_fasta = $outputroot . ".tmp"; my $seq_db_all = $outputroot . ".all"; my $logfile = $outputroot . ".log"; # remove existing files unlink $seq_db_fasta; unlink $seq_db_all; unlink $logfile; open my $log, '>', $logfile or die "Can't open $logfile: $!\n"; print $log <<LOGMSG; Configurations: Input file : $starterseq Output files : $outputroot\- Blast Engine : $myblast Target database : $blastdb E-value cut-off : $evalue Blast word size : $wordsize Blast round limit: $roundlimit Runs: LOGMSG # hits hash my %hits; my %queryids; # Start the iterative blastn rounds for my $round (1 .. $roundlimit) { my $saveblst = $outputroot . ".txt"; unlink $saveblst; # print $log info my $curtime = localtime(); my $infile = $round == 1 ? $starterseq : $seq_db_fasta; print $log "$curtime - Blast seq: $infile\n"; print $log "$curtime - Blast db : $blastdb\n"; # run blast system( "$myblast -i $infile -d $blastdb -o $saveblst -e $evalue - +W $wordsize" ); open my $parse, '<', $saveblst or die "Can't open $saveblst: $ +!\n"; open my $fasta, '>', $seq_db_fasta or die "Can't create $seq_db_fasta: $!\n"; # parse blast output (tab form) and add matched id's in hits h +ash while (defined (my $line = <$parse>)) { chomp $line; my @fields = split(/\t/, $line); ++$queryids{$fields[0]}; ++$hits{$fields[1]}; } # get header and sequence with 'fastacmd' and write to new fas +ta file my $hitcount; while (my $key = each %hits) { next if exists $queryids{$key}; # only write uniq ids to file system("fastacmd -d $blastdb -s 'lcl|$key' >> $seq_db_fast +a"); $hitcount++; } close $fasta; close $parse; if ($hitcount) { print $log "Round $round has $hitcount unique hits\n\n"; next; } print $log "No hits in Round $round ==> STOP\n"; last; } print "Total uniq hits found in all rounds: ", scalar keys %hits, +"\n"; close($log); }

        Note that I've eliminated system calls to delete files and get the date and replaced them with the equivalent Perl functions! So far as I can see there are no changes in the logic of the code, but to my mind it is somewhat cleaner.

        True laziness is hard work