in reply to Re: Iterative BLAST(N) program
in thread Iterative BLAST(N) program

Thanks for the reply.

So the part at "# get header and sequence with 'fastacmd' and write to new fasta file" is handled correctly? Cuz I'm trying to get only new hits to use in the next rounds and so on.

Replies are listed 'Best First'.
Re^3: Iterative BLAST(N) program
by GrandFather (Saint) on Jun 08, 2011 at 23:13 UTC

    I can't tell because I don't know what the lines you are dealing with look like. I can only help with overall structure and technique. On the face of it the handling looks OK, but it does depend very much on the content of the first two fields on each line. Can the letter case of the contents be different or can numeric representation be different between the fields for example?

    I've applied the changes I suggested in my reply along with a few others to give:

    #!/usr/bin/perl use strict; use warnings; sub iterblast { # settings my $myblast = "blastall -p blastn -m 8"; my $starterseq = "/share/home/tjuh/Thema12-2011/Project/testblast/matchedhits.f +asta"; my $blastdb = "/share/home/tjuh/Thema12-2011/Project/databases/termite_hindg +ut/hindgut_db"; my $outputroot = "/share/home/tjuh/Thema12-2011/Project/testblast/ +test1"; my $evalue = 0.01; my $wordsize = 11; my $roundlimit = 4; # output files: my $seq_db_fasta = $outputroot . ".tmp"; my $seq_db_all = $outputroot . ".all"; my $logfile = $outputroot . ".log"; # remove existing files unlink $seq_db_fasta; unlink $seq_db_all; unlink $logfile; open my $log, '>', $logfile or die "Can't open $logfile: $!\n"; print $log <<LOGMSG; Configurations: Input file : $starterseq Output files : $outputroot\- Blast Engine : $myblast Target database : $blastdb E-value cut-off : $evalue Blast word size : $wordsize Blast round limit: $roundlimit Runs: LOGMSG # hits hash my %hits; my %queryids; # Start the iterative blastn rounds for my $round (1 .. $roundlimit) { my $saveblst = $outputroot . ".txt"; unlink $saveblst; # print $log info my $curtime = localtime(); my $infile = $round == 1 ? $starterseq : $seq_db_fasta; print $log "$curtime - Blast seq: $infile\n"; print $log "$curtime - Blast db : $blastdb\n"; # run blast system( "$myblast -i $infile -d $blastdb -o $saveblst -e $evalue - +W $wordsize" ); open my $parse, '<', $saveblst or die "Can't open $saveblst: $ +!\n"; open my $fasta, '>', $seq_db_fasta or die "Can't create $seq_db_fasta: $!\n"; # parse blast output (tab form) and add matched id's in hits h +ash while (defined (my $line = <$parse>)) { chomp $line; my @fields = split(/\t/, $line); ++$queryids{$fields[0]}; ++$hits{$fields[1]}; } # get header and sequence with 'fastacmd' and write to new fas +ta file my $hitcount; while (my $key = each %hits) { next if exists $queryids{$key}; # only write uniq ids to file system("fastacmd -d $blastdb -s 'lcl|$key' >> $seq_db_fast +a"); $hitcount++; } close $fasta; close $parse; if ($hitcount) { print $log "Round $round has $hitcount unique hits\n\n"; next; } print $log "No hits in Round $round ==> STOP\n"; last; } print "Total uniq hits found in all rounds: ", scalar keys %hits, +"\n"; close($log); }

    Note that I've eliminated system calls to delete files and get the date and replaced them with the equivalent Perl functions! So far as I can see there are no changes in the logic of the code, but to my mind it is somewhat cleaner.

    True laziness is hard work

      Thank u GrandFather for cleaning up the code a bit, much appreciated

      Im adding an example blastoutput ($saveblast) of the first and second round so u can see if im doing it rite

      Round 1 blast output

      2004112448 2004112448 100.00 738 0 0 1 738 1 + 738 0.0 1421 2004112448 2004113832 90.44 586 54 2 153 738 1 +6 599 0.0 660 2004112448 2004101124 89.67 368 37 1 361 728 1 + 367 3e-104 379 2004112448 2004105206 83.33 606 101 0 76 681 4 + 609 1e-100 367 2004112448 2004086271 85.95 363 51 0 341 703 8 +3 445 5e-75 281 2004112448 2004120286 80.98 205 39 0 208 412 3 +73 577 1e-19 97.6 2004112448 2004133966 91.43 70 6 0 618 687 261 + 192 9e-18 91.7 2004112448 2004136134 87.84 74 9 0 193 266 345 + 418 5e-13 75.8 2004112448 2004112239 84.34 83 13 0 461 543 10 +1 183 8e-09 61.9 2004112448 2004112239 89.36 47 5 0 366 412 6 + 52 2e-06 54.0 2004112448 2004097170 85.14 74 11 0 193 266 34 +6 419 3e-08 60.0 2004112448 2004127588 86.15 65 9 0 193 257 346 + 410 1e-07 58.0 2004112448 2004119982 86.15 65 9 0 193 257 346 + 410 1e-07 58.0 2004112448 2004100435 86.15 65 9 0 193 257 55 + 119 1e-07 58.0 2004112448 2004135256 82.56 86 15 0 184 269 33 +7 422 8e-06 52.0 2004112448 2004121539 81.93 83 15 0 193 275 34 +6 428 5e-04 46.1 2004112448 2004092379 81.93 83 15 0 193 275 34 +6 428 5e-04 46.1 2004099281 2004099281 100.00 844 0 0 1 844 1 + 844 0.0 1661 2004099281 2004142218 90.13 446 43 1 400 844 1 + 446 1e-149 529 2004099281 2004092058 81.88 458 82 1 376 832 7 +6 533 1e-63 244 2004099281 2004147330 83.64 110 18 0 424 533 5 +41 650 6e-13 75.8 2004099281 2004147330 82.86 70 12 0 230 299 34 +7 416 0.002 44.1 2004099281 2004146484 86.05 86 12 0 685 770 57 +7 662 6e-13 75.8 2004099281 2004146484 85.88 85 12 0 424 508 31 +6 400 2e-12 73.8 2004099281 2004146484 83.10 71 12 0 229 299 12 +1 191 5e-04 46.1 2004099281 2004106112 81.68 131 23 1 694 823 6 +13 743 2e-09 63.9 2004099281 2004147824 85.71 70 10 0 439 508 49 +0 559 4e-08 60.0 2004099281 2004153303 88.64 44 5 0 439 482 619 + 662 1e-04 48.1 2004099281 2004082897 81.52 92 17 0 664 755 10 + 101 1e-04 48.1 2004099281 2004142389 89.47 38 4 0 795 832 169 + 206 0.002 44.1

      round 2 blast output

      2004082897 2004082897 100.00 810 0 0 1 810 1 + 810 0.0 1582 2004082897 2004156684 87.74 261 32 0 3 263 153 + 413 1e-69 264 2004082897 2004149613 80.35 514 99 2 289 801 1 + 513 2e-53 210 2004082897 2004146484 87.91 91 11 0 8 98 554 + 644 2e-18 93.7 2004082897 2004126309 81.10 164 31 0 8 171 80 + 243 4e-14 79.8 2004082897 2004106112 85.11 94 14 0 8 101 581 + 674 6e-13 75.8 2004082897 2004127134 95.45 44 2 0 127 170 745 + 788 9e-12 71.9 2004082897 2004127134 91.18 34 3 0 223 256 844 + 877 0.002 44.1 2004082897 2004117857 79.46 185 38 0 621 805 4 +5 229 9e-12 71.9 2004082897 2004142219 87.69 65 8 0 25 89 904 + 968 6e-10 65.9 2004082897 2004142219 90.57 53 5 0 136 188 101 +5 1067 6e-10 65.9 2004082897 2004142389 100.00 29 0 0 571 599 60 +1 629 1e-07 58.0 2004082897 2004142389 96.43 28 1 0 143 170 170 + 197 1e-04 48.1 2004082897 2004092058 83.15 89 15 0 13 101 367 + 455 1e-07 58.0 2004082897 2004142218 87.27 55 7 0 10 64 265 + 319 2e-06 54.0 2004082897 2004117248 96.67 30 1 0 142 171 667 + 696 8e-06 52.0 2004082897 2004125016 96.55 29 1 0 571 599 571 + 599 3e-05 50.1 2004082897 2004125016 96.15 26 1 0 145 170 142 + 167 0.002 44.1 2004082897 2004099281 81.52 92 17 0 10 101 664 + 755 1e-04 48.1 2004082897 2004117249 93.10 29 2 0 571 599 244 + 272 0.008 42.1 2004083856 2004083856 100.00 774 0 0 1 774 1 + 774 0.0 1534 2004083856 2004128009 87.09 767 84 3 13 764 41 +2 1178 0.0 735 2004083856 2004100486 89.04 365 40 0 408 772 1 +23 487 1e-112 406 2004083856 2004112466 91.47 211 18 0 1 211 598 + 808 3e-73 276 2004083856 2004147543 84.54 194 30 0 346 539 8 +71 1064 2e-34 147 2004083856 2004131474 84.74 190 29 0 346 535 8 +86 1075 2e-34 147 2004083856 2004131474 87.80 41 5 0 61 101 613 + 653 0.008 42.1 2004083856 2004131474 82.61 69 12 0 138 206 69 +0 758 0.008 42.1 2004083856 2004097313 89.36 94 10 0 679 772 76 + 169 2e-22 107 2004083856 2004149567 90.91 44 4 0 73 116 235 + 278 5e-07 56.0

      As u can see for example 2004099281 and 2004082897 in the second column, they appear in both rounds, and what im trying to do is to only get the new uniq hits that are to be used in new rounds. I hope i made myself a bit clear, its a bit different to explain.

        I started working through your code to eliminate the system calls and modify the file I/O to use string files when I realised that toolic had already identified a major bug with your code. See his second bullet point in Re: Iterative BLAST(N) program. toolic also makes the very strong point that you should write some test code or otherwise validate your results using a test set of data.

        It may be that the following self contained sample code provides a test bed for you to determine what you need to know:

        #!/usr/bin/perl use strict; use warnings; my @roundData = (<<DATA1, <<DATA2); 2004112448 2004112448 100.00 738 0 0 1 738 1 + 738 0.0 1421 2004112448 2004113832 90.44 586 54 2 153 738 1 +6 599 0.0 660 2004112448 2004101124 89.67 368 37 1 361 728 1 + 367 3e-104 379 2004112448 2004105206 83.33 606 101 0 76 681 4 + 609 1e-100 367 2004112448 2004086271 85.95 363 51 0 341 703 8 +3 445 5e-75 281 2004112448 2004120286 80.98 205 39 0 208 412 3 +73 577 1e-19 97.6 2004112448 2004133966 91.43 70 6 0 618 687 261 + 192 9e-18 91.7 2004112448 2004136134 87.84 74 9 0 193 266 345 + 418 5e-13 75.8 2004112448 2004112239 84.34 83 13 0 461 543 10 +1 183 8e-09 61.9 2004112448 2004112239 89.36 47 5 0 366 412 6 + 52 2e-06 54.0 2004112448 2004097170 85.14 74 11 0 193 266 34 +6 419 3e-08 60.0 2004112448 2004127588 86.15 65 9 0 193 257 346 + 410 1e-07 58.0 2004112448 2004119982 86.15 65 9 0 193 257 346 + 410 1e-07 58.0 2004112448 2004100435 86.15 65 9 0 193 257 55 + 119 1e-07 58.0 2004112448 2004135256 82.56 86 15 0 184 269 33 +7 422 8e-06 52.0 2004112448 2004121539 81.93 83 15 0 193 275 34 +6 428 5e-04 46.1 2004112448 2004092379 81.93 83 15 0 193 275 34 +6 428 5e-04 46.1 2004099281 2004099281 100.00 844 0 0 1 844 1 + 844 0.0 1661 2004099281 2004142218 90.13 446 43 1 400 844 1 + 446 1e-149 529 2004099281 2004092058 81.88 458 82 1 376 832 7 +6 533 1e-63 244 2004099281 2004147330 83.64 110 18 0 424 533 5 +41 650 6e-13 75.8 2004099281 2004147330 82.86 70 12 0 230 299 34 +7 416 0.002 44.1 2004099281 2004146484 86.05 86 12 0 685 770 57 +7 662 6e-13 75.8 2004099281 2004146484 85.88 85 12 0 424 508 31 +6 400 2e-12 73.8 2004099281 2004146484 83.10 71 12 0 229 299 12 +1 191 5e-04 46.1 2004099281 2004106112 81.68 131 23 1 694 823 6 +13 743 2e-09 63.9 2004099281 2004147824 85.71 70 10 0 439 508 49 +0 559 4e-08 60.0 2004099281 2004153303 88.64 44 5 0 439 482 619 + 662 1e-04 48.1 2004099281 2004082897 81.52 92 17 0 664 755 10 + 101 1e-04 48.1 2004099281 2004142389 89.47 38 4 0 795 832 169 + 206 0.002 44.1 DATA1 2004082897 2004082897 100.00 810 0 0 1 810 1 + 810 0.0 1582 2004082897 2004156684 87.74 261 32 0 3 263 153 + 413 1e-69 264 2004082897 2004149613 80.35 514 99 2 289 801 1 + 513 2e-53 210 2004082897 2004146484 87.91 91 11 0 8 98 554 + 644 2e-18 93.7 2004082897 2004126309 81.10 164 31 0 8 171 80 + 243 4e-14 79.8 2004082897 2004106112 85.11 94 14 0 8 101 581 + 674 6e-13 75.8 2004082897 2004127134 95.45 44 2 0 127 170 745 + 788 9e-12 71.9 2004082897 2004127134 91.18 34 3 0 223 256 844 + 877 0.002 44.1 2004082897 2004117857 79.46 185 38 0 621 805 4 +5 229 9e-12 71.9 2004082897 2004142219 87.69 65 8 0 25 89 904 + 968 6e-10 65.9 2004082897 2004142219 90.57 53 5 0 136 188 101 +5 1067 6e-10 65.9 2004082897 2004142389 100.00 29 0 0 571 599 60 +1 629 1e-07 58.0 2004082897 2004142389 96.43 28 1 0 143 170 170 + 197 1e-04 48.1 2004082897 2004092058 83.15 89 15 0 13 101 367 + 455 1e-07 58.0 2004082897 2004142218 87.27 55 7 0 10 64 265 + 319 2e-06 54.0 2004082897 2004117248 96.67 30 1 0 142 171 667 + 696 8e-06 52.0 2004082897 2004125016 96.55 29 1 0 571 599 571 + 599 3e-05 50.1 2004082897 2004125016 96.15 26 1 0 145 170 142 + 167 0.002 44.1 2004082897 2004099281 81.52 92 17 0 10 101 664 + 755 1e-04 48.1 2004082897 2004117249 93.10 29 2 0 571 599 244 + 272 0.008 42.1 2004083856 2004083856 100.00 774 0 0 1 774 1 + 774 0.0 1534 2004083856 2004128009 87.09 767 84 3 13 764 41 +2 1178 0.0 735 2004083856 2004100486 89.04 365 40 0 408 772 1 +23 487 1e-112 406 2004083856 2004112466 91.47 211 18 0 1 211 598 + 808 3e-73 276 2004083856 2004147543 84.54 194 30 0 346 539 8 +71 1064 2e-34 147 2004083856 2004131474 84.74 190 29 0 346 535 8 +86 1075 2e-34 147 2004083856 2004131474 87.80 41 5 0 61 101 613 + 653 0.008 42.1 2004083856 2004131474 82.61 69 12 0 138 206 69 +0 758 0.008 42.1 2004083856 2004097313 89.36 94 10 0 679 772 76 + 169 2e-22 107 2004083856 2004149567 90.91 44 4 0 73 116 235 + 278 5e-07 56.0 DATA2 iterblast(); sub iterblast { # settings my $myblast = "blastall -p blastn -m 8"; my $blastdb = 'foo'; my $evalue = 0.01; my $wordsize = 11; my $roundlimit = 4; print <<LOGMSG; Configurations: Output files : xxx\- Blast Engine : $myblast Target database : $blastdb E-value cut-off : $evalue Blast word size : $wordsize Blast round limit: $roundlimit Runs: LOGMSG # hits hash my %hits; my %queryids; my @fastaDbs; # Start the iterative blastn rounds for my $round (1 .. @roundData) { open my $parse, '<', \$roundData[$round - 1]; push @fastaDbs, ''; open my $fasta, '>', \$fastaDbs[-1]; # parse blast output (tab form) and add matched id's in hits h +ash while (defined (my $line = <$parse>)) { chomp $line; my @fields = split(/\t/, $line); ++$queryids{$fields[0]}; ++$hits{$fields[1]}; } # get header and sequence with 'fastacmd' and write to new fas +ta file my $hitcount; for my $hit (keys %hits) { next if exists $queryids{$hit}; # only write uniq ids to file print $fasta "fastacmd -d $blastdb -s 'lcl|$hit'\n"; $hitcount++; } close $fasta; close $parse; if ($hitcount) { print "Round $round has $hitcount unique hits\n\n"; next; } print "No hits in Round $round ==> STOP\n"; last; } print '-' x 20, "\n$_\n" for @fastaDbs; print "Total uniq hits found in all rounds: ", scalar keys %hits, +"\n"; }

        The major interesting technique here (used in various ways in several places) is using a variable as a file. That avoids writing or reading any actual files at all, but allows most of the interesting processing code to remain intact.

        True laziness is hard work