Hello Monks!

I'm trying to do something I thought would be relatively simple, but it's, well, not. Basically I have a large BLAST output file and a list of headers in FastA format. All I want to do is extract the blast entries (from "Query=" to "Effective search space used: ~") for each of the headers in the list, into one smaller blast output file. You're probably familiar with blast outputs, but here's a snippet anyway:

Query= M01757:19:000000000-A3WAT:1:1101:15902:2061 1:N:0:1 Length=53 +Score E Sequences producing significant alignments: ( +Bits) Value dbj|AP012555.1| Mycobacterium avium subsp. hominissuis TH135 chr... +75.0 3e-11 gb|CP005928.1| Mycobacterium avium subsp. paratuberculosis MAP4,... +75.0 3e-11 gb|CP000479.1| Mycobacterium avium 104, complete genome +75.0 3e-11 gb|AE016958.1| Mycobacterium avium subsp. paratuberculosis str. ... +75.0 3e-11 >dbj|AP012555.1| Mycobacterium avium subsp. hominissuis TH135 chromoso +mal DNA, complete genome Length=4951217 Score = 75.0 bits (40), Expect = 3e-11 Identities = 42/43 (98%), Gaps = 0/43 (0%) Strand=Plus/Plus Query 7 CGGTGGGCAAGAAGGAACTCGAGAAGGACCCGCTGATGGGCAC 49 |||||||||||||||||||||||||||||||| |||||||||| Sbjct 1543504 CGGTGGGCAAGAAGGAACTCGAGAAGGACCCGGTGATGGGCAC 1543546 Lambda K H 1.33 0.621 1.12 Gapped Lambda K H 1.28 0.460 0.850 Effective search space used: 1252612366488 Query= M01757:19:000000000-A3WAT:1:1101:16099:2064 1:N:0:1 Length=102 ***** No hits found ***** Lambda K H 1.33 0.621 1.12 Gapped Lambda K H 1.28 0.460 0.850 Effective search space used: 3701553444831

...and so on. The header list looks like:

>M01757:19:000000000-A3WAT:1:1101:15902:2061 1:N:0:1 >M01757:19:000000000-A3WAT:1:1101:19601:5029 1:N:0:1 >M01757:19:000000000-A3WAT:1:1101:17448:5029 1:N:0:1

etc etc.

.

I've tried a nested while loop and the until function, that looks like:

# List of headers # BLAST file # Output file (BLAST format) use strict; # Specify I/O my $headerfile = $ARGV[0]; my $blastfile = $ARGV[1]; my $outfile = $ARGV[2]; # open header file open (HEADERFILE, $headerfile) or die "Cannot open header file\n"; # open blast file open (BLASTFILE, $blastfile) or die "Cannot open BLAST file\n"; # open output file open(OUTPUT, ">$outfile"); my $ender = "Query="; while (my $headerline = <HEADERFILE> ) { chomp $headerline; my $headercut = substr $headerline, 1; while (my $blastline = <BLASTFILE>) { if ($blastline =~ $headercut) { print $blastline; print OUTPUT $blastline; print until ( ($_ = <BLASTFILE>) =~ /$ender/i); print OUTPUT until ( ($_ = <BLASTFILE>) =~ /$ender/i); } } seek (BLASTFILE,0,0); } exit;

This works on test files, but isn't appropriate for my proper files as the blast file are >10G each (so looping through the whole thing for a few thousand headers takes forever. Reversing the loop just gives me no output, so I've tried putting the headers into an array and looping through the larger file like so:

# List of headers # BLAST file # Output file (BLAST format) use strict; # Specify I/O my $blastfile = $ARGV[0]; my $headerfile = $ARGV[1]; my $outfile = $ARGV[2]; # open blast file open (BLASTFILE, $blastfile) or die "Cannot open BLAST file\n"; # open header file open (HEADERFILE, $headerfile) or die "Cannot open header file\n"; # open output file open(OUTPUT, ">$outfile"); my @headerline = <HEADERFILE>; my $headerline = join ( '', @headerline); while (my $blastline = <BLASTFILE> ) { my $blastcut = substr $blastline, 7; if ($headerline =~ /\Q$blastcut\E/) { print $blastcut; } } exit;

This works for getting the first part, i.e. the headers from the blast output file so I know the loop in theory works. It's getting the rest of the information which is a bit tricky. I've tried using the until function as before to get to the next query entry like so:

# open blast file open (BLASTFILE, $blastfile) or die "Cannot open BLAST file\n"; # open header file open (HEADERFILE, $headerfile) or die "Cannot open header file\n"; # open output file open(OUTPUT, ">$outfile"); my $ender = "Query="; my @headerline = <HEADERFILE>; my $headerline = join ( '', @headerline); while (my $blastline = <BLASTFILE> ) { my $blastcut = substr $blastline, 7; if ($headerline =~ /\Q$blastcut\E/) { print $blastcut; print until ( ($_ = <BLASTFILE>) =~ /$ender/i); } } exit;

...but that more or less prints the entire blast file, and I have no idea why. I can't use a here-doc command as the blast file doesn't have any solo EOF points between the individual entries, so I'm kind of stuck. Any ideas greatly appreciated (and sorry for the shameful length of this post)!


In reply to Extracting BLAST hits from a list of sequences by Oligo

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.