comment on

Hello Monks!

I'm trying to do something I thought would be relatively simple, but it's, well, not. Basically I have a large BLAST output file and a list of headers in FastA format. All I want to do is extract the blast entries (from "Query=" to "Effective search space used: ~") for each of the headers in the list, into one smaller blast output file. You're probably familiar with blast outputs, but here's a snippet anyway:

Query= M01757:19:000000000-A3WAT:1:1101:15902:2061 1:N:0:1

Length=53
                                                                      
+Score     E
Sequences producing significant alignments:                          (
+Bits)  Value

dbj|AP012555.1|  Mycobacterium avium subsp. hominissuis TH135 chr...  
+75.0    3e-11
gb|CP005928.1|  Mycobacterium avium subsp. paratuberculosis MAP4,...  
+75.0    3e-11
gb|CP000479.1|  Mycobacterium avium 104, complete genome              
+75.0    3e-11
gb|AE016958.1|  Mycobacterium avium subsp. paratuberculosis str. ...  
+75.0    3e-11


>dbj|AP012555.1| Mycobacterium avium subsp. hominissuis TH135 chromoso
+mal DNA, 
complete genome
Length=4951217

 Score = 75.0 bits (40),  Expect = 3e-11
 Identities = 42/43 (98%), Gaps = 0/43 (0%)
 Strand=Plus/Plus

Query  7        CGGTGGGCAAGAAGGAACTCGAGAAGGACCCGCTGATGGGCAC  49
                |||||||||||||||||||||||||||||||| ||||||||||
Sbjct  1543504  CGGTGGGCAAGAAGGAACTCGAGAAGGACCCGGTGATGGGCAC  1543546



Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850 

Effective search space used: 1252612366488

Query= M01757:19:000000000-A3WAT:1:1101:16099:2064 1:N:0:1

Length=102


***** No hits found *****



Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850 

Effective search space used: 3701553444831
[download]

...and so on. The header list looks like:

>M01757:19:000000000-A3WAT:1:1101:15902:2061 1:N:0:1
>M01757:19:000000000-A3WAT:1:1101:19601:5029 1:N:0:1
>M01757:19:000000000-A3WAT:1:1101:17448:5029 1:N:0:1
[download]

etc etc.

I've tried a nested while loop and the until function, that looks like:

# List of headers
# BLAST file
# Output file (BLAST format)

use strict;

# Specify I/O

my $headerfile = $ARGV[0];
my $blastfile = $ARGV[1];
my $outfile = $ARGV[2];


# open header file
open (HEADERFILE, $headerfile) or die "Cannot open header file\n";
# open blast file
open (BLASTFILE, $blastfile) or die "Cannot open BLAST file\n";
# open output file
open(OUTPUT, ">$outfile");

my $ender = "Query=";

while (my $headerline = <HEADERFILE> ) {
    chomp $headerline;
    my $headercut = substr $headerline, 1;
    
    while (my $blastline = <BLASTFILE>) { 
        if ($blastline =~ $headercut) {
            print $blastline;
            print OUTPUT $blastline;
            print until ( ($_ = <BLASTFILE>) =~ /$ender/i);
            print OUTPUT until ( ($_ = <BLASTFILE>) =~ /$ender/i);
            }
            }
        seek (BLASTFILE,0,0);
        }
    

exit;
[download]

This works on test files, but isn't appropriate for my proper files as the blast file are >10G each (so looping through the whole thing for a few thousand headers takes forever. Reversing the loop just gives me no output, so I've tried putting the headers into an array and looping through the larger file like so:

# List of headers
# BLAST file
# Output file (BLAST format)

use strict;

# Specify I/O
my $blastfile = $ARGV[0];
my $headerfile = $ARGV[1];
my $outfile = $ARGV[2];


# open blast file
open (BLASTFILE, $blastfile) or die "Cannot open BLAST file\n";
# open header file
open (HEADERFILE, $headerfile) or die "Cannot open header file\n";
# open output file
open(OUTPUT, ">$outfile");


my @headerline = <HEADERFILE>;
my $headerline = join ( '', @headerline);

while (my $blastline = <BLASTFILE> ) {
    my $blastcut = substr $blastline, 7;
    if ($headerline =~ /\Q$blastcut\E/) {
    print $blastcut;
    
    }
    }
exit;
[download]

This works for getting the first part, i.e. the headers from the blast output file so I know the loop in theory works. It's getting the rest of the information which is a bit tricky. I've tried using the until function as before to get to the next query entry like so:

# open blast file
open (BLASTFILE, $blastfile) or die "Cannot open BLAST file\n";
# open header file
open (HEADERFILE, $headerfile) or die "Cannot open header file\n";
# open output file
open(OUTPUT, ">$outfile");

my $ender = "Query=";
my @headerline = <HEADERFILE>;
my $headerline = join ( '', @headerline);

while (my $blastline = <BLASTFILE> ) {
    my $blastcut = substr $blastline, 7;
    if ($headerline =~ /\Q$blastcut\E/) {
    print $blastcut;
    print until ( ($_ = <BLASTFILE>) =~ /$ender/i);
    }
    }
exit;
[download]

...but that more or less prints the entire blast file, and I have no idea why. I can't use a here-doc command as the blast file doesn't have any solo EOF points between the individual entries, so I'm kind of stuck. Any ideas greatly appreciated (and sorry for the shameful length of this post)!

In reply to Extracting BLAST hits from a list of sequences by Oligo

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.