Compare fasta files with different headers

InfoSeeker has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am trying to compare two fasta files with different headers, using BioPerl. In case you are not familiar with the fasta format, a fasta file is in the format:

>Header
Sequence of letters of varied length.
>Header
Sequence of letters of varied length.

I have written a Perl script (with the help of several online sources) that takes FILE1 and FILE2, and outputs from FILE2 those sequences (and their headers) which match the sequences from FILE1. The only problem I have is that the run is very slow, as FILE1 has around 200 sequences but FILE2 has more than 12 million sequences! Is there a faster way to do this comparison? Could you kindly show me how I could modify the code to achieve that? Also, is there a way to output the matching header of FILE1 to a different output file (i.e., to find out which sequence of FILE1 had matches in FILE2)? In this code I am only creating an array of the FILE1 sequences but not their headers. I'm not sure how to create a 'struct' of headers and sequences, so any advice would be appreciated. The code I have written is as follows:

#!/usr/local/bin/perl
#use strict; 
use warnings;
use Errno;
use lib " /RemotePerl";
use lib " /System/Library/Perl/5.8.6";
use lib " /Library/Perl/5.8.6";
use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;

if (@ARGV < 2) { die "usage: compare.pl <filename1> <filename2>\n"; }

$file1 = $ARGV[0];
$file2 = $ARGV[1];

#Read in first fasta file
$FastaFile1 = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
print "File name of the first fasta file is: ".$file1."\n";

#Create array of the fasta sequences from file 1
my @fasta_objs =();

#Add sequences of file1 to array called fasta_objs
while (my $seqFile1 = $FastaFile1->next_seq() )
{
push @fasta_objs,$seqFile1->seq;
}

#Read in second fasta file
$FastaFile2 = Bio::SeqIO->new(-file => $ARGV[1], -format => 'fasta');
print "File name of the second fasta file is: ".$file2."\n";

#Setup output file: sequences from File2 which match sequences of File1
my $fasta = Bio::SeqIO->new(-file => ">EQUAL_HITS.fasta", -format => "fasta", -flush  => 0);

# write matching sequences of file2 to the output file
# Note that if several matches to one sequence exist, ALL matches are output to the file
while(my $seqFile2 = $FastaFile2->next_seq() )
{
        $fasta->write_seq($seqFile2) if (grep {$_ eq $seqFile2->seq} @fasta_objs);
}

print "Comparison is complete! \n";

Thank you very much in advance for your help!


Replies are listed 'Best First'.
Re: Compare fasta files with different headers by kcott (Archbishop) on Dec 01, 2010 at 02:48 UTC
You're reading through the entire `@fasta_objs` array 12 million times. Using a hash, rather than an array, should greatly reduce processing time. Change these three lines:

    my @fasta_objs =();
    ...
    push @fasta_objs,$seqFile1->seq;
    ...
    $fasta->write_seq($seqFile2) if (grep {$_ eq $seqFile2->seq} @fasta_objs);

to

    my %fasta_objs =();
    ...
    ++$fasta_objs{$seqFile1->seq};
    ...
    $fasta->write_seq($seqFile2) if exists $fasta_objs{$seqFile2->seq};

I'm not familiar with the `Bio::` modules so I can't offer advice on how to capture the headers (perhaps you already know or can find out through the documentation); however, once you have the header, you can store it in the hash. So, instead of:

    ++$fasta_objs{$seqFile1->seq};

use

    $fasta_objs{$seqFile1->seq} = $header;

This assumes header/sequence combinations are unique. If that's not the case, you'll need a more complex storage solution, maybe something like:

    seq => [header1, header2, ...]

Finally, I would strongly recommend that you do not comment out `use strict;` globally. If you really need to, just turn strictures off for a small piece of code, e.g.

    # Comment explaining why you're doing this
    no strict 'refs';
    ... small piece of code here ...
    use strict 'refs';

-- Ken
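The `seq => [header1, header2, ...]` idea above can be sketched in plain Perl as a hash of array references. This is only an illustration under stated assumptions: the records in `@file1_records` are made-up data, and `%headers_for` and `matching_headers` are hypothetical names, not anything from BioPerl. In the real script the header would come from the sequence object (probably via its `display_id` method, but check the documentation).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical FILE1 records as [header, sequence] pairs; in the real
# script these would come from the fasta reader.
my @file1_records = (
    [ 'probeA', 'ACGT' ],
    [ 'probeB', 'TTGA' ],
    [ 'probeC', 'ACGT' ],    # same sequence as probeA, different header
);

# Map each sequence to the list of headers that carry it.
my %headers_for;
for my $rec (@file1_records) {
    my ( $header, $seq ) = @$rec;
    push @{ $headers_for{$seq} }, $header;
}

# Return the FILE1 headers whose sequence equals $seq (empty list if none).
sub matching_headers {
    my ($seq) = @_;
    return exists $headers_for{$seq} ? @{ $headers_for{$seq} } : ();
}

my @hits = matching_headers('ACGT');
my @none = matching_headers('GGGG');
print "ACGT is carried by: @hits\n";
print "GGGG matches ", scalar(@none), " headers\n";
```

Each lookup is a single hash access, so the cost per FILE2 sequence no longer depends on how many sequences FILE1 contains.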
Re: Compare fasta files with different headers by BrowserUk (Patriarch) on Dec 01, 2010 at 03:21 UTC
This is untested, but could run roughly 200 times faster than your original. Also, from my reading of the POD, the output file should contain both sequences and their ids.

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use Errno;
    use lib " /RemotePerl";
    use lib " /System/Library/Perl/5.8.6";
    use lib " /Library/Perl/5.8.6";
    use Bio::Perl;
    use Bio::SeqIO;
    use Bio::SearchIO;

    if (@ARGV < 2) { die "usage: compare.pl <filename1> <filename2>\n"; }

    my $file1 = $ARGV[0];
    my $file2 = $ARGV[1];

    # Open first fasta file
    my $File1 = Bio::SeqIO->new(-file => $file1, -format => 'fasta');
    print "File name of the first fasta file is: ".$file1."\n";

    #Create a hash from file 1
    my %fasta1;
    while( my $seq1 = $File1->next_seq() ) {
        $fasta1{ $seq1->seq } = $seq1;
    }

    # Open second fasta file
    my $File2 = Bio::SeqIO->new(-file => $file2, -format => 'fasta');
    print "File name of the second fasta file is: ".$file2."\n";

    #Setup output file: sequences from File2 which match sequences of File1
    my $output = Bio::SeqIO->new(
        -file => ">EQUAL_HITS.fasta", -format => "fasta", -flush => 0
    );

    # write matching sequences of file2 to the output file
    # Note that if several matches to one sequence exist,
    # ALL matches are output to the file
    while( my $seq2 = $File2->next_seq() ) {
        if( exists $fasta1{ $seq2->seq } ) {
            $output->write_seq( $fasta1{ $seq2->seq } );
        }
    }

    print "Comparison is complete! \n";

The main change is that it creates a hash rather than an array from the sequences of file1. This makes each lookup O(1) rather than a linear scan of about 200 entries. It stores the sequence objects returned by next_seq() as the values, keyed by the sequence, and when matches are found, it gives the sequence objects back to write_seq(), for inclusion in the output file. Hopefully, write_seq() knows what to do with them.

Other things to note: use strict is not commented out; I've shortened some of your variable names; the code is indented; as you've set up variables `$file1` & `$file2`, you might as well use them rather than `$ARGV[n]`.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
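For the second half of the original question (writing the FILE2 matches to one place and recording which FILE1 headers had matches), here is a rough sketch in plain Perl without BioPerl. It is a swapped-in technique, not the code from this thread: `next_record` and `compare_fasta` are hypothetical helpers, the parser assumes `>` appears only at the start of header lines, and the demo uses small in-memory strings instead of real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read the next fasta record from a filehandle.
# Returns (header, sequence), or an empty list at end of input.
# Hypothetical helper; assumes '>' only ever starts a header line.
sub next_record {
    my ($fh) = @_;
    local $/ = "\n>";          # a record ends where the next header begins
    my $chunk = <$fh>;
    return unless defined $chunk;
    chomp $chunk;              # drop the trailing "\n>" if present
    $chunk =~ s/^>//;          # the first record keeps its leading '>'
    my ( $header, @lines ) = split /\n/, $chunk;
    return ( $header, join '', @lines );
}

# Copy FILE2 records whose sequence occurs in FILE1 to $hits_fh and
# return the sorted FILE1 headers that had at least one match.
sub compare_fasta {
    my ( $fh1, $fh2, $hits_fh ) = @_;

    my %headers_for;           # sequence => [ FILE1 headers ]
    while ( my ( $header, $seq ) = next_record($fh1) ) {
        push @{ $headers_for{$seq} }, $header;
    }

    my %matched;               # FILE1 headers seen in at least one match
    while ( my ( $header, $seq ) = next_record($fh2) ) {
        next unless exists $headers_for{$seq};
        print {$hits_fh} ">$header\n$seq\n";
        $matched{$_} = 1 for @{ $headers_for{$seq} };
    }
    return sort keys %matched;
}

# Demo on made-up data held in memory.
my $file1 = ">a1\nACGT\n>a2\nTTGA\n";
my $file2 = ">b1\nGGGG\n>b2\nAC\nGT\n>b3\nCCCC\n";
open my $fh1, '<', \$file1 or die $!;
open my $fh2, '<', \$file2 or die $!;
my $hits = '';
open my $out, '>', \$hits or die $!;
my @matched = compare_fasta( $fh1, $fh2, $out );
close $out;

print $hits;
print "matched FILE1 headers: @matched\n";
```

Only the roughly 200 FILE1 sequences are held in memory; FILE2 is streamed one record at a time, which matters when it has millions of entries. In a real run the three handles would be opened on the two input files and on the hits file, and the returned headers printed to a second output file.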
Re^2: Compare fasta files with different headers by InfoSeeker (Novice) on Dec 07, 2010 at 02:11 UTC
Thank you both for the responses! It works fine now :)