Hello, I am trying to compare the sequences in two FASTA files that have different headers, using BioPerl. In case you are not familiar with the FASTA format, a FASTA file looks like this:

>Header
Sequence of letters of varied length.
>Header
Sequence of letters of varied length.

I have written a Perl script (with the help of several online sources) that takes FILE1 and FILE2 and outputs those sequences from FILE2 (with their headers) that match sequences in FILE1. The only problem is that it runs very slowly: FILE1 has around 200 sequences, but FILE2 has more than 12 million! Is there a faster way to do this comparison? Could you kindly show me how to modify the code to achieve that? Also, is there a way to write the matching headers from FILE1 to a separate output file (i.e., to find out which sequences of FILE1 had matches in FILE2)? In this code I only create an array of the FILE1 sequences, not their headers. I'm not sure how to create a 'struct' of headers and sequences, so any advice would be appreciated. The code I have written is as follows:

#!/usr/local/bin/perl
#use strict; 
use warnings;
use Errno;
use lib " /RemotePerl";
use lib " /System/Library/Perl/5.8.6";
use lib " /Library/Perl/5.8.6";
use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;

if (@ARGV < 2) { die "usage: compare.pl <filename1> <filename2>\n"; }

$file1 = $ARGV[0];
$file2 = $ARGV[1];

#Read in first fasta file
$FastaFile1 = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
print "File name of the first fasta file is: ".$file1."\n";

#Create array of the fasta sequences from file 1
my @fasta_objs =();

#Add sequences of file1 to array called fasta_objs
while (my $seqFile1 = $FastaFile1->next_seq() )
{
push @fasta_objs,$seqFile1->seq;
}

#Read in second fasta file
$FastaFile2 = Bio::SeqIO->new(-file => $ARGV[1], -format => 'fasta');
print "File name of the second fasta file is: ".$file2."\n";

#Setup output file: sequences from File2 which match sequences of File1
my $fasta = Bio::SeqIO->new(-file => ">EQUAL_HITS.fasta", -format => "fasta", -flush  => 0);

# write matching sequences of file2 to the output file
# Note that if several matches to one sequence exist, ALL matches are output to the file
while(my $seqFile2 = $FastaFile2->next_seq() )
{
        $fasta->write_seq($seqFile2) if (grep {$_ eq $seqFile2->seq} @fasta_objs);
}

print "Comparison is complete! \n";

Thank you very much in advance for your help!
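
One possible direction, offered as a sketch rather than a definitive fix: the `grep` over `@fasta_objs` rescans all FILE1 sequences for every FILE2 record, whereas a hash keyed on the sequence string turns each check into a single lookup. Storing the FILE1 headers as the hash values also gives the 'struct' of headers and sequences asked about. The helper names `each_fasta_record` and `compare_fasta` and the output file name `MATCHED_FILE1_HEADERS.txt` are made up for this example. It also assumes plain line-based FASTA parsing in place of Bio::SeqIO (which may help with 12 million records, though that is untested here) and exact, case-sensitive matching as in the original `eq` test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle and call
# $callback->($header, $sequence) once per record.
# Sequences that span several lines are joined into one string.
sub each_fasta_record {
    my ($fh, $callback) = @_;
    my ($header, $seq);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;                 # strip the line ending
        if ($line =~ /^>(.*)/) {
            $callback->($header, $seq) if defined $header;
            ($header, $seq) = ($1, '');
        }
        elsif (defined $header) {
            $seq .= $line;
        }
    }
    $callback->($header, $seq) if defined $header;   # last record
    return;
}

# Hypothetical helper: index FILE1 by sequence, then stream FILE2 once.
# Matching FILE2 records go to $hits_fh; each FILE1 header that had at
# least one match goes to $matched_fh (once). Returns the number of hits.
sub compare_fasta {
    my ($fh1, $fh2, $hits_fh, $matched_fh) = @_;

    my %headers_for;    # sequence => [ FILE1 headers with that sequence ]
    each_fasta_record($fh1, sub {
        my ($header, $seq) = @_;
        push @{ $headers_for{$seq} }, $header;
    });

    my %seen;           # FILE1 headers already reported
    my $count = 0;
    each_fasta_record($fh2, sub {
        my ($header, $seq) = @_;
        my $file1_headers = $headers_for{$seq} or return;   # no match
        print {$hits_fh} ">$header\n$seq\n";
        $count++;
        for my $h (@$file1_headers) {
            print {$matched_fh} "$h\n" unless $seen{$h}++;
        }
    });
    return $count;
}

# Runs only when called as: compare.pl <filename1> <filename2>
if (@ARGV == 2) {
    open my $fh1, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $fh2, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    open my $hits, '>', 'EQUAL_HITS.fasta'
        or die "Cannot write EQUAL_HITS.fasta: $!";
    open my $matched, '>', 'MATCHED_FILE1_HEADERS.txt'
        or die "Cannot write MATCHED_FILE1_HEADERS.txt: $!";
    my $n = compare_fasta($fh1, $fh2, $hits, $matched);
    close $hits    or die "Close failed: $!";
    close $matched or die "Close failed: $!";
    print "Comparison is complete! $n matching sequences written.\n";
}
```

Note that this sketch writes each matching sequence on a single line, so the output is not wrapped the way `write_seq` would wrap it. The same hash idea should carry over to the Bio::SeqIO loop in the original script: fill the hash with `$seqFile1->seq` as the key while reading FILE1, then test `exists` on `$seqFile2->seq` instead of calling `grep`.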

In reply to Compare fasta files with different headers by InfoSeeker
