InfoSeeker has asked for the wisdom of the Perl Monks concerning the following question:
I have written a Perl script (with the help of several online sources) that takes FILE1 and FILE2, and outputs from FILE2 those sequences (and their headers) which match the sequences from FILE1. The only problem I have is that the run is very slow, as FILE1 has around 200 sequences but FILE2 has more than 12 million sequences! Is there a faster way to do this comparison? Could you kindly show me how I could modify the code to achieve that?

Also, is there a way to output the matching headers of FILE1 to a different output file (i.e., to find out which sequences of FILE1 had matches in FILE2)? In this code I am only creating an array of the FILE1 sequences but not their headers. I'm not sure how to create a 'struct' of headers and sequences, so any advice would be appreciated.

Both files are in FASTA format:

    >Header
    Sequence of letters of varied length.
    >Header
    Sequence of letters of varied length.

The code I have written is as follows:

    #!/usr/local/bin/perl
    #use strict;
    use warnings;
    use Errno;
    use lib " /RemotePerl";
    use lib " /System/Library/Perl/5.8.6";
    use lib " /Library/Perl/5.8.6";
    use Bio::Perl;
    use Bio::SeqIO;
    use IO::String;
    use Bio::SearchIO;

    if (@ARGV < 2) {
        die "usage: compare.pl <filename1> <filename2>\n";
    }
    $file1 = $ARGV[0];
    $file2 = $ARGV[1];

    #Read in first fasta file
    $FastaFile1 = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
    print "File name of the first fasta file is: ".$file1."\n";

    #Create array of the fasta sequences from file 1
    my @fasta_objs =();

    #Add sequences of file1 to array called fasta_objs
    while (my $seqFile1 = $FastaFile1->next_seq() ) {
        push @fasta_objs,$seqFile1->seq;
    }

    #Read in second fasta file
    $FastaFile2 = Bio::SeqIO->new(-file => $ARGV[1], -format => 'fasta');
    print "File name of the second fasta file is: ".$file2."\n";

    #Setup output file: sequences from File2 which match sequences of File1
    my $fasta = Bio::SeqIO->new(-file => ">EQUAL_HITS.fasta", -format => "fasta", -flush => 0);

    # write matching sequences of file2 to the output file
    # Note that if several matches to one sequence exist, ALL matches are output to the file
    while(my $seqFile2 = $FastaFile2->next_seq() ) {
        $fasta->write_seq($seqFile2) if (grep {$_ eq $seqFile2->seq} @fasta_objs);
    }
    print "Comparison is complete! \n";

Thank you very much in advance for your help!
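For what it's worth, here is a minimal sketch of one way to approach both questions, not a definitive answer. The script above runs `grep` over all ~200 FILE1 sequences for every one of the 12 million FILE2 records; keying a hash on the sequence string turns that into a single `exists` lookup per record. Storing an array of headers as the hash value also gives the 'struct' of headers and sequences that was asked about. Several things here are assumptions and not part of the original script: the helper names `each_fasta_record`, `index_by_sequence` and `compare_fasta` are hypothetical, as is the second output file `MATCHED_FILE1_HEADERS.txt`; a plain-Perl FASTA reader is swapped in for Bio::SeqIO so the sketch has no dependencies; matching is exact and case-sensitive, as with `eq` in the original; and FILE1 headers are assumed to be unique.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: a plain-Perl stand-in for the Bio::SeqIO next_seq loop.
# Calls $callback->($header, $sequence) once per FASTA record in $path.
sub each_fasta_record {
    my ($path, $callback) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    my ($header, $seq);
    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;                 # strip the line ending
        if ($line =~ /^>(.*)/) {
            # A new header starts: hand over the previous record, if any.
            $callback->($header, $seq) if defined $header;
            ($header, $seq) = ($1, '');
        }
        elsif (defined $header) {
            $line =~ s/\s+//g;                # join wrapped sequence lines
            $seq .= $line;
        }
    }
    $callback->($header, $seq) if defined $header;   # last record
    close $in;
    return;
}

# Build the lookup table: sequence string => array ref of FILE1 headers.
sub index_by_sequence {
    my ($path) = @_;
    my %headers_for;
    each_fasta_record($path, sub {
        my ($header, $seq) = @_;
        push @{ $headers_for{$seq} }, $header;
    });
    return \%headers_for;
}

# Stream FILE2 once. Matching FILE2 records go to $hits_path; the FILE1
# headers that were matched, with a hit count, go to $matched_path.
sub compare_fasta {
    my ($file1, $file2, $hits_path, $matched_path) = @_;
    my $headers_for = index_by_sequence($file1);
    my %matched;                               # FILE1 header => number of hits

    open my $hits, '>', $hits_path or die "Cannot write $hits_path: $!";
    each_fasta_record($file2, sub {
        my ($header, $seq) = @_;
        return unless exists $headers_for->{$seq};   # one hash lookup per record
        print {$hits} ">$header\n$seq\n";
        $matched{$_}++ for @{ $headers_for->{$seq} };
    });
    close $hits or die "Cannot close $hits_path: $!";

    open my $out, '>', $matched_path or die "Cannot write $matched_path: $!";
    print {$out} "$_\t$matched{$_}\n" for sort keys %matched;
    close $out or die "Cannot close $matched_path: $!";
    return \%matched;
}

# Output file names here are assumptions; the second one is hypothetical.
compare_fasta(@ARGV[0, 1], 'EQUAL_HITS.fasta', 'MATCHED_FILE1_HEADERS.txt')
    if @ARGV == 2;
```

Run it the same way as the original, with FILE1 and FILE2 as the two arguments. Only the ~200 FILE1 sequences are held in memory, and FILE2 is read one record at a time. Note that this sketch writes each matching sequence on a single line, whereas Bio::SeqIO wraps sequence lines; the same hash lookup can be dropped into the original Bio::SeqIO loop in place of the `grep` if that output format matters.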
Re: Compare fasta files with different headers
by kcott (Archbishop) on Dec 01, 2010 at 02:48 UTC

Re: Compare fasta files with different headers
by BrowserUk (Patriarch) on Dec 01, 2010 at 03:21 UTC
by InfoSeeker (Novice) on Dec 07, 2010 at 02:11 UTC