Re^2: comparing two fasta files

Replies are listed 'Best First'.
Re^3: comparing two fasta files by graff (Chancellor) on Aug 02, 2008 at 03:01 UTC
I've never used Bio::DB::Fasta, but having read the man page for it just now, and looking again the OP task description, I think what you want is something like this: `use strict; use Bio::DB::Fasta; my $db = new Bio::DB::Fasta( "Library.fa" ) # this should be the bigg +er file or die "Failed to create Fasta DB object on Library.fa\n"; open( IDLIST, "Name.fa" ); # this is the list of ID strings while ( <IDLIST> ) { my ($id) = (/^>(\S+)/); # capture the id string (without the init +ial ">") print ">$id\n", $db->seq( $id ), "\n"; }` [download] As for the code you've just posted, you didn't really try to use the DB::Fasta module, and you don't seem to understand the syntax for a while loop (you need parens around the conditional, and a block of code to be the body of the loop). As for doing a regex match: `/regex/` by itself will test the contents of $_, whereas `$var =~ /regex/` tests the contents of $var; what you have, `blahblah && =~ /regex/` is a syntax error. Also, you need to be aware that "@line1" is a completely different thing (different variable with different storage) from "$line1". To refer to an element of an array, you need square brackets and an array index: `$line1[0]` etc.	[reply] [d/l] [select]
Re^4: comparing two fasta files by nemo2 (Initiate) on Aug 02, 2008 at 14:51 UTC
hello, thanks for the code and comments. I have try to run the code, it creates a ouput file .fa.dir with lots of special characters and I couldnt read it. I dont know whats going wrong. Kindly suggest the solution. thanks!	[reply]
Re^5: comparing two fasta files by graff (Chancellor) on Aug 02, 2008 at 19:42 UTC
The output file created by that script would be the "indexed" version of the original Library.fa file. It's like a copy of the original, but reorganized and with extra "key" information (much of which is binary -- involving byte offsets to pieces of data, etc). In effect, the db file acts as a "hash", but stored on disk instead of in memory. The hash is keyed by the ID strings. The creation of the indexed "db" file is not the main point of the script, but rather just a side effect of using the Bio::DB::Fasta module to build an indexed version of a set of data. The main point of the script is to read the second input file, and for each ID string in that file, fetch the corresponding sequence data from the Library (by looking up that ID string in the indexed db file), and print the ID string together with its sequence data. This is printed to the terminal screen, but you can use redirection on the command line to have it stored in a separate file. Or you can open an explicit output file and print to that instead, like so: `use strict; use Bio::DB::Fasta; my $db = new Bio::DB::Fasta( "Library.fa" ) # this should be the bigg +er file or die "Failed to create Fasta DB object on Library.fa\n"; open( IDLIST, "Name.fa" ); # this is the list of ID strings open( IDDATA, ">", "Name-sequences.fa" ) or die "Error opening Name-sequences.fa for output: $!"; while ( <IDLIST> ) { my ($id) = (/^>(\S+)/); # capture the id string (without the init +ial ">") print IDDATA ">$id\n", $db->seq( $id ), "\n"; }` [download] Of course, my initial suggestion (not using the Bio module) would work also, like this (not tested): use strict; my %falib; my $idstr; open( LIB, "Library.fa" ); while (<LIB>) { chomp; if ( /^>(\S+)/ ) { $idstr = $1; } elsif ( /^[acgt]+$/i ) { $falib{$idstr} .= uc(); # convert to upper-case } } open( IDLIST, "Name.fa" ); # this is the list of ID strings open( IDDATA, ">", "Name-sequences.fa" ) or die "Error opening Name-sequences.fa for output: $!"; while ( <IDLIST> ) { my ($id) = (/^>(\S+)/); # capture the id string (without the init +ial ">") print IDDATA ">$id\n", $falib{$id}, "\n" if ( exists( $falib{$id} +)); } [download] It's just that if your library file happens to hold more data than will fit in RAM as a hash variable, the script will be rather slow to run. But try it out.	[reply] [d/l] [select]
Re^4: comparing two fasta files by nemo2 (Initiate) on Aug 02, 2008 at 08:34 UTC
Hi, thanks for the code. But I have tried to run this, its creates Library.fa.index with lots of special characters unreadable sequences.	[reply]
Re^4: comparing two fasta files by sesemin (Beadle) on Jul 13, 2010 at 18:18 UTC
I changed the code in a way that prints the entire header. The thing that happens is that, if your id is long and you need to have the entire information by printing $id you may lose the informative part from any space onward. #!/usr/bin/perl use strict; use Bio::DB::Fasta; my $database; my $fasta_library; my %records; open IDFILE, "NBS_LRR_IDs.txt" or die $!; open OUTPUT, ">NC_003070_NBS-LRR.faa" or die $!; # name of the library file - (here it is hardcoded) $fasta_library = 'NC_003070_all_chr.faa'; # creates the database of the library, based on the file $database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to cr +eate Fasta DP object on fasta library\n"; # now, it parses the file with the fasta headers you want to get while (<IDFILE>) { my ($id) = (/^>*(\S+)/); # capture the id string (without the i +nitial ">") my $header = $database->header($id); #print "$header\n"; print ">$header\n", $database->seq( $id ), "\n"; print OUTPUT ">$header\n", $database->seq( $id ), "\n"; } exit; [download]	[reply] [d/l]
Re^3: comparing two fasta files by rogerd (Sexton) on Aug 11, 2008 at 21:56 UTC
Hi Nemo2, I was absent. You already have had some solutions. This is the way I used, using BioPerl. use strict; use Bio::DB::Fasta; my $database; my $fasta_library; my %records; # name of the library file - (here it is hardcoded) $fasta_library = 'Poptr1_FilteredModels_transcripts.fa'; # creates the database of the library, based on the file $database = Bio::DB::Fasta->new("$fasta_library"); # now, it parses the file with the fasta headers you want to get while (<STDIN>) { chomp (my $seq_id = $_); my $record = $database->get_Seq_by_id("$seq_id"); # incorporates the header and sequence to the hash # key = header ; value = sequence $records{$seq_id} = $record->seq(); } # print all the key-value pairs ( for (sort keys (%records)) { print ">$_\n$records{$_}\n"; } exit; [download]	[reply] [d/l]