I've never used Bio::DB::Fasta, but having read the man page for it just now, and looking again the OP task description, I think what you want is something like this:
use strict;
use Bio::DB::Fasta;
my $db = new Bio::DB::Fasta( "Library.fa" ) # this should be the bigg
+er file
or die "Failed to create Fasta DB object on Library.fa\n";
open( IDLIST, "Name.fa" ); # this is the list of ID strings
while ( <IDLIST> ) {
my ($id) = (/^>(\S+)/); # capture the id string (without the init
+ial ">")
print ">$id\n", $db->seq( $id ), "\n";
}
As for the code you've just posted, you didn't really try to use the DB::Fasta module, and you don't seem to understand the syntax for a while loop (you need parens around the conditional, and a block of code to be the body of the loop). As for doing a regex match: /regex/ by itself will test the contents of $_, whereas $var =~ /regex/ tests the contents of $var; what you have, blahblah && =~ /regex/ is a syntax error.
Also, you need to be aware that "@line1" is a completely different thing (different variable with different storage) from "$line1". To refer to an element of an array, you need square brackets and an array index: $line1[0] etc. | [reply] [d/l] [select] |
hello, thanks for the code and comments. I have try to run the code, it creates a ouput file .fa.dir with lots of special characters and I couldnt read it. I dont know whats going wrong. Kindly suggest the solution.
thanks!
| [reply] |
The output file created by that script would be the "indexed" version of the original Library.fa file. It's like a copy of the original, but reorganized and with extra "key" information (much of which is binary -- involving byte offsets to pieces of data, etc). In effect, the db file acts as a "hash", but stored on disk instead of in memory. The hash is keyed by the ID strings.
The creation of the indexed "db" file is not the main point of the script, but rather just a side effect of using the Bio::DB::Fasta module to build an indexed version of a set of data.
The main point of the script is to read the second input file, and for each ID string in that file, fetch the corresponding sequence data from the Library (by looking up that ID string in the indexed db file), and print the ID string together with its sequence data. This is printed to the terminal screen, but you can use redirection on the command line to have it stored in a separate file. Or you can open an explicit output file and print to that instead, like so:
use strict;
use Bio::DB::Fasta;
my $db = new Bio::DB::Fasta( "Library.fa" ) # this should be the bigg
+er file
or die "Failed to create Fasta DB object on Library.fa\n";
open( IDLIST, "Name.fa" ); # this is the list of ID strings
open( IDDATA, ">", "Name-sequences.fa" )
or die "Error opening Name-sequences.fa for output: $!";
while ( <IDLIST> ) {
my ($id) = (/^>(\S+)/); # capture the id string (without the init
+ial ">")
print IDDATA ">$id\n", $db->seq( $id ), "\n";
}
Of course, my initial suggestion (not using the Bio module) would work also, like this (not tested):
use strict;
my %falib;
my $idstr;
open( LIB, "Library.fa" );
while (<LIB>) {
chomp;
if ( /^>(\S+)/ ) {
$idstr = $1;
}
elsif ( /^[acgt]+$/i ) {
$falib{$idstr} .= uc(); # convert to upper-case
}
}
open( IDLIST, "Name.fa" ); # this is the list of ID strings
open( IDDATA, ">", "Name-sequences.fa" )
or die "Error opening Name-sequences.fa for output: $!";
while ( <IDLIST> ) {
my ($id) = (/^>(\S+)/); # capture the id string (without the init
+ial ">")
print IDDATA ">$id\n", $falib{$id}, "\n" if ( exists( $falib{$id}
+));
}
It's just that if your library file happens to hold more data than will fit in RAM as a hash variable, the script will be rather slow to run. But try it out. | [reply] [d/l] [select] |
Hi, thanks for the code. But I have tried to run this, its creates Library.fa.index with lots of special characters unreadable sequences.
| [reply] |
#!/usr/bin/perl
use strict;
use Bio::DB::Fasta;
my $database;
my $fasta_library;
my %records;
open IDFILE, "NBS_LRR_IDs.txt" or die $!;
open OUTPUT, ">NC_003070_NBS-LRR.faa" or die $!;
# name of the library file - (here it is hardcoded)
$fasta_library = 'NC_003070_all_chr.faa';
# creates the database of the library, based on the file
$database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to cr
+eate Fasta DP object on fasta library\n";
# now, it parses the file with the fasta headers you want to get
while (<IDFILE>) {
my ($id) = (/^>*(\S+)/); # capture the id string (without the i
+nitial ">")
my $header = $database->header($id);
#print "$header\n";
print ">$header\n", $database->seq( $id ), "\n";
print OUTPUT ">$header\n", $database->seq( $id ), "\n";
}
exit;
| [reply] [d/l] |
Hi Nemo2, I was absent. You already have had some solutions. This is the way I used, using BioPerl.
use strict;
use Bio::DB::Fasta;
my $database;
my $fasta_library;
my %records;
# name of the library file - (here it is hardcoded)
$fasta_library = 'Poptr1_FilteredModels_transcripts.fa';
# creates the database of the library, based on the file
$database = Bio::DB::Fasta->new("$fasta_library");
# now, it parses the file with the fasta headers you want to get
while (<STDIN>) {
chomp (my $seq_id = $_);
my $record = $database->get_Seq_by_id("$seq_id");
# incorporates the header and sequence to the hash
# key = header ; value = sequence
$records{$seq_id} = $record->seq();
}
# print all the key-value pairs (
for (sort keys (%records)) {
print ">$_\n$records{$_}\n";
}
exit;
| [reply] [d/l] |