comment on

The output file created by that script would be the "indexed" version of the original Library.fa file. It's like a copy of the original, but reorganized and with extra "key" information (much of which is binary -- involving byte offsets to pieces of data, etc). In effect, the db file acts as a "hash", but stored on disk instead of in memory. The hash is keyed by the ID strings.

The creation of the indexed "db" file is not the main point of the script, but rather just a side effect of using the Bio::DB::Fasta module to build an indexed version of a set of data.

The main point of the script is to read the second input file, and for each ID string in that file, fetch the corresponding sequence data from the Library (by looking up that ID string in the indexed db file), and print the ID string together with its sequence data. This is printed to the terminal screen, but you can use redirection on the command line to have it stored in a separate file. Or you can open an explicit output file and print to that instead, like so:

use strict;
use Bio::DB::Fasta;

my $db = new Bio::DB::Fasta( "Library.fa" )  # this should be the bigg
+er file
    or die "Failed to create Fasta DB object on Library.fa\n";

open( IDLIST, "Name.fa" );  # this is the list of ID strings
open( IDDATA, ">", "Name-sequences.fa" )
    or die "Error opening Name-sequences.fa for output: $!";

while ( <IDLIST> ) {
    my ($id) = (/^>(\S+)/);  # capture the id string (without the init
+ial ">")
    print IDDATA ">$id\n", $db->seq( $id ), "\n";
}
[download]

Of course, my initial suggestion (not using the Bio module) would work also, like this (not tested):

use strict;

my %falib;
my $idstr;

open( LIB, "Library.fa" );
while (<LIB>) {
    chomp;
    if ( /^>(\S+)/ ) {
        $idstr = $1;
    }
    elsif ( /^[acgt]+$/i ) {
        $falib{$idstr} .= uc();  # convert to upper-case
    }
}

open( IDLIST, "Name.fa" );  # this is the list of ID strings
open( IDDATA, ">", "Name-sequences.fa" )
    or die "Error opening Name-sequences.fa for output: $!";

while ( <IDLIST> ) {
    my ($id) = (/^>(\S+)/);  # capture the id string (without the init
+ial ">")
    print IDDATA ">$id\n", $falib{$id}, "\n" if ( exists( $falib{$id} 
+));
}
[download]

It's just that if your library file happens to hold more data than will fit in RAM as a hash variable, the script will be rather slow to run. But try it out.

In reply to Re^5: comparing two fasta files by graff
in thread comparing two fasta files by nemo2

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.