Re^4: comparing two fasta files

Replies are listed 'Best First'.
Re^5: comparing two fasta files by graff (Chancellor) on Aug 02, 2008 at 19:42 UTC
The output file created by that script would be the "indexed" version of the original Library.fa file. It's like a copy of the original, but reorganized and with extra "key" information (much of which is binary -- involving byte offsets to pieces of data, etc). In effect, the db file acts as a "hash", but stored on disk instead of in memory. The hash is keyed by the ID strings. The creation of the indexed "db" file is not the main point of the script, but rather just a side effect of using the Bio::DB::Fasta module to build an indexed version of a set of data. The main point of the script is to read the second input file, and for each ID string in that file, fetch the corresponding sequence data from the Library (by looking up that ID string in the indexed db file), and print the ID string together with its sequence data. This is printed to the terminal screen, but you can use redirection on the command line to have it stored in a separate file. Or you can open an explicit output file and print to that instead, like so: `use strict; use Bio::DB::Fasta; my $db = new Bio::DB::Fasta( "Library.fa" ) # this should be the bigg +er file or die "Failed to create Fasta DB object on Library.fa\n"; open( IDLIST, "Name.fa" ); # this is the list of ID strings open( IDDATA, ">", "Name-sequences.fa" ) or die "Error opening Name-sequences.fa for output: $!"; while ( <IDLIST> ) { my ($id) = (/^>(\S+)/); # capture the id string (without the init +ial ">") print IDDATA ">$id\n", $db->seq( $id ), "\n"; }` [download] Of course, my initial suggestion (not using the Bio module) would work also, like this (not tested): use strict; my %falib; my $idstr; open( LIB, "Library.fa" ); while (<LIB>) { chomp; if ( /^>(\S+)/ ) { $idstr = $1; } elsif ( /^[acgt]+$/i ) { $falib{$idstr} .= uc(); # convert to upper-case } } open( IDLIST, "Name.fa" ); # this is the list of ID strings open( IDDATA, ">", "Name-sequences.fa" ) or die "Error opening Name-sequences.fa for output: $!"; while ( <IDLIST> ) { my ($id) = (/^>(\S+)/); # capture the id string (without the init +ial ">") print IDDATA ">$id\n", $falib{$id}, "\n" if ( exists( $falib{$id} +)); } [download] It's just that if your library file happens to hold more data than will fit in RAM as a hash variable, the script will be rather slow to run. But try it out.	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^5: comparing two fasta files
by graff (Chancellor) on Aug 02, 2008 at 19:42 UTC

The creation of the indexed "db" file is not the main point of the script, but rather just a side effect of using the Bio::DB::Fasta module to build an indexed version of a set of data.

The main point of the script is to read the second input file, and for each ID string in that file, fetch the corresponding sequence data from the Library (by looking up that ID string in the indexed db file), and print the ID string together with its sequence data. This is printed to the terminal screen, but you can use redirection on the command line to have it stored in a separate file. Or you can open an explicit output file and print to that instead, like so:

use strict;
use Bio::DB::Fasta;

my $db = new Bio::DB::Fasta( "Library.fa" )  # this should be the bigg
+er file
    or die "Failed to create Fasta DB object on Library.fa\n";

open( IDLIST, "Name.fa" );  # this is the list of ID strings
open( IDDATA, ">", "Name-sequences.fa" )
    or die "Error opening Name-sequences.fa for output: $!";

while ( <IDLIST> ) {
    my ($id) = (/^>(\S+)/);  # capture the id string (without the init
+ial ">")
    print IDDATA ">$id\n", $db->seq( $id ), "\n";
}
[download]

my initial suggestion

use strict;

my %falib;
my $idstr;

open( LIB, "Library.fa" );
while (<LIB>) {
    chomp;
    if ( /^>(\S+)/ ) {
        $idstr = $1;
    }
    elsif ( /^[acgt]+$/i ) {
        $falib{$idstr} .= uc();  # convert to upper-case
    }
}

open( IDLIST, "Name.fa" );  # this is the list of ID strings
open( IDDATA, ">", "Name-sequences.fa" )
    or die "Error opening Name-sequences.fa for output: $!";

while ( <IDLIST> ) {
    my ($id) = (/^>(\S+)/);  # capture the id string (without the init
+ial ">")
    print IDDATA ">$id\n", $falib{$id}, "\n" if ( exists( $falib{$id} 
+));
}
[download]

[reply]
[d/l]
[select]