in reply to comparing two fasta files

Hi again,

I guess that your second file is a library, downloaded from the genebank or from a genome database. And you want to extract from that library some sequences that are of your interst (you got them as a result of some experiment, or blast search, or something). If I am correct, I suposse that you will be repeating the same work several times, searching for different sequences. I think that a good approach would be to create an indexed database for the second file (the library, which will not change), and then search that database by genename.

You can use the Bio::DB::Fasta module (you can read the perldoc of that module, and there are more documentation in BioPerl), which will facilitate you the task, but it would be good that you understand what is going on. In pseudocode, it should be:

use Bio::DB::Fasta get the name of the second_file(library) create the database (use the "new" function of Bio::DB::Fasta while read each line of first_file(your_seqs) get the name look for the corresponding sequence in the db (use a module functi +on)

This is the basic way that I use to do. Maybe there are better ways, but in a few lines of code I have the work done, and it does it fast.

Once you get the basics of Perl, the BioPerl modules will be very helpfull to you.

Replies are listed 'Best First'.
Re^2: comparing two fasta files
by nemo2 (Initiate) on Aug 01, 2008 at 14:58 UTC
    Hi Rogerd, Ive written few lines of code after learning about perl and working out some simple progreams. I show my code, I dont know how to proceed further.Could you help? thank you.
    use Bio::DB::Fasta; my $db = new Bio::DB::Fasta("Library.fa"); open(INPUT1, "Library.fa") or die "Cannot open Hprot.fa"; open(INPUT2, "Name.fa") or die "Cannot open Hprot.fa"; open(output, ">result.fa"); my(@lines1) = <INPUT1>; my(@lines2) = <INPUT2>; while $lines1 eq $lines2 && =~ /^Contig*/
      I've never used Bio::DB::Fasta, but having read the man page for it just now, and looking again the OP task description, I think what you want is something like this:
      use strict; use Bio::DB::Fasta; my $db = new Bio::DB::Fasta( "Library.fa" ) # this should be the bigg +er file or die "Failed to create Fasta DB object on Library.fa\n"; open( IDLIST, "Name.fa" ); # this is the list of ID strings while ( <IDLIST> ) { my ($id) = (/^>(\S+)/); # capture the id string (without the init +ial ">") print ">$id\n", $db->seq( $id ), "\n"; }
      As for the code you've just posted, you didn't really try to use the DB::Fasta module, and you don't seem to understand the syntax for a while loop (you need parens around the conditional, and a block of code to be the body of the loop). As for doing a regex match:  /regex/ by itself will test the contents of $_, whereas $var =~ /regex/ tests the contents of $var; what you have,  blahblah && =~ /regex/ is a syntax error.

      Also, you need to be aware that "@line1" is a completely different thing (different variable with different storage) from "$line1". To refer to an element of an array, you need square brackets and an array index:  $line1[0] etc.

        hello, thanks for the code and comments. I have try to run the code, it creates a ouput file .fa.dir with lots of special characters and I couldnt read it. I dont know whats going wrong. Kindly suggest the solution. thanks!
        Hi, thanks for the code. But I have tried to run this, its creates Library.fa.index with lots of special characters unreadable sequences.

        I changed the code in a way that prints the entire header. The thing that happens is that, if your id is long and you need to have the entire information by printing $id you may lose the informative part from any space onward.

        #!/usr/bin/perl use strict; use Bio::DB::Fasta; my $database; my $fasta_library; my %records; open IDFILE, "NBS_LRR_IDs.txt" or die $!; open OUTPUT, ">NC_003070_NBS-LRR.faa" or die $!; # name of the library file - (here it is hardcoded) $fasta_library = 'NC_003070_all_chr.faa'; # creates the database of the library, based on the file $database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to cr +eate Fasta DP object on fasta library\n"; # now, it parses the file with the fasta headers you want to get while (<IDFILE>) { my ($id) = (/^>*(\S+)/); # capture the id string (without the i +nitial ">") my $header = $database->header($id); #print "$header\n"; print ">$header\n", $database->seq( $id ), "\n"; print OUTPUT ">$header\n", $database->seq( $id ), "\n"; } exit;

      Hi Nemo2, I was absent. You already have had some solutions. This is the way I used, using BioPerl.

      use strict; use Bio::DB::Fasta; my $database; my $fasta_library; my %records; # name of the library file - (here it is hardcoded) $fasta_library = 'Poptr1_FilteredModels_transcripts.fa'; # creates the database of the library, based on the file $database = Bio::DB::Fasta->new("$fasta_library"); # now, it parses the file with the fasta headers you want to get while (<STDIN>) { chomp (my $seq_id = $_); my $record = $database->get_Seq_by_id("$seq_id"); # incorporates the header and sequence to the hash # key = header ; value = sequence $records{$seq_id} = $record->seq(); } # print all the key-value pairs ( for (sort keys (%records)) { print ">$_\n$records{$_}\n"; } exit;