nemo2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I have two files which needs to be compared. first file:
>Contig38567_g4_t1 >Contig38950_g53_t1 >Contig38717_g17_t1 >Contig38414_g32_t1
second file:
>Contig38567_g4_t1 agagagctctcttatatattatat >Contig38950_g53_t1 aacacacgtgttatatctctctctctct
First file has 1000 something names, the second file has 70000 sequences. I would like to collect the sequences from the second file which only match with the names of the first file. Im a newbie to perl. can anylone help me. I sincerely appreciate ur kind help. thanks!!

20080729 Janitored by Corion: Added formatting, code tags, as per Writeup Formatting Tips

Replies are listed 'Best First'.
Re: comparing two fasta files
by moritz (Cardinal) on Jul 28, 2008 at 15:50 UTC

    The zeroth step is to read perlintro, which introduces you to perl.

    The first step is to look at http://search.cpan.org/, there are some modules that deal with fasta files.

    The second step is to read about hashes.

    Then you start hacking. Good luck. If you have problems on your way, come back here and ask.

Re: comparing two fasta files
by rogerd (Sexton) on Jul 28, 2008 at 20:55 UTC

    Hi nemo2, I am a begginer too, but I want to suggest you some tips that will help you in this forum:

    - read the posts regarding the rules in posting, the formmating to tips, etc. If you follow the rules you will have more responses

    - second, you should explain your problem thinking that you are talking to computer people, and not to biologists. I believe that most of the people here doesn't know anything about the fasta format, or sequences, or gene names, and they don't care and they don't even have to know anything about biology to help you. So, a fasta header is just a line that starts with ">", and a sequence is just a string, or several lines below that line with the ">" symbol. Reflect the format of the text file in your post, so people know what you are talking about.

    And now, it is not a hard job what you have to do, but you need to know some of perl, like:

    - reading input files and writing to output files

    - using regular expressions to "capture the fasta headers and the corresponding sequences

    - using hashes to store the sequence_name and the sequence as pair of key-values and create a lookup table

    - and probably some more...

    A good book as an introduction of Perl for biologists, and how you can use Perl in your bioinformatic tasks is "Beggining Perl for bioinformatics", James Tisdall, ed. O'Reilly. There you can find what you need to start with perl in bioinformatics, using examples from biology

    I hope this can help you

Re: comparing two fasta files
by rogerd (Sexton) on Jul 29, 2008 at 12:56 UTC

    Hi again,

    I guess that your second file is a library, downloaded from the genebank or from a genome database. And you want to extract from that library some sequences that are of your interst (you got them as a result of some experiment, or blast search, or something). If I am correct, I suposse that you will be repeating the same work several times, searching for different sequences. I think that a good approach would be to create an indexed database for the second file (the library, which will not change), and then search that database by genename.

    You can use the Bio::DB::Fasta module (you can read the perldoc of that module, and there are more documentation in BioPerl), which will facilitate you the task, but it would be good that you understand what is going on. In pseudocode, it should be:

    use Bio::DB::Fasta get the name of the second_file(library) create the database (use the "new" function of Bio::DB::Fasta while read each line of first_file(your_seqs) get the name look for the corresponding sequence in the db (use a module functi +on)

    This is the basic way that I use to do. Maybe there are better ways, but in a few lines of code I have the work done, and it does it fast.

    Once you get the basics of Perl, the BioPerl modules will be very helpfull to you.

      Hi Rogerd, Ive written few lines of code after learning about perl and working out some simple progreams. I show my code, I dont know how to proceed further.Could you help? thank you.
      use Bio::DB::Fasta; my $db = new Bio::DB::Fasta("Library.fa"); open(INPUT1, "Library.fa") or die "Cannot open Hprot.fa"; open(INPUT2, "Name.fa") or die "Cannot open Hprot.fa"; open(output, ">result.fa"); my(@lines1) = <INPUT1>; my(@lines2) = <INPUT2>; while $lines1 eq $lines2 && =~ /^Contig*/
        I've never used Bio::DB::Fasta, but having read the man page for it just now, and looking again the OP task description, I think what you want is something like this:
        use strict; use Bio::DB::Fasta; my $db = new Bio::DB::Fasta( "Library.fa" ) # this should be the bigg +er file or die "Failed to create Fasta DB object on Library.fa\n"; open( IDLIST, "Name.fa" ); # this is the list of ID strings while ( <IDLIST> ) { my ($id) = (/^>(\S+)/); # capture the id string (without the init +ial ">") print ">$id\n", $db->seq( $id ), "\n"; }
        As for the code you've just posted, you didn't really try to use the DB::Fasta module, and you don't seem to understand the syntax for a while loop (you need parens around the conditional, and a block of code to be the body of the loop). As for doing a regex match:  /regex/ by itself will test the contents of $_, whereas $var =~ /regex/ tests the contents of $var; what you have,  blahblah && =~ /regex/ is a syntax error.

        Also, you need to be aware that "@line1" is a completely different thing (different variable with different storage) from "$line1". To refer to an element of an array, you need square brackets and an array index:  $line1[0] etc.

        Hi Nemo2, I was absent. You already have had some solutions. This is the way I used, using BioPerl.

        use strict; use Bio::DB::Fasta; my $database; my $fasta_library; my %records; # name of the library file - (here it is hardcoded) $fasta_library = 'Poptr1_FilteredModels_transcripts.fa'; # creates the database of the library, based on the file $database = Bio::DB::Fasta->new("$fasta_library"); # now, it parses the file with the fasta headers you want to get while (<STDIN>) { chomp (my $seq_id = $_); my $record = $database->get_Seq_by_id("$seq_id"); # incorporates the header and sequence to the hash # key = header ; value = sequence $records{$seq_id} = $record->seq(); } # print all the key-value pairs ( for (sort keys (%records)) { print ">$_\n$records{$_}\n"; } exit;
Re: comparing two fasta files
by graff (Chancellor) on Jul 29, 2008 at 05:24 UTC
    basic approach (in "pseudocode"):
    open the first file (the one with 1000-something names) while reading one record at a time save the "name" part of the line is as a "hash key" open the second file while reading one record at a time check for the "name" portion (e.g. use a regular expression to "capture" the name) if the name exists as a key in the hash, print this record
    That's it. You just need to work out how to read a record at a time, how identify and capture the "name" part of each record using a regex, and how to use a hash. The resources cited in the first reply will be helpful.

    Try it out, and if you have trouble with code you've tried, show us the code. You'll learn more that way.

      thanks a lot for your time and help. many thanks!!!