Pasan has asked for the wisdom of the Perl Monks concerning the following question:
hi guys! I am a newbie to perl and I need help with finding list of fasta seuences which has specefic 3 letter code in the header. The question is explained below Scenario: I have a fasta file which contains list of sequences as follows (test.fa)
>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>cfa-miR-761 MIMAT0009936 Canis familiaris miR-761
GCAGCAGGGUGAAACUGACACA
>cfa-miR-764 MIMAT0009937 Canis familiaris miR-764
GGUGCUCACUUGUCCUCCU
>lus-miR167c MIMAT0027158 Linum usitatissimum miR167c
UGAAGCUGCCAGCAUGAUCUA
I have set of organism codes in a separate file (codes.txt)
lus
cel
cfa....
what I want to do is search through test.fa for the codes and only print out the sequences which has that particular code in the header The example above contains only few sequences (not the entire file) So far I managed to create a hash and store the headers into keys and sequences into values. And I read through codes.txt and stored the codes in an array
The problem is when I use 'if exists' function to find whether each code in array exists in the hash here is my code,
#!/usr/bin/perl use warnings; use Bio::Perl; use Bio::Seq; use Bio::SeqIO; # Reading the first file and store it into a hash #the sequence header is stored in the hash key and sequence is stored +in the value my $FastaFile1 = Bio::SeqIO->new(-file => "test.fa", -format => 'fasta +', -alphabet => 'dna') or die "Failed to create SeqIO object from \n" +; my %fastaH1 =(); while( my $seqFile1 = $FastaFile1->next_seq() ) { unless (exists $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc +}) { $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc} = $seqFil +e1->seq; #key of the hash is fasta header (all line) and value is seq +uence. } } # printing the fasta headers print "stored fasta headers:\n"; foreach my $key (keys %fastaH1){ print "$key\n"; } # reading the codes.txt file and creating the array open my $file, '<', "codes.txt"; chomp(my @lines = <$file>); close $file; print "stored organism codes\n"; foreach (@lines) { print " $_\n"; } # if each code is found on the hash print match is found foreach my $line (@lines) { chomp $line; if( exists $fastaH1{$line} ){ print "match found\n"; } }
The code is only printing "match found" if a match is found. But I need to write the sequences that has the 3 letter codes in a separate file. However, it doesn't seem to compare the codes and the fasta headers using exists function. Any help will be appreciated
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: comparing sequence fasta headers
by Lennotoecom (Pilgrim) on Sep 16, 2014 at 19:05 UTC | |
Re: comparing sequence fasta headers
by biohisham (Priest) on Sep 17, 2014 at 13:24 UTC |