Pasan has asked for the wisdom of the Perl Monks concerning the following question:

hi guys! I am a newbie to perl and I need help with finding list of fasta seuences which has specefic 3 letter code in the header. The question is explained below Scenario: I have a fasta file which contains list of sequences as follows (test.fa)

>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p

UGAGGUAGUAGGUUGUAUAGUU

>cfa-miR-761 MIMAT0009936 Canis familiaris miR-761

GCAGCAGGGUGAAACUGACACA

>cfa-miR-764 MIMAT0009937 Canis familiaris miR-764

GGUGCUCACUUGUCCUCCU

>lus-miR167c MIMAT0027158 Linum usitatissimum miR167c

UGAAGCUGCCAGCAUGAUCUA

I have set of organism codes in a separate file (codes.txt)

lus

cel

cfa....

what I want to do is search through test.fa for the codes and only print out the sequences which has that particular code in the header The example above contains only few sequences (not the entire file) So far I managed to create a hash and store the headers into keys and sequences into values. And I read through codes.txt and stored the codes in an array

The problem is when I use 'if exists' function to find whether each code in array exists in the hash here is my code,

#!/usr/bin/perl use warnings; use Bio::Perl; use Bio::Seq; use Bio::SeqIO; # Reading the first file and store it into a hash #the sequence header is stored in the hash key and sequence is stored +in the value my $FastaFile1 = Bio::SeqIO->new(-file => "test.fa", -format => 'fasta +', -alphabet => 'dna') or die "Failed to create SeqIO object from \n" +; my %fastaH1 =(); while( my $seqFile1 = $FastaFile1->next_seq() ) { unless (exists $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc +}) { $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc} = $seqFil +e1->seq; #key of the hash is fasta header (all line) and value is seq +uence. } } # printing the fasta headers print "stored fasta headers:\n"; foreach my $key (keys %fastaH1){ print "$key\n"; } # reading the codes.txt file and creating the array open my $file, '<', "codes.txt"; chomp(my @lines = <$file>); close $file; print "stored organism codes\n"; foreach (@lines) { print " $_\n"; } # if each code is found on the hash print match is found foreach my $line (@lines) { chomp $line; if( exists $fastaH1{$line} ){ print "match found\n"; } }

The code is only printing "match found" if a match is found. But I need to write the sequences that has the 3 letter codes in a separate file. However, it doesn't seem to compare the codes and the fasta headers using exists function. Any help will be appreciated

Replies are listed 'Best First'.
Re: comparing sequence fasta headers
by Lennotoecom (Pilgrim) on Sep 16, 2014 at 19:05 UTC
    use Inline::Files; undef $h{$_} while chomp($_ = <CODES>); while(<FASTA>){ print if /(^\w{3})/ and exists $h{$1}; } __CODES__ lus cel cfa __FASTA__ where do you study your perl? cel-let-7-5-blah blah blah cfa-miR-761-blah blah blah cfa-miR-764-blah blah blah lus-miR167-blah blah blah other-blah blah- 123
Re: comparing sequence fasta headers
by biohisham (Priest) on Sep 17, 2014 at 13:24 UTC

    Hello Pasan and welcome to the Monastery.

    Of course your code will print "match found" because that is what you are telling it to do. Dump the content of $fastaH1 through Data::Dumper to see how your hash looks like. My piece of advice is that if you are reading from a Bio::SeqIO object then you got to write your output through a Bio::SeqIO object as well. That way you will save yourself from a lot of the unexpected Bio::* behaviors.

    other things I noticed in your code is that your open statement doesn't capture failure. Alsoe, if you're using a Bio::SeqIO then you don't need to explicitly import all the parent modules.

    use strict; #a good habit to start your prog use warnings; use Bio::SeqIO; #load your codes.txt into an array open(my $fh, "<","codes.txt") or die("could not open codes.txt $!"); # +a more proper way to opening files my @codes=<$fh>; #slurping codes.txt into an array my $obj=Bio::SeqIO->new( #Bio::SeqIO for input -file=>"fasta.fa", -format=>"fasta" ); my $out_obj=Bio::SeqIO->new( #Bio::SeqIO for output -file=>">matching.fa", -format=>"fasta" ); while(my $seq=$obj->next_seq){ foreach my $element (@codes){ chomp $element; if($seq->id=~/$element*?/gi){ #regular-expression based mat +ching $out_obj->write_seq($seq); #output written to matching.fa } } }

    David R. Gergen said "We know that second terms have historically been marred by hubris and by scandal." and I am a two y.o. monk today :D, June,12th, 2011...