genbank has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I want to use Bio::Tools::Run::Tmhmm to analyze protein sequences. First, I need to feed it the FASTA file with Bio::SeqIO. However, the Bio::SeqIO module does not seem to work and reports the following error.

------------- EXCEPTION: Bio::Root::Exception -------------

MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>')

STACK: Error::throw

STACK: Bio::Root::Root::throw /localperl/lib/site_perl/5.12.4/Bio/Root/Root.pm:368

STACK: Bio::SeqIO::fasta::next_seq /localperl/lib/site_perl/5.12.4/Bio/SeqIO/fasta.pm:127

STACK: tmhmm.pl:6

-----------------------------------------------------------

The code is:

#!/localperl/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $inseq = Bio::SeqIO->new(
    -file   => '/home/zhaoy/document/perl/lwp/seq.txt',
    -format => 'FASTA',
) or die "can't open";

while (my $seq = $inseq->next_seq()) {
    print $seq->accession_number, "\n";
}

The error message says the sequence does not appear to be FASTA format (lacks a descriptor line '>'), but I am sure every sequence begins with ">". All the sequences look like the following one:

>gi|1786958|gb|AAC73831.1| membrane spanning protein in TolA-TolQ-TolR complex Escherichia coli str. K-12 substr. MG1655
MTDMNILDLFLKASLLVKLIMLILIGFSIASWAIIIQRTRILNAAAREAEAFEDKFWSGIELSRLYQESQ
GKRDNLTGSEQIFYSGFKEFVRLHRANSHAPEAVVEGASRAMRISMNRELENLETHIPFLGTVGSISPYI
GLFGTVWGIMHAFIALGAVKQATLQMVAPGIAEALIATAIGLFAAIPAVMAYNRLNQRVNKLELNYDNFM
EEFTAILHRQAFTVSESNKG

I don't know what is wrong with this code. Thanks in advance!

Replies are listed 'Best First'.
Re: A strange error message with Bio::SeqIO
by roboticus (Chancellor) on Aug 26, 2011 at 10:58 UTC

    genbank:

    You're sure that all sequences are fasta format, and it's sure that at least one isn't. You need to *prove* it one way or the other. I'd suggest printing out the line number or something as you process through the file so you can find out where in your file it's having trouble. Then double and/or triple check the input.
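One way to double-check the input is to look at the raw bytes at the very start of the file, since characters that are invisible in an editor (a byte-order mark, leading whitespace, stray control characters) would come before the '>'. The following is only a minimal sketch in plain Perl; the helper name `leading_bytes_hex` is made up for this illustration, and whether such bytes are actually the cause here is an assumption, not something the thread confirms.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first $count bytes of a file
# as a string of two-digit hex values separated by spaces.
sub leading_bytes_hex {
    my ($path, $count) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read $fh, my $buf, $count;
    die "Cannot read $path: $!" unless defined $got;
    close $fh;
    return join ' ', map { sprintf '%02X', ord } split //, $buf;
}

# Usage: perl firstbytes.pl seq.txt
print leading_bytes_hex($ARGV[0], 16), "\n" if @ARGV;
```

A file that really starts with a descriptor line shows 3E (the '>' character) as its first byte. Anything in front of it, for example EF BB BF, would be worth removing before trying the parser again.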

    Also, take the error message with a grain of salt: Sometimes an error isn't exactly what's claimed. For example, if a descriptor line requires four vertical bars, the record might not be recognized as a descriptor line and generate such a message.

    NOTE: I've never used the Bio::SeqIO module (nor any of the other bio modules), so take my advice with a grain of salt. I'm just speaking from experience with other things.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: A strange error message with Bio::SeqIO
by Khen1950fx (Canon) on Aug 26, 2011 at 11:44 UTC
    You were missing a few things. Here's the code as I think it should be:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Bio::SeqIO;

    my $in = Bio::SeqIO->new(
        -file   => "/root/Desktop/seqIOnew.fasta",
        -format => 'FASTA',
    );
    my $out = Bio::SeqIO->new(
        -file   => ">/root/Desktop/new_out.log",
        -format => "EMBL",
    );

    while ( my $seq = $in->next_seq() ) {
        print $out->write_seq($seq), "\n";
    }
    For the input, I took your sequence and put it in a file called seqIOnew.fasta. I tried to get the accession number, but it returned "unknown". Printing out the $seq, you'll see that there is no accession number, so that's why it came back unknown.
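If the goal is the accession from the descriptor line itself, it can be pulled out of the identifier with plain Perl. This is a sketch only: it assumes the NCBI-style `gi|number|database|accession|` layout shown in the question, and the function name `parse_ncbi_header` is invented for the example rather than taken from BioPerl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split an NCBI-style FASTA descriptor line into parts.
# Assumes the identifier looks like gi|<number>|<db>|<accession>|.
sub parse_ncbi_header {
    my ($header) = @_;
    $header =~ s/\A>//;                               # drop the leading '>'
    my ($id, $desc) = split /\s+/, $header, 2;        # identifier, then free text
    my (undef, $gi, $db, $acc) = split /\|/, $id;     # skip the literal 'gi'
    return { gi => $gi, db => $db, accession => $acc, description => $desc };
}

my $h = parse_ncbi_header('>gi|1786958|gb|AAC73831.1| membrane spanning protein');
print "$h->{accession}\n";    # prints AAC73831.1
```

Headers that do not follow this layout would leave some fields undefined, so a real script should check for that.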

    For the output, I used new_out.log on my desktop.

    Here's the corrected sequence:

    >gi|1786958|gb|AAC73831.1| membrane spanning protein in TolA-TolQ-TolR complex [Escherichia coli str. K-12 substr. MG1655]
    MTDMNILDLFLKASLLVKLIMLILIGFSIASWAIIIQRTRILNAAAREAEAFEDKFWSGIELSRLYQESQ
    GKRDNLTGSEQIFYSGFKEFVRLHRANSHAPEAVVEGASRAMRISMNRELENLETHIPFLGTVGSISPYI
    GLFGTVWGIMHAFIALGAVKQATLQMVAPGIAEALIATAIGLFAAIPAVMAYNRLNQRVNKLELNYDNFM
    EEFTAILHRQAFTVSESNKG
    And here's the output:
    ID   unknown; SV 1; linear; ; STD; UNC; 0 BP.
    XX
    AC   unknown;
    XX
    DE   membrane spanning protein in TolA-TolQ-TolR complex [Escherichia coli str.
    DE   K-12 substr. MG1655]
    DE   MTDMNILDLFLKASLLVKLIMLILIGFSIASWAIIIQRTRILNAAAREAEAFEDKFWSGIELSRLYQESQ
    DE   GKRDNLTGSEQIFYSGFKEFVRLHRANSHAPEAVVEGASRAMRISMNRELENLETHIPFLGTVGSISPYI
    DE   GLFGTVWGIMHAFIALGAVKQATLQMVAPGIAEALIATAIGLFAAIPAVMAYNRLNQRVNKLELNYDNFM
    DE   EEFTAILHRQAFTVSESNKG
    XX
    FH   Key             Location/Qualifiers
    FH
    XX
    //

      Hi Khen1950fx, thank you for your help. I copied your code and ran it on my computer (changing only the file paths). However, I get the same error message: "The sequence does not appear to be FASTA format (lacks a descriptor line '>')". The code I ran is as follows:

      #!/localperl/bin/perl
      use strict;
      use warnings;
      use Bio::SeqIO;

      my $in = Bio::SeqIO->new(
          -file   => "/home/zhaoy/document/perl/lwp/seq.fasta",
          -format => 'FASTA',
      );
      my $out = Bio::SeqIO->new(
          -file   => ">/home/zhaoy/document/perl/lwp/new_out.log",
          -format => "EMBL",
      );

      while ( my $seq = $in->next_seq() ) {
          print $out->write_seq($seq), "\n";
      }

      I receive the error message:

      ------------- EXCEPTION: Bio::Root::Exception -------------
      MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>')
      STACK: Error::throw
      STACK: Bio::Root::Root::throw /localperl/lib/site_perl/5.12.4/Bio/Root/Root.pm:368
      STACK: Bio::SeqIO::fasta::next_seq /localperl/lib/site_perl/5.12.4/Bio/SeqIO/fasta.pm:127
      STACK: tmhmm.pl:12
      -----------------------------------------------------------

      The same program gives different results. Maybe the problem is my BioPerl installation, but I am not sure.

        You probably didn't use the corrected sequence that I gave you :).

Re: A strange error message with Bio::SeqIO
by BrowserUk (Patriarch) on Aug 27, 2011 at 02:34 UTC

    Sounds like you have a corrupt fasta file. Try running this against it:

    perl -nlE"/^>/ or /^[A-Z]+$/ or say 'Bad FASTA at ', $." theFile.fasta
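For reference, the one-liner relies on -n to loop over the input lines, -l to strip the line ending, -E to enable `say`, and `$.` for the current line number. It accepts only lines that start with '>' or consist entirely of uppercase letters. The same check can be sketched as a small script that also makes invisible bytes visible; the name `suspicious_lines` and the `\xNN` display format are choices made for this illustration, not part of the original reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [line_number, printable_text] pairs for every
# line that is neither a '>' descriptor line nor a run of uppercase letters.
sub suspicious_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @bad;
    while (my $line = <$fh>) {
        $line =~ s/\n\z//;                  # strip LF only, so a stray CR stays visible
        next if $line =~ /\A>/;             # descriptor line
        next if $line =~ /\A[A-Z]+\z/;      # sequence line
        # Show non-printable bytes (BOM, CR, tabs, ...) as \xNN escapes.
        (my $shown = $line) =~ s/([^\x20-\x7E])/sprintf '\\x%02X', ord $1/ge;
        push @bad, [ $., $shown ];
    }
    close $fh;
    return @bad;
}

# Usage: perl checkfasta.pl theFile.fasta
if (@ARGV) {
    printf "Bad FASTA at line %d: %s\n", @$_ for suspicious_lines($ARGV[0]);
}
```

Like the one-liner, this reports blank lines and lowercase residues as bad, so a hit is a prompt to look at that line rather than proof that the file is corrupt.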

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.