Oh thy masters of Perl wisdom, please enlighten me. I am struggling with some problems reading Genebank files (.gbff) through Bioperl.

I am trying to extract CDS and translation sequences using $feat->spliced_seq->seq and $feat->get_tag_values("translation")). My problem is that many of the genebank files are incomplete or are not matching the "correct" (example) format:

LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1 +999 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and +Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. ... JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale Univer +sity, New Haven, CT, USA FEATURES Location/Qualifiers source 1..5028 /organism="Saccharomyces cerevisiae" /db_xref="taxon:4932" /chromosome="IX" /map="9" CDS <1..206 /codon_start=3 /product="TCP1-beta" /protein_id="AAA98665.1" /db_xref="GI:1293614" /translation="SSIYN...RTANRQHM" gene 687..3158 /gene="AXL2" CDS 687..3158 /gene="AXL2" /note="plasma membrane glycoprotein" /codon_start=1 /function="required for axial budding pattern of +S. cerevisiae" /product="Axl2p" /protein_id="AAA98666.1" /db_xref="GI:1293615" /translation="MTQL...GRIPEML" gene complement(3300..4037) /gene="REV7" CDS complement(3300..4037) /gene="REV7" /codon_start=1 /product="Rev7p" /protein_id="AAA98667.1" /db_xref="GI:1293616" /translation="MNRWVEK...IFGSLF" ORIGIN 1 gatcctccat ata...tgatc //

If all files were in the correct format, it would be relatively straightforward to extract FASTA files with each gene or protein in the format

>FEAT | FEAT | FEAT | FEAT agcgtgacgattvcatgt...catgcatgcag

and

>FEAT | FEAT | FEAT | FEAT AYTKRCL...MRTCKYS

Many of the files that I have either do not have the "Origin" field at the bottom (example), or have multiple "Origin" fields (example), each just after a "CDS" field, resulting in warnings and die errors that prevent me from doing what I need to do. Most of the warnings indicate that Bioperl hasn't been able to infer the sequence (because they are lacking that "ORIGIN" field)

So my questions are the following:

1. Could you give me any tips on how I can find which of the files have this incorrect file format? I am figuring that a if($feat->spliced_seq->seq) fails, push those filenames to a list and manually download them again :( But I haven't been able to test this correctly yet, and maybe there is something already in Bioperl for these cases?

2. How can I prevent the automatic die everytime a warning comes out, so that I can find the whole list of files that is not designed as it should? Curiously, through the ~1000 files that I am running, the script runs for a few hundreds, outputing those errors but quits at some point. I must say that I have use autodie; in the preamble, but I think the die command is being given by Bioperl.

Note: this was now (27 Oct) crossposted on Biostars in here


In reply to Debugging Bioperl warnings for Genebank files that are missing info by Sosi

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.