Sosi has asked for the wisdom of the Perl Monks concerning the following question:
Oh thy masters of Perl wisdom, please enlighten me. I am struggling with some problems reading Genebank files (.gbff) through Bioperl.
I am trying to extract CDS and translation sequences using $feat->spliced_seq->seq and $feat->get_tag_values("translation")). My problem is that many of the genebank files are incomplete or are not matching the "correct" (example) format:
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1 +999 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and +Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. ... JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale Univer +sity, New Haven, CT, USA FEATURES Location/Qualifiers source 1..5028 /organism="Saccharomyces cerevisiae" /db_xref="taxon:4932" /chromosome="IX" /map="9" CDS <1..206 /codon_start=3 /product="TCP1-beta" /protein_id="AAA98665.1" /db_xref="GI:1293614" /translation="SSIYN...RTANRQHM" gene 687..3158 /gene="AXL2" CDS 687..3158 /gene="AXL2" /note="plasma membrane glycoprotein" /codon_start=1 /function="required for axial budding pattern of +S. cerevisiae" /product="Axl2p" /protein_id="AAA98666.1" /db_xref="GI:1293615" /translation="MTQL...GRIPEML" gene complement(3300..4037) /gene="REV7" CDS complement(3300..4037) /gene="REV7" /codon_start=1 /product="Rev7p" /protein_id="AAA98667.1" /db_xref="GI:1293616" /translation="MNRWVEK...IFGSLF" ORIGIN 1 gatcctccat ata...tgatc //
If all files were in the correct format, it would be relatively straightforward to extract FASTA files with each gene or protein in the format
>FEAT | FEAT | FEAT | FEAT agcgtgacgattvcatgt...catgcatgcag
and
>FEAT | FEAT | FEAT | FEAT AYTKRCL...MRTCKYS
Many of the files that I have either do not have the "Origin" field at the bottom (example), or have multiple "Origin" fields (example), each just after a "CDS" field, resulting in warnings and die errors that prevent me from doing what I need to do. Most of the warnings indicate that Bioperl hasn't been able to infer the sequence (because they are lacking that "ORIGIN" field)
So my questions are the following:
1. Could you give me any tips on how I can find which of the files have this incorrect file format? I am figuring that a if($feat->spliced_seq->seq) fails, push those filenames to a list and manually download them again :( But I haven't been able to test this correctly yet, and maybe there is something already in Bioperl for these cases?
2. How can I prevent the automatic die everytime a warning comes out, so that I can find the whole list of files that is not designed as it should? Curiously, through the ~1000 files that I am running, the script runs for a few hundreds, outputing those errors but quits at some point. I must say that I have use autodie; in the preamble, but I think the die command is being given by Bioperl.
Note: this was now (27 Oct) crossposted on Biostars in here
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Debugging Bioperl warnings for Genebank files that are missing info
by biohisham (Priest) on Oct 24, 2014 at 00:43 UTC | |
by Sosi (Sexton) on Oct 24, 2014 at 14:27 UTC | |
by erix (Prior) on Oct 25, 2014 at 14:33 UTC | |
by Sosi (Sexton) on Oct 27, 2014 at 10:30 UTC | |
by Anonymous Monk on Oct 25, 2014 at 13:43 UTC |