Debugging Bioperl warnings for Genebank files that are missing info

Sosi has asked for the wisdom of the Perl Monks concerning the following question:

Oh thy masters of Perl wisdom, please enlighten me. I am struggling with some problems reading Genebank files (.gbff) through Bioperl.

I am trying to extract CDS and translation sequences using $feat->spliced_seq->seq and $feat->get_tag_values("translation")). My problem is that many of the genebank files are incomplete or are not matching the "correct" (example) format:

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1
+999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and 
+Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
...
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale Univer
+sity, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYN...RTANRQHM"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of 
+S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQL...GRIPEML"
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEK...IFGSLF"
ORIGIN
        1 gatcctccat ata...tgatc
//
[download]

If all files were in the correct format, it would be relatively straightforward to extract FASTA files with each gene or protein in the format

>FEAT | FEAT | FEAT | FEAT
agcgtgacgattvcatgt...catgcatgcag
[download]

and

>FEAT | FEAT | FEAT | FEAT
AYTKRCL...MRTCKYS
[download]

Many of the files that I have either do not have the "Origin" field at the bottom (example), or have multiple "Origin" fields (example), each just after a "CDS" field, resulting in warnings and die errors that prevent me from doing what I need to do. Most of the warnings indicate that Bioperl hasn't been able to infer the sequence (because they are lacking that "ORIGIN" field)

So my questions are the following:

1. Could you give me any tips on how I can find which of the files have this incorrect file format? I am figuring that a if($feat->spliced_seq->seq) fails, push those filenames to a list and manually download them again :( But I haven't been able to test this correctly yet, and maybe there is something already in Bioperl for these cases?

2. How can I prevent the automatic die everytime a warning comes out, so that I can find the whole list of files that is not designed as it should? Curiously, through the ~1000 files that I am running, the script runs for a few hundreds, outputing those errors but quits at some point. I must say that I have use autodie; in the preamble, but I think the die command is being given by Bioperl.

Note: this was now (27 Oct) crossposted on Biostars in here

Comment on Debugging Bioperl warnings for Genebank files that are missing info Select or Download Code

Replies are listed 'Best First'.
Re: Debugging Bioperl warnings for Genebank files that are missing info by biohisham (Priest) on Oct 24, 2014 at 00:43 UTC
Yes at times the genbank files can be problematic in that they are incomplete or that BioPerl gets cranky, you have not provided a code that I can test but if you may consider the following workaround, work with the fasta files in conjunction with the feature table provided in the genbank files convert the genbank to gff through (genbank2gff3.pl) convert the genbank files to fasta or download the fasta equivalent parse the gff files and extract the CDs with their coordinate information `perl -F'\t' -lane 'if($F[2] eq "CDS"){print}' GCA_000153565.1_ASM15356v1_genomic.gbff.gff \| cut -f3,4,5,7 > GCA_000153565.1_ASM15356v1_genomic.coordinates.txt` extract the subsequences from the fasta files using the coordinates saved in `GCA_000153565.1_ASM15356v1_genomic.coordinates.txt` For the last item you may use BioPerl::SeqIO $seq->subseq(start..stop) but make sure you get the reverse translation of the seqs in the negative strand A 4 year old monk	[reply] [d/l] [select]
Re^2: Debugging Bioperl warnings for Genebank files that are missing info by Sosi (Sexton) on Oct 24, 2014 at 14:27 UTC
Thanks! I can get your two first points by simply downloading the files from where I got the .gbff (i.e. this link has the .gbff, .gff, and .faa files for one of the organisms), though I can see that using genbank2gff3.pl is probably easier/faster but I'm not sure the output will be the same. I believe BioPerl::SeqIO already takes into account the reverse complement there, but I'll make sure. More importantly, if those .gbff don't have the sequences as in my examples, extracting them will not be possible anyway. I am wondering why many of these files are wrongly deposited in the first place. And if they are, why isn't this automatically corrected by NCBI itself... I was expecting this to be much easier than it is proving to be. but I guess this isn't the place to rant about this eh :) Off topic:If you find an easier way to get the CDS and the protein sequences please let me know. Even if it involves not using Genbank, as long as I can use NCBI's FTP everything is fine...	[reply]
Re^3: Debugging Bioperl warnings for Genebank files that are missing info by erix (Prior) on Oct 25, 2014 at 14:33 UTC
If you find an easier way to get the CDS and the protein sequences please let me know. Even if it involves not using Genbank, as long as I can use NCBI's FTP everything is fine... "not using Genbank" and "as long as I can use NCBI's FTP" seems a bit contradictory. GenBank is NCBI, no? Perhaps UniProt.org has what you want? (You haven't said what it is that you're after...) For example, 742726 being your tax_id (taxonomy ID), it delivers the proteins with http://www.uniprot.org/uniprot/?query=taxonomy:742726&sort=score I see there is a gff tab for download (easily turned into a URL). As far as I know, NCBI, Ensembl, and UniProt exchange sequences and annotation regularly.	[reply]
Re^4: Debugging Bioperl warnings for Genebank files that are missing info by Sosi (Sexton) on Oct 27, 2014 at 10:30 UTC
Re^3: Debugging Bioperl warnings for Genebank files that are missing info by Anonymous Monk on Oct 25, 2014 at 13:43 UTC
By converting the gb files to gff you're only getting a list of annotation information and coordinates about the genes within that species, so yes, the output will not be the same but that should not be a problem because having a fasta file handy you would parse that gff list for the CDS coordinates and extract the corresponding subsequences from the fasta file. You'll be surprised by the amount of incorrect deposition to genbank but that really depends on the interest on the organism, for instance, human and bacterial entries are more appropriately kept, if the deposited data are part of a published study then it is on the authors to share their data as a prerequisite to publishing, many of them don't have dedicated support to ensure that their data is of good integrity. You can always exclude an empty file if it doesn't pass your inclusion standards. EMBL and ENSEMBLE have an FTP facility as well and it seems they have information about the species you're studying. Check the file new_genomes.txt in this link. Explore it and if you've more queries you'd be better directing them at platforms such as www.biostars.org or seqanswers. As far as my approach is concerned, I guess I have suggested one of the simplest work-arounds since I've been repeatedly bitten by genbank data myself, but that doesn't mean you won't bump into other problems in the other databases though ;) A 4 year old monk	[reply] [d/l]