Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Debugging Bioperl warnings for Genebank files that are missing info

by Sosi (Sexton)
on Oct 23, 2014 at 22:33 UTC ( [id://1104819]=perlquestion: print w/replies, xml ) Need Help??

Sosi has asked for the wisdom of the Perl Monks concerning the following question:

Oh thy masters of Perl wisdom, please enlighten me. I am struggling with some problems reading Genebank files (.gbff) through Bioperl.

I am trying to extract CDS and translation sequences using $feat->spliced_seq->seq and $feat->get_tag_values("translation")). My problem is that many of the genebank files are incomplete or are not matching the "correct" (example) format:

LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1 +999 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and +Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. ... JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale Univer +sity, New Haven, CT, USA FEATURES Location/Qualifiers source 1..5028 /organism="Saccharomyces cerevisiae" /db_xref="taxon:4932" /chromosome="IX" /map="9" CDS <1..206 /codon_start=3 /product="TCP1-beta" /protein_id="AAA98665.1" /db_xref="GI:1293614" /translation="SSIYN...RTANRQHM" gene 687..3158 /gene="AXL2" CDS 687..3158 /gene="AXL2" /note="plasma membrane glycoprotein" /codon_start=1 /function="required for axial budding pattern of +S. cerevisiae" /product="Axl2p" /protein_id="AAA98666.1" /db_xref="GI:1293615" /translation="MTQL...GRIPEML" gene complement(3300..4037) /gene="REV7" CDS complement(3300..4037) /gene="REV7" /codon_start=1 /product="Rev7p" /protein_id="AAA98667.1" /db_xref="GI:1293616" /translation="MNRWVEK...IFGSLF" ORIGIN 1 gatcctccat ata...tgatc //

If all files were in the correct format, it would be relatively straightforward to extract FASTA files with each gene or protein in the format

>FEAT | FEAT | FEAT | FEAT agcgtgacgattvcatgt...catgcatgcag

and

>FEAT | FEAT | FEAT | FEAT AYTKRCL...MRTCKYS

Many of the files that I have either do not have the "Origin" field at the bottom (example), or have multiple "Origin" fields (example), each just after a "CDS" field, resulting in warnings and die errors that prevent me from doing what I need to do. Most of the warnings indicate that Bioperl hasn't been able to infer the sequence (because they are lacking that "ORIGIN" field)

So my questions are the following:

1. Could you give me any tips on how I can find which of the files have this incorrect file format? I am figuring that a if($feat->spliced_seq->seq) fails, push those filenames to a list and manually download them again :( But I haven't been able to test this correctly yet, and maybe there is something already in Bioperl for these cases?

2. How can I prevent the automatic die everytime a warning comes out, so that I can find the whole list of files that is not designed as it should? Curiously, through the ~1000 files that I am running, the script runs for a few hundreds, outputing those errors but quits at some point. I must say that I have use autodie; in the preamble, but I think the die command is being given by Bioperl.

Note: this was now (27 Oct) crossposted on Biostars in here

Replies are listed 'Best First'.
Re: Debugging Bioperl warnings for Genebank files that are missing info
by biohisham (Priest) on Oct 24, 2014 at 00:43 UTC

    Yes at times the genbank files can be problematic in that they are incomplete or that BioPerl gets cranky, you have not provided a code that I can test but if you may consider the following workaround, work with the fasta files in conjunction with the feature table provided in the genbank files

    • convert the genbank to gff through (genbank2gff3.pl)
    • convert the genbank files to fasta or download the fasta equivalent
    • parse the gff files and extract the CDs with their coordinate information

      perl -F'\t' -lane 'if($F[2] eq "CDS"){print}' GCA_000153565.1_ASM15356v1_genomic.gbff.gff | cut -f3,4,5,7 > GCA_000153565.1_ASM15356v1_genomic.coordinates.txt

    • extract the subsequences from the fasta files using the coordinates saved in GCA_000153565.1_ASM15356v1_genomic.coordinates.txt

    For the last item you may use BioPerl::SeqIO $seq->subseq(start..stop) but make sure you get the reverse translation of the seqs in the negative strand


    A 4 year old monk

      Thanks! I can get your two first points by simply downloading the files from where I got the .gbff (i.e. this link has the .gbff, .gff, and .faa files for one of the organisms), though I can see that using genbank2gff3.pl is probably easier/faster but I'm not sure the output will be the same. I believe BioPerl::SeqIO already takes into account the reverse complement there, but I'll make sure.

      More importantly, if those .gbff don't have the sequences as in my examples, extracting them will not be possible anyway.

      I am wondering why many of these files are wrongly deposited in the first place. And if they are, why isn't this automatically corrected by NCBI itself... I was expecting this to be much easier than it is proving to be. but I guess this isn't the place to rant about this eh :)

      Off topic:If you find an easier way to get the CDS and the protein sequences please let me know. Even if it involves not using Genbank, as long as I can use NCBI's FTP everything is fine...

        If you find an easier way to get the CDS and the protein sequences please let me know. Even if it involves not using Genbank, as long as I can use NCBI's FTP everything is fine...

        "not using Genbank" and "as long as I can use NCBI's FTP" seems a bit contradictory. GenBank is NCBI, no?

        Perhaps UniProt.org has what you want? (You haven't said what it is that you're after...)

        For example, 742726 being your tax_id (taxonomy ID), it delivers the proteins with http://www.uniprot.org/uniprot/?query=taxonomy:742726&sort=score

        I see there is a gff tab for download (easily turned into a URL).

        As far as I know, NCBI, Ensembl, and UniProt exchange sequences and annotation regularly.

        By converting the gb files to gff you're only getting a list of annotation information and coordinates about the genes within that species, so yes, the output will not be the same but that should not be a problem because having a fasta file handy you would parse that gff list for the CDS coordinates and extract the corresponding subsequences from the fasta file.

        You'll be surprised by the amount of incorrect deposition to genbank but that really depends on the interest on the organism, for instance, human and bacterial entries are more appropriately kept, if the deposited data are part of a published study then it is on the authors to share their data as a prerequisite to publishing, many of them don't have dedicated support to ensure that their data is of good integrity. You can always exclude an empty file if it doesn't pass your inclusion standards.

        EMBL and ENSEMBLE have an FTP facility as well and it seems they have information about the species you're studying. Check the file new_genomes.txt in this link. Explore it and if you've more queries you'd be better directing them at platforms such as www.biostars.org or seqanswers. As far as my approach is concerned, I guess I have suggested one of the simplest work-arounds since I've been repeatedly bitten by genbank data myself, but that doesn't mean you won't bump into other problems in the other databases though ;)


        A 4 year old monk

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1104819]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (5)
As of 2024-04-19 18:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found