Thanks! I can cover your first two points by simply downloading the files from the same place I got the .gbff (i.e. this link has the .gbff, .gff, and .faa files for one of the organisms). I can see that using genbank2gff3.pl is probably easier/faster, but I'm not sure the output will be the same. I believe BioPerl's Bio::SeqIO already takes the reverse complement into account there, but I'll make sure.
More importantly, if those .gbff files don't contain the sequences, as in my examples, extracting them won't be possible anyway.
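To be clear about what I mean by "taking the reverse complement into account", here is a minimal sketch in plain Python. It is not BioPerl's implementation and not any NCBI tool; `reverse_complement` and `extract_cds` are hypothetical helper names, and I'm assuming simple, unspliced features with 1-based inclusive coordinates as written in GenBank feature locations. It also shows the second point: with an empty sequence there is nothing to extract.

```python
# Hypothetical helpers for illustration only (not BioPerl, not an NCBI tool).
# Assumes unspliced features and plain A/C/G/T sequence (no ambiguity codes).

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds(genome: str, start: int, end: int, strand: int) -> str:
    """Return the CDS nucleotides for one feature.

    start/end are 1-based and inclusive, as in GenBank feature locations.
    strand is 1 for the forward strand and -1 for the reverse strand.
    """
    if not genome:
        # A record deposited without its sequence cannot yield any CDS.
        raise ValueError("record has no sequence; nothing to extract")
    sub = genome[start - 1:end]
    # Features on the reverse strand must be reverse-complemented
    # so that the result reads 5'->3' starting at the start codon.
    return reverse_complement(sub) if strand == -1 else sub


if __name__ == "__main__":
    # Same toy gene placed on the forward and on the reverse strand.
    print(extract_cds("TTATGGCCTAAGG", 3, 11, 1))   # ATGGCCTAA
    print(extract_cds("CCTTAGGCCATAA", 3, 11, -1))  # ATGGCCTAA
```

Both calls give the same CDS, which is the check I would run against whatever Bio::SeqIO returns for a feature on the minus strand.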
I am wondering why so many of these files are wrongly deposited in the first place, and if they are, why NCBI itself doesn't correct this automatically... I was expecting this to be much easier than it is proving to be, but I guess this isn't the place to rant about this, eh :)
Off topic: If you find an easier way to get the CDS and protein sequences, please let me know. Even if it involves not using GenBank, as long as I can use NCBI's FTP, everything is fine...