Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ?

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks,

I am interested in retrieving the coding sequence (cds) of a gene using accession number in a perl script. I searched for a script in the web and found a script that retrieves the entire sequence in FASTA format (not the cds directly). In fact, there is a hyperlink at the word CDS in the particular GenBank page of NCBI and when the link is clicked it shows the cds of the gene. Is it possible to get the cds sequence directly using a perl script and internet connectivity? I welcome suggestions from the perlmonks. Here goes the script:

 #########################################################
# Perl code to get complete sequence in FASTA format from GenBank page
#########################################################
#!/usr/bin/perl
use warnings;
use strict; 
use Bio::DB::GenBank;
use Bio::SeqIO;   
my $gb =new Bio::DB::GenBank; 
my $acc="NM_021817";  
my $seq1=$gb->get_Seq_by_acc($acc);
my $sequence=$seq1->seq();
my $description=$seq1->desc();
print "\n
$description
$sequence\n"; 
exit;
[download]

The script produces the following output in cmd screen:

 C:\Users\Desktop>seq.pl
Homo sapiens hyaluronan and proteoglycan link protein 2 (HAPLN2), mRNA
+.
TTGCCATTACCCACAGAATAAAGAAAGGGGCCCTGTTATTCAACAATAGGGGAAAAGACAGAGACAATGG
+GAAATTGTGC
TTCCGATGGGGTGGGGACTGAGAAGGAAAGGACAGACAGACAGACAGACAGGGGGTTGTACAGAAGAGGT
+CCGGTTTCTT
GAAGCAGCTGGAAGTCCTGGATAGTTCCCACCTGAAAGTCTGTTTGCAAAGGCAATGCGCACTCAGGCAC
+CAGAGGGCAG
AGGGGCTCAAGTTCCAGGGTTTTAAGGTGCTTGGAACTCCCAGGAGCCTGGCAAACCTTCATCCAGAACC
+TCTTCCTCAA
GCAAGACAAAAAGCTGCTAAGCACTGCTCCCTCCGTCTCTGTGAAGAGACCAGCTTCTAACAGACGGTGC
+CGGGCTGACC
CCCCATCATGCCAGGCTGGCTCACCCTCCCCACACTCTGCCGCTTCCTTCTTTGGGCCTTCACCATCTTC
+CACAAAGCCC
AAGGAGACCCAGCATCCCACCCGGGCCCCCACTACCTCCTGCCCCCCATCCACGAGGTCATTCACTCTCA
+TCGTGGGGCC
ACGGCCACGCTGCCCTGCGTCCTGGGCACCACGCCTCCCAGCTACAAGGTGCGCTGGAGCAAGGTGGAGC
+CTGGGGAGCT
CCGGGAAACGCTGATCCTCATCACCAACGGACTGCACGCCCGGGGGTATGGGCCCCTGGGAGGGCGCGCC
+AGGATGCGGA
GGGGGCATCGACTAGACGCCTCCCTGGTCATCGCGGGCGTGCGCCTGGAGGACGAGGGCCGGTACCGCTG
+CGAGCTCATC
AACGGCATCGAGGACGAGAGCGTGGCGCTGACCTTGAGCTTGGAGGGTGTGGTGTTTCCGTACCAACCCA
+GCCGGGGCCG
GTACCAGTTCAATTACTACGAGGCGAAGCAGGCGTGCGAGGAGCAGGACGGACGCCTGGCCACCTACTCC
+CAGCTCTACC
AGGCTTGGACCGAGGGTCTGGACTGGTGTAACGCGGGCTGGCTGCTCGAGGGCTCCGTGCGCTACCCTGT
+GCTCACCGCA
CGCGCCCCGTGCGGCGGCCGAGGCCGGCCCGGGATCCGCAGCTACGGACCCCGCGACCGGATGCGCGACC
+GCTACGACGC
CTTCTGCTTCACCTCCGCGCTGGCGGGCCAAGTGTTCTTCGTGCCCGGGCGGCTGACGCTGTCTGAAGCC
+CACGCGGCGT
GCCGGCGACGCGGCGCCGTGGTGGCCAAGGTTGGGCACCTCTACGCCGCCTGGAAGTTTTCGGGGCTAGA
+CCAGTGCGAC
GGCGGCTGGCTGGCTGACGGCAGTGTGCGCTTCCCAATCACCACGCCGAGGCCGCGCTGCGGGGGGCTCC
+CGGATCCCGG
AGTGCGCAGTTTCGGCTTCCCCAGGCCCCAACAGGCAGCCTATGGGACCTACTGCTACGCCGAGAATTAG
+GCGCCCACCG
TGTCCCCTCCAGCGCGCGCGAAGAAGCTTGGGAGTCGTGGCGGGGGTCTCTCGCCACCCCTTTCCGGAGA
+GCCTCCCCTC
CCTCCAGACCCGGAGCGGCCTCTCCAGACCTGCCTTCCCAGCCGGGGGCTGCGGGCCTCGGACCCCGGCT
+GGCCCGGCGG
CGGGGAGGGGAGGCGGGGGCGCCTCCGGCGGCGAGATGCAGAGGTGACCCTCGGACCCGCTGCCGTTCGC
+GAACCCTAGC
AGAGGACTCAGCCACCGCCGGGGGGAGGGTGAGGCGGCCGGGGGCATTAACTGACCTCTGAGTACAGCAA
+TAAAATAACC
TGGGGATCTTTAAAAAAAAAAAAAAAAAAAAAAAA
C:\Users\Desktop>
[download]

I would like to get the cds sequence as given below:

 >NM_021817.2:408-1430 Homo sapiens hyaluronan and proteoglycan link p
+rotein 2 (HAPLN2), mRNA
ATGCCAGGCTGGCTCACCCTCCCCACACTCTGCCGCTTCCTTCTTTGGGCCTTCACCATCTTCCACAAAG
CCCAAGGAGACCCAGCATCCCACCCGGGCCCCCACTACCTCCTGCCCCCCATCCACGAGGTCATTCACTC
TCATCGTGGGGCCACGGCCACGCTGCCCTGCGTCCTGGGCACCACGCCTCCCAGCTACAAGGTGCGCTGG
AGCAAGGTGGAGCCTGGGGAGCTCCGGGAAACGCTGATCCTCATCACCAACGGACTGCACGCCCGGGGGT
ATGGGCCCCTGGGAGGGCGCGCCAGGATGCGGAGGGGGCATCGACTAGACGCCTCCCTGGTCATCGCGGG
CGTGCGCCTGGAGGACGAGGGCCGGTACCGCTGCGAGCTCATCAACGGCATCGAGGACGAGAGCGTGGCG
CTGACCTTGAGCTTGGAGGGTGTGGTGTTTCCGTACCAACCCAGCCGGGGCCGGTACCAGTTCAATTACT
ACGAGGCGAAGCAGGCGTGCGAGGAGCAGGACGGACGCCTGGCCACCTACTCCCAGCTCTACCAGGCTTG
GACCGAGGGTCTGGACTGGTGTAACGCGGGCTGGCTGCTCGAGGGCTCCGTGCGCTACCCTGTGCTCACC
GCACGCGCCCCGTGCGGCGGCCGAGGCCGGCCCGGGATCCGCAGCTACGGACCCCGCGACCGGATGCGCG
ACCGCTACGACGCCTTCTGCTTCACCTCCGCGCTGGCGGGCCAAGTGTTCTTCGTGCCCGGGCGGCTGAC
GCTGTCTGAAGCCCACGCGGCGTGCCGGCGACGCGGCGCCGTGGTGGCCAAGGTTGGGCACCTCTACGCC
GCCTGGAAGTTTTCGGGGCTAGACCAGTGCGACGGCGGCTGGCTGGCTGACGGCAGTGTGCGCTTCCCAA
TCACCACGCCGAGGCCGCGCTGCGGGGGGCTCCCGGATCCCGGAGTGCGCAGTTTCGGCTTCCCCAGGCC
CCAACAGGCAGCCTATGGGACCTACTGCTACGCCGAGAATTAG
[download]

Comment on Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ? Select or Download Code

Replies are listed 'Best First'.
Re: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ? by poj (Abbot) on Jan 23, 2017 at 21:31 UTC
I know next to nothing about bioinformatics but perhaps this suggestion helps. #!/usr/bin/perl use warnings; use strict; use Bio::DB::GenBank; use Bio::SeqIO; my $gb = new Bio::DB::GenBank; my $acc = "NM_021817"; my $seq1 = $gb->get_Seq_by_acc($acc); my $sequence = $seq1->seq; for my $feat ($seq1->get_SeqFeatures){ if ($feat->primary_tag eq 'CDS'){ print $feat->get_tag_values('product'),"\n"; print $feat->get_tag_values('gene'),"\n"; my $start = $feat->start; my $len = $feat->length; my $cds = substr($sequence,$start-1,$len); use Text::Wrap; # put this at top $Text::Wrap::columns = 71; print wrap('', '', $cds); } } [download] poj	[reply] [d/l]
Re^2: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ? by supriyoch_2008 (Monk) on Jan 24, 2017 at 14:23 UTC
Hi poj Thank you for your valuable suggestions. The code works nicely and my problem is now solved. With kind regards, supriyoch_2008	[reply]
Re: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ? by Lotus1 (Vicar) on Jan 23, 2017 at 20:22 UTC
In fact, there is a hyperlink at the word CDS in the particular GenBank page of NCBI and when the link is clicked it shows the cds of the gene. Is it possible to get the cds sequence directly using a perl script.... It sounds to me like you have a website that provides the 'cds' already prepared and you are asking us if there is a way to download the text from the webpage. If that is the case then yes you can do that. Given an URL there are several ways to download a page. On *nix platforms you can use wget with the perl system function or with `qx()`. From Perl directly you can use the LWP::Simple module. But it sounds like you need to give the number or name of the gene to get to the correct page before the CDS link appears. If that is the case you might be able to use WWW::Mechanize. It doesn't support JavaScript, hence the word 'might'. Are these pages open to the public? What is the CDS format? I didn't see it in a quick web search. Edit: I found a few solutions here that seem to do part of what you are requesting.	[reply] [d/l]
Re^2: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ? by supriyoch_2008 (Monk) on Jan 24, 2017 at 14:16 UTC
Hi Lotus1 Thank you very much for your suggestions. Regards. supriyoch_2008	[reply]
Re: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ? by ww (Archbishop) on Jan 23, 2017 at 16:33 UTC
I'm not at all sure anything you'll find on CPAN will be helpful, but there are a number of packages which seem related to your issue using the search terms: `site:CPAN.org cds GenBank`: My naive Google search results lists 153 possibilities. Spirit of the Monastery	[reply] [d/l]
Re^2: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ? by supriyoch_2008 (Monk) on Jan 24, 2017 at 14:20 UTC
Hi ww Thank you for informing me about 153 possibilities. I shall look into those to find the solution. Regards supriyoch_2008	[reply]
Re: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ? by crusty_collins (Friar) on Jan 23, 2017 at 19:20 UTC
There are some modules on CPAN for this. Just search google for perl coding sequence "We can't all be happy, we can't all be rich, we can't all be lucky � and it would be so much less fun if we were. There must be the dark background to show up the bright colours." Jean Rhys (1890-1979)	[reply]
Re^2: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ? by supriyoch_2008 (Monk) on Jan 24, 2017 at 14:28 UTC
Hi crusty_collins Thank you for your suggestions regarding modules on CPAN. I liked the quote of Jean Rhys that you cited. With deep regards, supriyoch_2008	[reply]


We don't bite newbies here... much
	PerlMonks