Hi Perlmonks,
I am interested in retrieving the coding sequence (cds) of a gene using accession number in a perl script. I searched for a script in the web and found a script that retrieves the entire sequence in FASTA format (not the cds directly). In fact, there is a hyperlink at the word CDS in the particular GenBank page of NCBI and when the link is clicked it shows the cds of the gene. Is it possible to get the cds sequence directly using a perl script and internet connectivity? I welcome suggestions from the perlmonks.
Here goes the script:
#########################################################
# Perl code to get complete sequence in FASTA format from GenBank page
#########################################################
#!/usr/bin/perl
use warnings;
use strict;
use Bio::DB::GenBank;
use Bio::SeqIO;
my $gb =new Bio::DB::GenBank;
my $acc="NM_021817";
my $seq1=$gb->get_Seq_by_acc($acc);
my $sequence=$seq1->seq();
my $description=$seq1->desc();
print "\n
$description
$sequence\n";
exit;
The script produces the following output in cmd screen:
C:\Users\Desktop>seq.pl
Homo sapiens hyaluronan and proteoglycan link protein 2 (HAPLN2), mRNA
+.
TTGCCATTACCCACAGAATAAAGAAAGGGGCCCTGTTATTCAACAATAGGGGAAAAGACAGAGACAATGG
+GAAATTGTGC
TTCCGATGGGGTGGGGACTGAGAAGGAAAGGACAGACAGACAGACAGACAGGGGGTTGTACAGAAGAGGT
+CCGGTTTCTT
GAAGCAGCTGGAAGTCCTGGATAGTTCCCACCTGAAAGTCTGTTTGCAAAGGCAATGCGCACTCAGGCAC
+CAGAGGGCAG
AGGGGCTCAAGTTCCAGGGTTTTAAGGTGCTTGGAACTCCCAGGAGCCTGGCAAACCTTCATCCAGAACC
+TCTTCCTCAA
GCAAGACAAAAAGCTGCTAAGCACTGCTCCCTCCGTCTCTGTGAAGAGACCAGCTTCTAACAGACGGTGC
+CGGGCTGACC
CCCCATCATGCCAGGCTGGCTCACCCTCCCCACACTCTGCCGCTTCCTTCTTTGGGCCTTCACCATCTTC
+CACAAAGCCC
AAGGAGACCCAGCATCCCACCCGGGCCCCCACTACCTCCTGCCCCCCATCCACGAGGTCATTCACTCTCA
+TCGTGGGGCC
ACGGCCACGCTGCCCTGCGTCCTGGGCACCACGCCTCCCAGCTACAAGGTGCGCTGGAGCAAGGTGGAGC
+CTGGGGAGCT
CCGGGAAACGCTGATCCTCATCACCAACGGACTGCACGCCCGGGGGTATGGGCCCCTGGGAGGGCGCGCC
+AGGATGCGGA
GGGGGCATCGACTAGACGCCTCCCTGGTCATCGCGGGCGTGCGCCTGGAGGACGAGGGCCGGTACCGCTG
+CGAGCTCATC
AACGGCATCGAGGACGAGAGCGTGGCGCTGACCTTGAGCTTGGAGGGTGTGGTGTTTCCGTACCAACCCA
+GCCGGGGCCG
GTACCAGTTCAATTACTACGAGGCGAAGCAGGCGTGCGAGGAGCAGGACGGACGCCTGGCCACCTACTCC
+CAGCTCTACC
AGGCTTGGACCGAGGGTCTGGACTGGTGTAACGCGGGCTGGCTGCTCGAGGGCTCCGTGCGCTACCCTGT
+GCTCACCGCA
CGCGCCCCGTGCGGCGGCCGAGGCCGGCCCGGGATCCGCAGCTACGGACCCCGCGACCGGATGCGCGACC
+GCTACGACGC
CTTCTGCTTCACCTCCGCGCTGGCGGGCCAAGTGTTCTTCGTGCCCGGGCGGCTGACGCTGTCTGAAGCC
+CACGCGGCGT
GCCGGCGACGCGGCGCCGTGGTGGCCAAGGTTGGGCACCTCTACGCCGCCTGGAAGTTTTCGGGGCTAGA
+CCAGTGCGAC
GGCGGCTGGCTGGCTGACGGCAGTGTGCGCTTCCCAATCACCACGCCGAGGCCGCGCTGCGGGGGGCTCC
+CGGATCCCGG
AGTGCGCAGTTTCGGCTTCCCCAGGCCCCAACAGGCAGCCTATGGGACCTACTGCTACGCCGAGAATTAG
+GCGCCCACCG
TGTCCCCTCCAGCGCGCGCGAAGAAGCTTGGGAGTCGTGGCGGGGGTCTCTCGCCACCCCTTTCCGGAGA
+GCCTCCCCTC
CCTCCAGACCCGGAGCGGCCTCTCCAGACCTGCCTTCCCAGCCGGGGGCTGCGGGCCTCGGACCCCGGCT
+GGCCCGGCGG
CGGGGAGGGGAGGCGGGGGCGCCTCCGGCGGCGAGATGCAGAGGTGACCCTCGGACCCGCTGCCGTTCGC
+GAACCCTAGC
AGAGGACTCAGCCACCGCCGGGGGGAGGGTGAGGCGGCCGGGGGCATTAACTGACCTCTGAGTACAGCAA
+TAAAATAACC
TGGGGATCTTTAAAAAAAAAAAAAAAAAAAAAAAA
C:\Users\Desktop>
I would like to get the cds sequence as given below:
>NM_021817.2:408-1430 Homo sapiens hyaluronan and proteoglycan link p
+rotein 2 (HAPLN2), mRNA
ATGCCAGGCTGGCTCACCCTCCCCACACTCTGCCGCTTCCTTCTTTGGGCCTTCACCATCTTCCACAAAG
CCCAAGGAGACCCAGCATCCCACCCGGGCCCCCACTACCTCCTGCCCCCCATCCACGAGGTCATTCACTC
TCATCGTGGGGCCACGGCCACGCTGCCCTGCGTCCTGGGCACCACGCCTCCCAGCTACAAGGTGCGCTGG
AGCAAGGTGGAGCCTGGGGAGCTCCGGGAAACGCTGATCCTCATCACCAACGGACTGCACGCCCGGGGGT
ATGGGCCCCTGGGAGGGCGCGCCAGGATGCGGAGGGGGCATCGACTAGACGCCTCCCTGGTCATCGCGGG
CGTGCGCCTGGAGGACGAGGGCCGGTACCGCTGCGAGCTCATCAACGGCATCGAGGACGAGAGCGTGGCG
CTGACCTTGAGCTTGGAGGGTGTGGTGTTTCCGTACCAACCCAGCCGGGGCCGGTACCAGTTCAATTACT
ACGAGGCGAAGCAGGCGTGCGAGGAGCAGGACGGACGCCTGGCCACCTACTCCCAGCTCTACCAGGCTTG
GACCGAGGGTCTGGACTGGTGTAACGCGGGCTGGCTGCTCGAGGGCTCCGTGCGCTACCCTGTGCTCACC
GCACGCGCCCCGTGCGGCGGCCGAGGCCGGCCCGGGATCCGCAGCTACGGACCCCGCGACCGGATGCGCG
ACCGCTACGACGCCTTCTGCTTCACCTCCGCGCTGGCGGGCCAAGTGTTCTTCGTGCCCGGGCGGCTGAC
GCTGTCTGAAGCCCACGCGGCGTGCCGGCGACGCGGCGCCGTGGTGGCCAAGGTTGGGCACCTCTACGCC
GCCTGGAAGTTTTCGGGGCTAGACCAGTGCGACGGCGGCTGGCTGGCTGACGGCAGTGTGCGCTTCCCAA
TCACCACGCCGAGGCCGCGCTGCGGGGGGCTCCCGGATCCCGGAGTGCGCAGTTTCGGCTTCCCCAGGCC
CCAACAGGCAGCCTATGGGACCTACTGCTACGCCGAGAATTAG