Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ?

by supriyoch_2008 (Monk)
on Jan 23, 2017 at 14:28 UTC ( [id://1180157]=perlquestion: print w/replies, xml ) Need Help??

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks,

I am interested in retrieving the coding sequence (cds) of a gene using accession number in a perl script. I searched for a script in the web and found a script that retrieves the entire sequence in FASTA format (not the cds directly). In fact, there is a hyperlink at the word CDS in the particular GenBank page of NCBI and when the link is clicked it shows the cds of the gene. Is it possible to get the cds sequence directly using a perl script and internet connectivity? I welcome suggestions from the perlmonks. Here goes the script:

######################################################### # Perl code to get complete sequence in FASTA format from GenBank page ######################################################### #!/usr/bin/perl use warnings; use strict; use Bio::DB::GenBank; use Bio::SeqIO; my $gb =new Bio::DB::GenBank; my $acc="NM_021817"; my $seq1=$gb->get_Seq_by_acc($acc); my $sequence=$seq1->seq(); my $description=$seq1->desc(); print "\n $description $sequence\n"; exit;

The script produces the following output in cmd screen:

C:\Users\Desktop>seq.pl Homo sapiens hyaluronan and proteoglycan link protein 2 (HAPLN2), mRNA +. TTGCCATTACCCACAGAATAAAGAAAGGGGCCCTGTTATTCAACAATAGGGGAAAAGACAGAGACAATGG +GAAATTGTGC TTCCGATGGGGTGGGGACTGAGAAGGAAAGGACAGACAGACAGACAGACAGGGGGTTGTACAGAAGAGGT +CCGGTTTCTT GAAGCAGCTGGAAGTCCTGGATAGTTCCCACCTGAAAGTCTGTTTGCAAAGGCAATGCGCACTCAGGCAC +CAGAGGGCAG AGGGGCTCAAGTTCCAGGGTTTTAAGGTGCTTGGAACTCCCAGGAGCCTGGCAAACCTTCATCCAGAACC +TCTTCCTCAA GCAAGACAAAAAGCTGCTAAGCACTGCTCCCTCCGTCTCTGTGAAGAGACCAGCTTCTAACAGACGGTGC +CGGGCTGACC CCCCATCATGCCAGGCTGGCTCACCCTCCCCACACTCTGCCGCTTCCTTCTTTGGGCCTTCACCATCTTC +CACAAAGCCC AAGGAGACCCAGCATCCCACCCGGGCCCCCACTACCTCCTGCCCCCCATCCACGAGGTCATTCACTCTCA +TCGTGGGGCC ACGGCCACGCTGCCCTGCGTCCTGGGCACCACGCCTCCCAGCTACAAGGTGCGCTGGAGCAAGGTGGAGC +CTGGGGAGCT CCGGGAAACGCTGATCCTCATCACCAACGGACTGCACGCCCGGGGGTATGGGCCCCTGGGAGGGCGCGCC +AGGATGCGGA GGGGGCATCGACTAGACGCCTCCCTGGTCATCGCGGGCGTGCGCCTGGAGGACGAGGGCCGGTACCGCTG +CGAGCTCATC AACGGCATCGAGGACGAGAGCGTGGCGCTGACCTTGAGCTTGGAGGGTGTGGTGTTTCCGTACCAACCCA +GCCGGGGCCG GTACCAGTTCAATTACTACGAGGCGAAGCAGGCGTGCGAGGAGCAGGACGGACGCCTGGCCACCTACTCC +CAGCTCTACC AGGCTTGGACCGAGGGTCTGGACTGGTGTAACGCGGGCTGGCTGCTCGAGGGCTCCGTGCGCTACCCTGT +GCTCACCGCA CGCGCCCCGTGCGGCGGCCGAGGCCGGCCCGGGATCCGCAGCTACGGACCCCGCGACCGGATGCGCGACC +GCTACGACGC CTTCTGCTTCACCTCCGCGCTGGCGGGCCAAGTGTTCTTCGTGCCCGGGCGGCTGACGCTGTCTGAAGCC +CACGCGGCGT GCCGGCGACGCGGCGCCGTGGTGGCCAAGGTTGGGCACCTCTACGCCGCCTGGAAGTTTTCGGGGCTAGA +CCAGTGCGAC GGCGGCTGGCTGGCTGACGGCAGTGTGCGCTTCCCAATCACCACGCCGAGGCCGCGCTGCGGGGGGCTCC +CGGATCCCGG AGTGCGCAGTTTCGGCTTCCCCAGGCCCCAACAGGCAGCCTATGGGACCTACTGCTACGCCGAGAATTAG +GCGCCCACCG TGTCCCCTCCAGCGCGCGCGAAGAAGCTTGGGAGTCGTGGCGGGGGTCTCTCGCCACCCCTTTCCGGAGA +GCCTCCCCTC CCTCCAGACCCGGAGCGGCCTCTCCAGACCTGCCTTCCCAGCCGGGGGCTGCGGGCCTCGGACCCCGGCT +GGCCCGGCGG CGGGGAGGGGAGGCGGGGGCGCCTCCGGCGGCGAGATGCAGAGGTGACCCTCGGACCCGCTGCCGTTCGC +GAACCCTAGC AGAGGACTCAGCCACCGCCGGGGGGAGGGTGAGGCGGCCGGGGGCATTAACTGACCTCTGAGTACAGCAA +TAAAATAACC TGGGGATCTTTAAAAAAAAAAAAAAAAAAAAAAAA C:\Users\Desktop>

I would like to get the cds sequence as given below:

>NM_021817.2:408-1430 Homo sapiens hyaluronan and proteoglycan link p +rotein 2 (HAPLN2), mRNA ATGCCAGGCTGGCTCACCCTCCCCACACTCTGCCGCTTCCTTCTTTGGGCCTTCACCATCTTCCACAAAG CCCAAGGAGACCCAGCATCCCACCCGGGCCCCCACTACCTCCTGCCCCCCATCCACGAGGTCATTCACTC TCATCGTGGGGCCACGGCCACGCTGCCCTGCGTCCTGGGCACCACGCCTCCCAGCTACAAGGTGCGCTGG AGCAAGGTGGAGCCTGGGGAGCTCCGGGAAACGCTGATCCTCATCACCAACGGACTGCACGCCCGGGGGT ATGGGCCCCTGGGAGGGCGCGCCAGGATGCGGAGGGGGCATCGACTAGACGCCTCCCTGGTCATCGCGGG CGTGCGCCTGGAGGACGAGGGCCGGTACCGCTGCGAGCTCATCAACGGCATCGAGGACGAGAGCGTGGCG CTGACCTTGAGCTTGGAGGGTGTGGTGTTTCCGTACCAACCCAGCCGGGGCCGGTACCAGTTCAATTACT ACGAGGCGAAGCAGGCGTGCGAGGAGCAGGACGGACGCCTGGCCACCTACTCCCAGCTCTACCAGGCTTG GACCGAGGGTCTGGACTGGTGTAACGCGGGCTGGCTGCTCGAGGGCTCCGTGCGCTACCCTGTGCTCACC GCACGCGCCCCGTGCGGCGGCCGAGGCCGGCCCGGGATCCGCAGCTACGGACCCCGCGACCGGATGCGCG ACCGCTACGACGCCTTCTGCTTCACCTCCGCGCTGGCGGGCCAAGTGTTCTTCGTGCCCGGGCGGCTGAC GCTGTCTGAAGCCCACGCGGCGTGCCGGCGACGCGGCGCCGTGGTGGCCAAGGTTGGGCACCTCTACGCC GCCTGGAAGTTTTCGGGGCTAGACCAGTGCGACGGCGGCTGGCTGGCTGACGGCAGTGTGCGCTTCCCAA TCACCACGCCGAGGCCGCGCTGCGGGGGGCTCCCGGATCCCGGAGTGCGCAGTTTCGGCTTCCCCAGGCC CCAACAGGCAGCCTATGGGACCTACTGCTACGCCGAGAATTAG
  • Comment on Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ?
  • Select or Download Code

Replies are listed 'Best First'.
Re: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ?
by poj (Abbot) on Jan 23, 2017 at 21:31 UTC

    I know next to nothing about bioinformatics but perhaps this suggestion helps.

    #!/usr/bin/perl use warnings; use strict; use Bio::DB::GenBank; use Bio::SeqIO; my $gb = new Bio::DB::GenBank; my $acc = "NM_021817"; my $seq1 = $gb->get_Seq_by_acc($acc); my $sequence = $seq1->seq; for my $feat ($seq1->get_SeqFeatures){ if ($feat->primary_tag eq 'CDS'){ print $feat->get_tag_values('product'),"\n"; print $feat->get_tag_values('gene'),"\n"; my $start = $feat->start; my $len = $feat->length; my $cds = substr($sequence,$start-1,$len); use Text::Wrap; # put this at top $Text::Wrap::columns = 71; print wrap('', '', $cds); } }
    poj

      Hi poj

      Thank you for your valuable suggestions. The code works nicely and my problem is now solved.

      With kind regards,

      supriyoch_2008

Re: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ?
by Lotus1 (Vicar) on Jan 23, 2017 at 20:22 UTC
    In fact, there is a hyperlink at the word CDS in the particular GenBank page of NCBI and when the link is clicked it shows the cds of the gene. Is it possible to get the cds sequence directly using a perl script....

    It sounds to me like you have a website that provides the 'cds' already prepared and you are asking us if there is a way to download the text from the webpage. If that is the case then yes you can do that. Given an URL there are several ways to download a page. On *nix platforms you can use wget with the perl system function or with qx(). From Perl directly you can use the LWP::Simple module.

    But it sounds like you need to give the number or name of the gene to get to the correct page before the CDS link appears. If that is the case you might be able to use WWW::Mechanize. It doesn't support JavaScript, hence the word 'might'. Are these pages open to the public? What is the CDS format? I didn't see it in a quick web search.

    Edit: I found a few solutions here that seem to do part of what you are requesting.

      Hi Lotus1

      Thank you very much for your suggestions. Regards.

      supriyoch_2008

Re: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ?
by ww (Archbishop) on Jan 23, 2017 at 16:33 UTC

    I'm not at all sure anything you'll find on CPAN will be helpful, but there are a number of packages which seem related to your issue using the search terms: site:CPAN.org  cds GenBank: My naive Google search results lists 153 possibilities.

      Hi ww

      Thank you for informing me about 153 possibilities. I shall look into those to find the solution.

      Regards

      supriyoch_2008

Re: Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ?
by crusty_collins (Friar) on Jan 23, 2017 at 19:20 UTC
    There are some modules on CPAN for this.

    Just search google for perl coding sequence

    "We can't all be happy, we can't all be rich, we can't all be lucky – and it would be so much less fun if we were. There must be the dark background to show up the bright colours." Jean Rhys (1890-1979)

      Hi crusty_collins

      Thank you for your suggestions regarding modules on CPAN. I liked the quote of Jean Rhys that you cited.

      With deep regards,

      supriyoch_2008

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1180157]
Approved by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (2)
As of 2024-04-26 04:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found