PerlMonks  

How to retrieve the intergenic sequences and the introns from GenBank page of NCBI?

by supriyoch_2008 (Monk)
on Mar 24, 2018 at 04:46 UTC

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

I would like to retrieve the intergenic sequences (the sequences between genes) and the intron sequences (the noncoding regions within genes) of an organism separately, using an accession number from the GenBank page (Nucleotide database) of NCBI. I searched Google for a Perl script to do this but could not find one, although I did come across a few Perl modules that can retrieve the complete nucleotide sequence for an accession number, e.g. NC_027152.1, Lens culinaris cultivar Northfield chloroplast, complete genome. This record has a sequence of 122967 bases with detailed annotations, and underlined links for genes that give their positions, e.g. complement(313..1374) for gene psbA and complement(join(1691..1719,4200..4236)) for gene trnK-UUU. The word "complement" indicates that the feature lies on the complementary strand. Bases 1..312 and 1375..1690 are intergenic sequences, while bases 1720..4199 form the intron (intervening sequence) of the trnK-UUU gene.

Extracting a specific region using "Change region shown" in the right panel of the GenBank page is very tedious and time-consuming. A Perl script that extracts the intergenic sequences and the intron sequence(s) of genes would certainly save time during data collection. I welcome suggestions and guidance from Perl experts on how to retrieve the intergenic sequences and the intron sequences separately.

I have written a script that retrieves the complete sequence of 122967 nucleotides; it is given below, after an excerpt of the GenBank page.

Part of the GenBank page looks like this (it is not the complete GenBank record for accession NC_027152.1):

Lens culinaris cultivar Northfield chloroplast, complete genome
NCBI Reference Sequence: NC_027152.1

LOCUS       NC_027152             122967 bp    DNA     circular PLN 03-JUN-2015
DEFINITION  Lens culinaris cultivar Northfield chloroplast, complete genome.
ACCESSION   NC_027152
VERSION     NC_027152.1
DBLINK      BioProject: PRJNA285561
KEYWORDS    RefSeq.
SOURCE      chloroplast Lens culinaris (lentil)
  ORGANISM  Lens culinaris
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
            Pentapetalae; rosids; fabids; Fabales; Fabaceae; Papilionoideae;
            Fabeae; Lens.
REFERENCE   1  (bases 1 to 122967)
  AUTHORS   Sveinsson,S. and Cronk,Q.
  TITLE     Delimitation of conserved gene clusters in the scrambled plastomes
            of the IRLC legumes (Fabaceae: Trifolieae, Fabeae)
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 122967)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (02-JUN-2015) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 122967)
  AUTHORS   Sveinsson,S. and Cronk,Q.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-MAY-2014) Botany, University of British Columbia,
            3529-6270 University Blvd, Vancouver, British Columbia V6T1Z4,
            Canada
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence is identical to KJ850239.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..122967
                     /organism="Lens culinaris"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /cultivar="Northfield"
                     /db_xref="taxon:3864"
     gene            complement(313..1374)
                     /gene="psbA"
                     /locus_tag="ABY07_gp001"
                     /db_xref="GeneID:24418176"
     CDS             complement(313..1374)
                     /gene="psbA"
                     /locus_tag="ABY07_gp001"
                     /codon_start=1
                     /transl_table=11
                     /product="photosystem II protein D1"
                     /protein_id="YP_009141518.1"
                     /db_xref="GeneID:24418176"
                     /translation="MTAILERRDSENLWGRFCNWITSTENRLYIGWFGVLMIPTLLTA
                     TSVFIIAFIAAPPVDIDGIREPVSGSLLYGNNIISGAIIPTSAAIGLHFYPIWEAASV
                     DEWLYNGGPYELIVLHFLLGVACYMGREWELSFRLGMRPWIAVAYSAPVAAATAVFLI
                     YPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHMLGVAGVFGGSLFSAMHGSL
                     VTSSLIRETTENESANEGYRFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLA
                     AWPVVGIWFTALGISTMAFNLNGFNFNQSVVDSQGRVINTWADIINRANLGMEVMHER
                     NAHNFPLDLAAVEAPSING"
     gene            complement(1691..4236)
                     /gene="trnK-UUU"
                     /locus_tag="ABY07_gt001"
                     /db_xref="GeneID:24418184"
     tRNA            complement(join(1691..1719,4200..4236))
                     /gene="trnK-UUU"
                     /locus_tag="ABY07_gt001"
                     /product="tRNA-Lys"
                     /note="anticodon:UUU"
                     /db_xref="GeneID:24418184"
     gene            complement(1967..3490)
                     /gene="matK"
                     /locus_tag="ABY07_gp074"
                     /db_xref="GeneID:24418113"
     CDS             complement(1967..3490)
                     /gene="matK"
                     /locus_tag="ABY07_gp074"
                     /codon_start=1
                     /transl_table=11
                     /product="maturase K"
                     /protein_id="YP_009141519.1"
                     /db_xref="GeneID:24418113"
                     /translation="MKESQVYLERARSRQQHFLYSLIFREYIYGLAYSHNLNRSLFVE
                     NVGYDNKYSLLIVKRLITRMYQQNHLIISANDSNKNSFWGYNNNYYSQIISEGFSIVV
                     EIPFFLQLSSSLEEAEIIKYYKNFRSIHSIFPFLEDKFTYLNYVSDIRIPYPIHLEIL
                     VQILRYWVKDAPFFHLLRLFLCNWNSFITTKNKKSISTFSKINPRFFLFLYNFYVCEY
                     ESIFVFLRNQSSHLPLKSFRVFFERIFFYAKREHLVKLFAKDFLYTLTLTFFKDPNIH
                     YVRYQGKCILASKNAPFLMDKWKHYFIHLWQCFFDVWSQPRTININPLSEHSFKLLGY
                     FSNVRLNRSVVRSQMLQNTFLIEIVIKKIDIIVPILPLIRSLAKAKFCNVLGQPISKP
                     VWADSSDFDIIDRFLRISRNLSHYYKGSSKKKSLYRIKYILRLSCIKTLACKHKSTVR
                     AFLKRSGSEEFLQEFFTEEEEILSLIFPRDSSTLERLSRNRIWYLDILFSNDLVHDE"
     gene            complement(4722..6149)
                     /gene="rbcL"
                     /locus_tag="ABY07_gp073"
                     /db_xref="GeneID:24418112"
     CDS             complement(4722..6149)
                     /gene="rbcL"
                     /locus_tag="ABY07_gp073"
                     /codon_start=1
                     /transl_table=11
                     /product="ribulose 1,5-bisphosphate
                     carboxylase/oxygenase large subunit"
                     /protein_id="YP_009141520.1"
                     /db_xref="GeneID:24418112"
                     /translation="MSPQTETKAKVGFQAGVKDYKLTYYTPEYQTKDTDILAAFRVTP
                     QPGVPPEEAGAAVAAESSTGTWTTVWTDGLTSLDRYKGRCYEIEPVPGEDNQFIAYVA
                     YPLDLFEEGSVTNMFTSIVGNVFGFKALRALRLEDLRIPNAYVKTFQGPPHGIQVERD
                     KLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENVNSQPFMRWRDRF
                     LFCAEAIYKSQAETGEIKGHYLNATAGTCEEMLKRAIFARELGVPIVMHDYLTGGFTA
                     NTTLSHYCRDNGLLLHIHRAMHAVIDRQKNHGMHFRVLAKALRLSGGDHIHAGTVVGK
                     LEGEREITLGFVDLLRDDYIEKDRSRGIYFTQDWVSLPGVIPVASGGIHVWHMPALTE
                     IFGDDSVLQFGGGTLGHPWGNAPGAVANRVALEACVQARNEGRDLAREGNAIIREAGK
                     WSPELAAACEVWKEIKFEFPAMDTL"
     gene            6916..8385
                     /gene="atpB"
                     /locus_tag="ABY07_gp072"
                     /db_xref="GeneID:24418114"
     CDS             6916..8385
                     /gene="atpB"
                     /locus_tag="ABY07_gp072"
..................................... (Many lines omitted here)
   122821 aaaagcttcg ggtaaatcac gaaagctacc gtaacagctg caacaggagt ctattataaa
   122881 ttattttctc ttttttgttt taatagattc atgggcgaac gacgggaatt gaacccgcgc
   122941 atggtggatt cacaatccac tgccttg
//

My script is as follows:
#!/usr/bin/perl
use warnings;
use strict;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Text::Wrap;

my $acc = "NC_027152.1";
my $gb  = Bio::DB::GenBank->new;

# Fetch the full record from GenBank and take its raw sequence string
my $seq1     = $gb->get_Seq_by_acc($acc);
my $sequence = $seq1->seq;
print "\n Complete sequence: $sequence\n";

# code for intergenic sequence needed

# code for intron sequence needed

exit;
#################

Replies are listed 'Best First'.
Re: How to retrieve the intergenic sequences and the introns from GenBank page of NCBI?
by BrowserUk (Patriarch) on Mar 24, 2018 at 06:56 UTC

    Have you seen this page (and the other related chapters) at the GenBank website?

    In a nutshell, most of what you can download from the JavaScript-heavy main search results page can also be downloaded directly by adding the selection parameters to the appropriate URL and GETting that URL.

    It's much easier than automating a JS-driven webpage.

    I don't know if there is a url/parameter set for retrieving the stuff you are after, but it would be worth your time to take a look. They also provide sample Perl code for doing most of the work.

    (I used an earlier version for some stuff a few years ago, but it has all changed since then, so I can't be more specific.)
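
One URL-based route that does exist is the efetch endpoint of NCBI's E-utilities, which accepts seq_start, seq_stop and strand parameters for nucleotide records. Whether that is the page meant above is an assumption. The sketch below builds such a URL and fetches it with the core module HTTP::Tiny (HTTPS support additionally needs IO::Socket::SSL installed). The helper name region_url and the command-line handling are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;

# region_url() is a hypothetical helper, not part of any NCBI library.
# It builds an efetch URL for a sub-range of a nucleotide record.
sub region_url {
    my ($acc, $start, $stop, $minus_strand) = @_;
    my $url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
            . "?db=nuccore&id=$acc&rettype=fasta&retmode=text"
            . "&seq_start=$start&seq_stop=$stop";
    $url .= '&strand=2' if $minus_strand;    # 2 selects the minus strand
    return $url;
}

# Only touch the network when accession, start and stop are given, e.g.
#   perl fetch_region.pl NC_027152.1 1720 4199 minus
if (@ARGV >= 3) {
    my ($acc, $start, $stop, $strand) = @ARGV;
    my $url = region_url($acc, $start, $stop, ($strand // '') eq 'minus');
    my $res = HTTP::Tiny->new->get($url);
    die "Request failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    print $res->{content};                   # FASTA text of the region
}
```

This fetches one region per request, so the coordinates of each intron or intergenic stretch still have to be worked out first.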


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

      BrowserUk

      Thank you very much for your suggestions. I shall go through the page.

      With regards,

Re: How to retrieve the intergenic sequences and the introns from GenBank page of NCBI?
by poj (Abbot) on Mar 24, 2018 at 10:07 UTC

    Perhaps you just need to use substr

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Bio::DB::GenBank;
    use Bio::SeqIO;

    my $acc = "NC_027152.1";
    my $gb  = Bio::DB::GenBank->new;
    my $seq1     = $gb->get_Seq_by_acc($acc);
    my $sequence = $seq1->seq;

    region($sequence, 313, 1374);

    # Print bases $begin..$end (1-based, inclusive) in GenBank-style
    # rows of six 10-base blocks, numbered relative to the region start
    sub region {
        my ($string, $begin, $end) = @_;
        my $length = $end - $begin + 1;
        my $region = lc substr($string, $begin - 1, $length);
        my @line;
        my $posn = 1;
        print "Posn $begin - $end length $length\n";
        while ($region =~ /(.{10}|.{1,9}$)/g) {
            push @line, $1;
            if (@line == 6) {
                printf "%4d %s\n", $posn, join ' ', @line;
                $posn += @line * 10;
                @line = ();
            }
        }
        printf "%4d %s\n", $posn, join ' ', @line if (@line);
    }
    poj
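
To go from single regions to all introns and intergenic stretches, the coordinates can be derived from the feature locations before calling substr. The following is a minimal sketch using only the gene and tRNA positions quoted in the question. The helper names (introns, intergenic, slice, revcomp) are made up here. The cut-off at base 8385 is just the end of the last gene shown in the excerpt, and the circularity of the genome is ignored. How the coordinates are obtained (by hand from the FEATURES table or from a parser) is left open.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Gaps between consecutive exons of one joined feature, as [from, to] pairs
sub introns {
    my @exons = sort { $a->[0] <=> $b->[0] } @_;
    my @gaps;
    for my $i (1 .. $#exons) {
        my ($from, $to) = ($exons[$i - 1][1] + 1, $exons[$i][0] - 1);
        push @gaps, [$from, $to] if $from <= $to;
    }
    return @gaps;
}

# Stretches of 1..$len not covered by any gene; copes with nested genes
sub intergenic {
    my ($len, @genes) = @_;
    my @gaps;
    my $next = 1;                       # first base not yet covered
    for my $g (sort { $a->[0] <=> $b->[0] } @genes) {
        push @gaps, [$next, $g->[0] - 1] if $g->[0] > $next;
        $next = $g->[1] + 1 if $g->[1] >= $next;
    }
    push @gaps, [$next, $len] if $next <= $len;
    return @gaps;
}

# 1-based inclusive slice of the forward strand
sub slice {
    my ($seq, $from, $to) = @_;
    return substr($seq, $from - 1, $to - $from + 1);
}

# Reverse complement, for features annotated as complement(...)
sub revcomp {
    my $rc = scalar reverse $_[0];
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Gene spans from the excerpt: psbA, trnK-UUU, matK, rbcL, atpB
my @genes = ([313, 1374], [1691, 4236], [1967, 3490],
             [4722, 6149], [6916, 8385]);
my @trnK_exons = ([1691, 1719], [4200, 4236]);

print "intron     $_->[0]..$_->[1]\n" for introns(@trnK_exons);
print "intergenic $_->[0]..$_->[1]\n" for intergenic(8385, @genes);
```

With the full sequence in $sequence, slice($sequence, 1720, 4199) gives the forward-strand bases of the trnK-UUU intron, and revcomp() of that gives it in the orientation of the gene. Note that matK lies inside that intron, so "intron" here means only the gap between the tRNA exons.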
