How to retrieve the intergenic sequences and the introns from GenBank page of NCBI?

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

I would like to retrieve the intergenic sequences (sequences between genes) and the intron sequences (noncoding regions within genes) of an organism separately, using an accession number from the GenBank page (Nucleotide database) of NCBI. I searched Google for a Perl script that does this but could not find one, although I did come across a few Perl modules that can retrieve the complete nucleotide sequence for an accession number, e.g. NC_027152.1, Lens culinaris cultivar Northfield chloroplast, complete genome. This record has a sequence of 122967 bases with detailed annotations and linked gene positions, e.g. complement(313..1374) for gene psbA and complement(join(1691..1719,4200..4236)) for gene trnK-UUU. The word "complement" means that the feature lies on the complementary strand. Bases 1..312 and 1375..1690 are intergenic sequences, whereas bases 1720..4199 form the intron (intervening sequence) of the trnK-UUU gene.
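The coordinate arithmetic described above can be sketched in plain Perl, without fetching anything. This is a minimal sketch, not a complete GenBank location parser: the sub names parse_location, introns and intergenic are hypothetical, the location strings are copied from the excerpt of NC_027152.1 in this post, and only simple N..M spans inside join(...) and complement(...) are handled. The upper limit of 8385 is an assumption made for the demonstration (the end of the last gene shown in the excerpt), not the genome length.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a GenBank location string into [start, end] pairs sorted by start.
# Assumption: only plain N..M spans occur (no <, >, ^ or remote references).
sub parse_location {
    my ($loc) = @_;
    my @spans;
    while ($loc =~ /(\d+)\.\.(\d+)/g) {
        push @spans, [ $1, $2 ];
    }
    my @sorted = sort { $a->[0] <=> $b->[0] } @spans;
    return @sorted;
}

# Introns are the gaps between consecutive exons of one joined feature.
sub introns {
    my @exons = parse_location(shift);
    my @gaps;
    for my $i (1 .. $#exons) {
        push @gaps, [ $exons[ $i - 1 ][1] + 1, $exons[$i][0] - 1 ];
    }
    return @gaps;
}

# Intergenic regions are whatever is left between 1 and $limit after
# overlapping or nested gene spans have been merged.
sub intergenic {
    my ($limit, @gene_locations) = @_;
    my @genes = sort { $a->[0] <=> $b->[0] }
                map  { parse_location($_) } @gene_locations;
    my @gaps;
    my $next_free = 1;    # first base not yet covered by any gene
    for my $g (@genes) {
        push @gaps, [ $next_free, $g->[0] - 1 ] if $g->[0] > $next_free;
        $next_free = $g->[1] + 1 if $g->[1] + 1 > $next_free;
    }
    push @gaps, [ $next_free, $limit ] if $next_free <= $limit;
    return @gaps;
}

# "gene" feature locations copied from the excerpt in this post
my @gene_locs = (
    'complement(313..1374)',     # psbA
    'complement(1691..4236)',    # trnK-UUU
    'complement(1967..3490)',    # matK, nested inside the trnK-UUU gene
    'complement(4722..6149)',    # rbcL
    '6916..8385',                # atpB
);

print "Introns of trnK-UUU: ",
    join(', ', map { "$_->[0]..$_->[1]" }
        introns('complement(join(1691..1719,4200..4236))')), "\n";
print "Intergenic regions:  ",
    join(', ', map { "$_->[0]..$_->[1]" } intergenic(8385, @gene_locs)), "\n";
```

For these inputs the sketch reports the intron 1720..4199 and the intergenic regions 1..312, 1375..1690, 4237..4721 and 6150..6915. Because the genome is circular, the region before the first gene and the region after the last gene would need to be joined across the origin when the full feature table is used.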

Extracting each region by hand with "Change region shown" in the right panel of the GenBank page is tedious and time-consuming. A Perl script that extracts the intergenic sequences and the intron sequences of genes would save a great deal of time in data collection. I welcome suggestions and guidance from Perl experts on how to retrieve the intergenic sequences and the intron sequences separately.

I have written a script that retrieves the complete sequence of 122967 nucleotides. The code is given further down, after an excerpt of the GenBank page.

Part of the GenBank page looks like this (it is not the complete GenBank record for accession no. NC_027152.1):

 
Lens culinaris cultivar Northfield chloroplast, complete genome
NCBI Reference Sequence: NC_027152.1
LOCUS       NC_027152             122967 bp    DNA     circular PLN 03-JUN-2015
DEFINITION  Lens culinaris cultivar Northfield chloroplast, complete genome.
ACCESSION   NC_027152
VERSION     NC_027152.1
DBLINK      BioProject: PRJNA285561
KEYWORDS    RefSeq.
SOURCE      chloroplast Lens culinaris (lentil)
  ORGANISM  Lens culinaris
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
            Pentapetalae; rosids; fabids; Fabales; Fabaceae; Papilionoideae;
            Fabeae; Lens.
REFERENCE   1  (bases 1 to 122967)
  AUTHORS   Sveinsson,S. and Cronk,Q.
  TITLE     Delimitation of conserved gene clusters in the scrambled plastomes
            of the IRLC legumes (Fabaceae: Trifolieae, Fabeae)
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 122967)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (02-JUN-2015) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 122967)
  AUTHORS   Sveinsson,S. and Cronk,Q.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-MAY-2014) Botany, University of British Columbia,
            3529-6270 University Blvd, Vancouver, British Columbia V6T1Z4,
            Canada
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence is identical to KJ850239.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..122967
                     /organism="Lens culinaris"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /cultivar="Northfield"
                     /db_xref="taxon:3864"
     gene            complement(313..1374)
                     /gene="psbA"
                     /locus_tag="ABY07_gp001"
                     /db_xref="GeneID:24418176"
     CDS             complement(313..1374)
                     /gene="psbA"
                     /locus_tag="ABY07_gp001"
                     /codon_start=1
                     /transl_table=11
                     /product="photosystem II protein D1"
                     /protein_id="YP_009141518.1"
                     /db_xref="GeneID:24418176"
                     /translation="MTAILERRDSENLWGRFCNWITSTENRLYIGWFGVLMIPTLLTA
                     TSVFIIAFIAAPPVDIDGIREPVSGSLLYGNNIISGAIIPTSAAIGLHFYPIWEAASV
                     DEWLYNGGPYELIVLHFLLGVACYMGREWELSFRLGMRPWIAVAYSAPVAAATAVFLI
                     YPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHMLGVAGVFGGSLFSAMHGSL
                     VTSSLIRETTENESANEGYRFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLA
                     AWPVVGIWFTALGISTMAFNLNGFNFNQSVVDSQGRVINTWADIINRANLGMEVMHER
                     NAHNFPLDLAAVEAPSING"
     gene            complement(1691..4236)
                     /gene="trnK-UUU"
                     /locus_tag="ABY07_gt001"
                     /db_xref="GeneID:24418184"
     tRNA            complement(join(1691..1719,4200..4236))
                     /gene="trnK-UUU"
                     /locus_tag="ABY07_gt001"
                     /product="tRNA-Lys"
                     /note="anticodon:UUU"
                     /db_xref="GeneID:24418184"
     gene            complement(1967..3490)
                     /gene="matK"
                     /locus_tag="ABY07_gp074"
                     /db_xref="GeneID:24418113"
     CDS             complement(1967..3490)
                     /gene="matK"
                     /locus_tag="ABY07_gp074"
                     /codon_start=1
                     /transl_table=11
                     /product="maturase K"
                     /protein_id="YP_009141519.1"
                     /db_xref="GeneID:24418113"
                     /translation="MKESQVYLERARSRQQHFLYSLIFREYIYGLAYSHNLNRSLFVE
                     NVGYDNKYSLLIVKRLITRMYQQNHLIISANDSNKNSFWGYNNNYYSQIISEGFSIVV
                     EIPFFLQLSSSLEEAEIIKYYKNFRSIHSIFPFLEDKFTYLNYVSDIRIPYPIHLEIL
                     VQILRYWVKDAPFFHLLRLFLCNWNSFITTKNKKSISTFSKINPRFFLFLYNFYVCEY
                     ESIFVFLRNQSSHLPLKSFRVFFERIFFYAKREHLVKLFAKDFLYTLTLTFFKDPNIH
                     YVRYQGKCILASKNAPFLMDKWKHYFIHLWQCFFDVWSQPRTININPLSEHSFKLLGY
                     FSNVRLNRSVVRSQMLQNTFLIEIVIKKIDIIVPILPLIRSLAKAKFCNVLGQPISKP
                     VWADSSDFDIIDRFLRISRNLSHYYKGSSKKKSLYRIKYILRLSCIKTLACKHKSTVR
                     AFLKRSGSEEFLQEFFTEEEEILSLIFPRDSSTLERLSRNRIWYLDILFSNDLVHDE"
     gene            complement(4722..6149)
                     /gene="rbcL"
                     /locus_tag="ABY07_gp073"
                     /db_xref="GeneID:24418112"
     CDS             complement(4722..6149)
                     /gene="rbcL"
                     /locus_tag="ABY07_gp073"
                     /codon_start=1
                     /transl_table=11
                     /product="ribulose 1,5-bisphosphate carboxylase/oxygenase
                     large subunit"
                     /protein_id="YP_009141520.1"
                     /db_xref="GeneID:24418112"
                     /translation="MSPQTETKAKVGFQAGVKDYKLTYYTPEYQTKDTDILAAFRVTP
                     QPGVPPEEAGAAVAAESSTGTWTTVWTDGLTSLDRYKGRCYEIEPVPGEDNQFIAYVA
                     YPLDLFEEGSVTNMFTSIVGNVFGFKALRALRLEDLRIPNAYVKTFQGPPHGIQVERD
                     KLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENVNSQPFMRWRDRF
                     LFCAEAIYKSQAETGEIKGHYLNATAGTCEEMLKRAIFARELGVPIVMHDYLTGGFTA
                     NTTLSHYCRDNGLLLHIHRAMHAVIDRQKNHGMHFRVLAKALRLSGGDHIHAGTVVGK
                     LEGEREITLGFVDLLRDDYIEKDRSRGIYFTQDWVSLPGVIPVASGGIHVWHMPALTE
                     IFGDDSVLQFGGGTLGHPWGNAPGAVANRVALEACVQARNEGRDLAREGNAIIREAGK
                     WSPELAAACEVWKEIKFEFPAMDTL"
     gene            6916..8385
                     /gene="atpB"
                     /locus_tag="ABY07_gp072"
                     /db_xref="GeneID:24418114"
     CDS             6916..8385
                     /gene="atpB"
                     /locus_tag="ABY07_gp072" 

     ..................................... (Many lines omitted here) 

   122821 aaaagcttcg ggtaaatcac gaaagctacc gtaacagctg caacaggagt ctattataaa
   122881 ttattttctc ttttttgttt taatagattc atgggcgaac gacgggaatt gaacccgcgc
   122941 atggtggatt cacaatccac tgccttg
//

My script goes like:

#!/usr/bin/perl
use warnings;
use strict;

use Bio::DB::GenBank;
use Bio::SeqIO;
use Text::Wrap;

my $acc = "NC_027152.1";

# Fetch the record from GenBank by accession number
my $gb       = Bio::DB::GenBank->new;
my $seq1     = $gb->get_Seq_by_acc($acc);
my $sequence = $seq1->seq;

print "\n Complete sequence:\n$sequence\n";

# code for intergenic sequence needed

# code for intron sequence needed

exit;


Replies are listed 'Best First'.
Re: How to retrieve the intergenic sequences and the introns from GenBank page of NCBI? by BrowserUk (Patriarch) on Mar 24, 2018 at 06:56 UTC
Have you seen this page (and the other related chapters) at the GenBank website?

In a nutshell, most of what you can download from the JavaScript-heavy main search results page can also be downloaded directly by adding the selection parameters to the appropriate URL and GETting that URL. That is much easier than automating a JS-driven web page.

I don't know whether there is a URL/parameter set for retrieving the data you are after, but it would be worth your time to take a look. They also provide sample Perl code for doing most of the work. (I used an earlier version for some work a few years ago, but it has all changed since, so I can't be more specific.)
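The link in the reply above did not survive, so the following is only a sketch under the assumption that the page meant is NCBI's E-utilities documentation. It builds an efetch URL that asks for one region of a nucleotide record in FASTA format, using the efetch parameters db, id, rettype, retmode, seq_start, seq_stop and strand. The sub name efetch_url and the --fetch switch are hypothetical; the coordinates are the trnK-UUU intron from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an E-utilities efetch URL for one region of a nucleotide record.
# The values used here contain no characters that need URL escaping.
sub efetch_url {
    my (%arg) = @_;
    my %param = (
        db        => 'nuccore',
        id        => $arg{acc},
        rettype   => 'fasta',
        retmode   => 'text',
        seq_start => $arg{start},
        seq_stop  => $arg{stop},
        strand    => $arg{minus} ? 2 : 1,    # 1 = plus strand, 2 = minus strand
    );
    my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';
    return $base . '?' . join '&', map { "$_=$param{$_}" } sort keys %param;
}

# Intron of trnK-UUU, which is annotated on the complementary strand
my $url = efetch_url(
    acc   => 'NC_027152.1',
    start => 1720,
    stop  => 4199,
    minus => 1,
);

if (@ARGV && $ARGV[0] eq '--fetch') {
    # HTTP::Tiny is a core module; https additionally needs IO::Socket::SSL.
    require HTTP::Tiny;
    my $res = HTTP::Tiny->new->get($url);
    die "Request failed: $res->{status} $res->{reason}\n" unless $res->{success};
    print $res->{content};
}
else {
    print "$url\n";    # without --fetch, only show the URL that would be requested
}
```

Run without arguments, the script only prints the URL, which can be checked in a browser before any automated requests are made.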
Re^2: How to retrieve the intergenic sequences and the introns from GenBank page of NCBI? by supriyoch_2008 (Monk) on Mar 24, 2018 at 10:08 UTC
BrowserUk, thank you very much for your suggestions. I shall go through the page. With regards,
Re: How to retrieve the intergenic sequences and the introns from GenBank page of NCBI? by poj (Abbot) on Mar 24, 2018 at 10:07 UTC
Perhaps you just need to use substr:

#!/usr/bin/perl
use warnings;
use strict;

use Bio::DB::GenBank;
use Bio::SeqIO;

my $acc = "NC_027152.1";

my $gb = Bio::DB::GenBank->new;
my $seq1 = $gb->get_Seq_by_acc($acc);
my $sequence = $seq1->seq;

region($sequence,313,1374);

# Print bases $begin..$end (1-based, inclusive) in GenBank-like layout
sub region
{
  my ($string,$begin,$end) = @_;
  my $length = $end - $begin + 1;
  my $region = lc substr($string,$begin-1,$length);

  my @line;
  my $posn = 1;
  print "Posn $begin - $end length $length\n";
  # split into groups of 10 bases; the last group may be shorter
  while ($region =~ /(.{10}|.{1,9}$)/g){
    push @line,$1;
    if (@line == 6){
      printf "%4d %s\n",$posn,join ' ',@line;
      $posn += @line * 10;
      @line=();
    }
  }
  printf "%4d %s\n",$posn,join ' ',@line if (@line);
}

poj
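The substr approach above returns the bases as they appear on the strand stored in the record. For features annotated as complement(...), such as psbA and trnK-UUU, the extracted region would also have to be reverse-complemented to read in the direction of the gene. A minimal sketch of that step follows; the sub name region_seq and the ten-base toy sequence are hypothetical, and only the unambiguous bases A, C, G and T are translated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return bases $start..$end (1-based, inclusive) of $seq.
# If $complement is true, return the reverse complement of that region.
sub region_seq {
    my ($seq, $start, $end, $complement) = @_;
    my $region = substr($seq, $start - 1, $end - $start + 1);
    if ($complement) {
        $region = reverse $region;          # scalar context reverses the string
        $region =~ tr/ACGTacgt/TGCAtgca/;   # IUPAC ambiguity codes are left as-is
    }
    return $region;
}

my $toy = 'AACCGGTTAC';    # toy sequence, not taken from NC_027152.1

print region_seq($toy, 2, 5),    "\n";    # ACCG
print region_seq($toy, 2, 5, 1), "\n";    # CGGT
```

With the real record, the string returned by $seq1->seq would take the place of the toy sequence, for example region_seq($sequence, 1720, 4199, 1) for the trnK-UUU intron.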

