I would like to retrieve the intergenic sequences (sequences between genes) and the
intron sequences (noncoding regions within genes) of an organism separately, using its accession
number from the GenBank page (Nucleotide database) of NCBI. I searched Google for a Perl script
that does this but could not find one, although I did come across a few Perl modules that
can retrieve the complete nucleotide sequence for an accession number, e.g. NC_027152.1, Lens culinaris
cultivar Northfield chloroplast, complete genome. That record has a sequence of 122967 bases with detailed annotations
and underlined links for genes indicating their positions, e.g. complement(313..1374) for gene psbA and
complement(join(1691..1719,4200..4236)) for gene trnK-UUU. The word "complement" means that the feature lies on the
complementary strand. Bases 1..312 and 1375..1690 are intergenic sequences, whereas bases 1720..4199
are the intron sequence (intervening sequence) of the trnK-UUU gene.
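The coordinate arithmetic in the paragraph above can be sketched in plain Perl, without BioPerl. This is only an illustration under stated assumptions: the helper names parse_location and gaps_between are made up for this example, the parser understands only plain ranges, join(...) and complement(...), and only the five gene locations visible in the NC_027152.1 feature table are used, so nothing beyond the last of them is reported.

```perl
use strict;
use warnings;

# Hypothetical helper: pull every "start..end" pair out of a GenBank
# location string and return the pairs sorted by start position.
# Only plain ranges, join(...) and complement(...) are understood.
sub parse_location {
    my ($location) = @_;
    my @pairs;
    push @pairs, [ $1, $2 ] while $location =~ /(\d+)\.\.(\d+)/g;
    my @sorted = sort { $a->[0] <=> $b->[0] } @pairs;
    return @sorted;
}

# Hypothetical helper: given intervals sorted by start, return the gaps
# between them. Tracking the furthest end seen so far copes with nested
# or overlapping intervals (matK lies inside trnK-UUU).
sub gaps_between {
    my @intervals = @_;
    my ($furthest_end, @gaps);
    for my $iv (@intervals) {
        my ($start, $end) = @$iv;
        if (defined $furthest_end) {
            push @gaps, [ $furthest_end + 1, $start - 1 ]
                if $start > $furthest_end + 1;
            $furthest_end = $end if $end > $furthest_end;
        }
        else {
            $furthest_end = $end;
        }
    }
    return @gaps;
}

# Gene locations copied from the feature table of NC_027152.1.
my %gene_location = (
    'psbA'     => 'complement(313..1374)',
    'trnK-UUU' => 'complement(1691..4236)',
    'matK'     => 'complement(1967..3490)',
    'rbcL'     => 'complement(4722..6149)',
    'atpB'     => '6916..8385',
);

# Introns: gaps between the exons of the trnK-UUU tRNA feature.
my @introns = gaps_between(
    parse_location('complement(join(1691..1719,4200..4236))') );

# Intergenic regions: gaps between gene spans. The [0, 0] sentinel makes
# the stretch from base 1 up to the first gene count as a gap.
my @genes = sort { $a->[0] <=> $b->[0] }
            map  { parse_location($_) } values %gene_location;
my @intergenic = gaps_between( [ 0, 0 ], @genes );

print "intron:     $_->[0]..$_->[1]\n" for @introns;
print "intergenic: $_->[0]..$_->[1]\n" for @intergenic;
```

Run as is, this prints the intron 1720..4199 and the intergenic regions 1..312, 1375..1690, 4237..4721 and 6150..6915. Given the full sequence in a string, each region could then be cut out with substr($sequence, $start - 1, $end - $start + 1). Two caveats: the genome is circular, so 1..312 is really the end of a region that begins after the last gene, and the trnK-UUU intron contains the matK gene.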
Extracting each region by hand with "Change region shown" in the right panel of the GenBank page is tedious and
time-consuming. A Perl script that extracts the intergenic sequences and the intron sequence(s) of genes
would save a lot of time in data collection. I welcome suggestions and guidance from Perl experts on how to retrieve the intergenic sequences and the intron sequences separately.
I have written a script that retrieves the complete sequence of 122967 nucleotides; it is given below, after an excerpt of the GenBank record.
Part of the GenBank page for accession NC_027152.1 looks like this (the record is not shown in full):
Lens culinaris cultivar Northfield chloroplast, complete genome
NCBI Reference Sequence: NC_027152.1
LOCUS NC_027152 122967 bp DNA circular PLN 03-JUN-2015
DEFINITION Lens culinaris cultivar Northfield chloroplast, complete genome.
ACCESSION NC_027152
VERSION NC_027152.1
DBLINK BioProject: PRJNA285561
KEYWORDS RefSeq.
SOURCE chloroplast Lens culinaris (lentil)
ORGANISM Lens culinaris
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
Pentapetalae; rosids; fabids; Fabales; Fabaceae; Papilionoideae;
Fabeae; Lens.
REFERENCE 1 (bases 1 to 122967)
AUTHORS Sveinsson,S. and Cronk,Q.
TITLE Delimitation of conserved gene clusters in the scrambled plastomes
of the IRLC legumes (Fabaceae: Trifolieae, Fabeae)
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 122967)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (02-JUN-2015) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 3 (bases 1 to 122967)
AUTHORS Sveinsson,S. and Cronk,Q.
TITLE Direct Submission
JOURNAL Submitted (16-MAY-2014) Botany, University of British Columbia,
3529-6270 University Blvd, Vancouver, British Columbia V6T 1Z4,
Canada
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence is identical to KJ850239.
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..122967
/organism="Lens culinaris"
/organelle="plastid:chloroplast"
/mol_type="genomic DNA"
/cultivar="Northfield"
/db_xref="taxon:3864"
gene complement(313..1374)
/gene="psbA"
/locus_tag="ABY07_gp001"
/db_xref="GeneID:24418176"
CDS complement(313..1374)
/gene="psbA"
/locus_tag="ABY07_gp001"
/codon_start=1
/transl_table=11
/product="photosystem II protein D1"
/protein_id="YP_009141518.1"
/db_xref="GeneID:24418176"
/translation="MTAILERRDSENLWGRFCNWITSTENRLYIGWFGVLMIPTLLTA
TSVFIIAFIAAPPVDIDGIREPVSGSLLYGNNIISGAIIPTSAAIGLHFYPIWEAASV
DEWLYNGGPYELIVLHFLLGVACYMGREWELSFRLGMRPWIAVAYSAPVAAATAVFLI
YPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHMLGVAGVFGGSLFSAMHGSL
VTSSLIRETTENESANEGYRFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLA
AWPVVGIWFTALGISTMAFNLNGFNFNQSVVDSQGRVINTWADIINRANLGMEVMHER
NAHNFPLDLAAVEAPSING"
gene complement(1691..4236)
/gene="trnK-UUU"
/locus_tag="ABY07_gt001"
/db_xref="GeneID:24418184"
tRNA complement(join(1691..1719,4200..4236))
/gene="trnK-UUU"
/locus_tag="ABY07_gt001"
/product="tRNA-Lys"
/note="anticodon:UUU"
/db_xref="GeneID:24418184"
gene complement(1967..3490)
/gene="matK"
/locus_tag="ABY07_gp074"
/db_xref="GeneID:24418113"
CDS complement(1967..3490)
/gene="matK"
/locus_tag="ABY07_gp074"
/codon_start=1
/transl_table=11
/product="maturase K"
/protein_id="YP_009141519.1"
/db_xref="GeneID:24418113"
/translation="MKESQVYLERARSRQQHFLYSLIFREYIYGLAYSHNLNRSLFVE
NVGYDNKYSLLIVKRLITRMYQQNHLIISANDSNKNSFWGYNNNYYSQIISEGFSIVV
EIPFFLQLSSSLEEAEIIKYYKNFRSIHSIFPFLEDKFTYLNYVSDIRIPYPIHLEIL
VQILRYWVKDAPFFHLLRLFLCNWNSFITTKNKKSISTFSKINPRFFLFLYNFYVCEY
ESIFVFLRNQSSHLPLKSFRVFFERIFFYAKREHLVKLFAKDFLYTLTLTFFKDPNIH
YVRYQGKCILASKNAPFLMDKWKHYFIHLWQCFFDVWSQPRTININPLSEHSFKLLGY
FSNVRLNRSVVRSQMLQNTFLIEIVIKKIDIIVPILPLIRSLAKAKFCNVLGQPISKP
VWADSSDFDIIDRFLRISRNLSHYYKGSSKKKSLYRIKYILRLSCIKTLACKHKSTVR
AFLKRSGSEEFLQEFFTEEEEILSLIFPRDSSTLERLSRNRIWYLDILFSNDLVHDE"
gene complement(4722..6149)
/gene="rbcL"
/locus_tag="ABY07_gp073"
/db_xref="GeneID:24418112"
CDS complement(4722..6149)
/gene="rbcL"
/locus_tag="ABY07_gp073"
/codon_start=1
/transl_table=11
/product="ribulose 1,5-bisphosphate carboxylase/oxygenase
large subunit"
/protein_id="YP_009141520.1"
/db_xref="GeneID:24418112"
/translation="MSPQTETKAKVGFQAGVKDYKLTYYTPEYQTKDTDILAAFRVTP
QPGVPPEEAGAAVAAESSTGTWTTVWTDGLTSLDRYKGRCYEIEPVPGEDNQFIAYVA
YPLDLFEEGSVTNMFTSIVGNVFGFKALRALRLEDLRIPNAYVKTFQGPPHGIQVERD
KLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENVNSQPFMRWRDRF
LFCAEAIYKSQAETGEIKGHYLNATAGTCEEMLKRAIFARELGVPIVMHDYLTGGFTA
NTTLSHYCRDNGLLLHIHRAMHAVIDRQKNHGMHFRVLAKALRLSGGDHIHAGTVVGK
LEGEREITLGFVDLLRDDYIEKDRSRGIYFTQDWVSLPGVIPVASGGIHVWHMPALTE
IFGDDSVLQFGGGTLGHPWGNAPGAVANRVALEACVQARNEGRDLAREGNAIIREAGK
WSPELAAACEVWKEIKFEFPAMDTL"
gene 6916..8385
/gene="atpB"
/locus_tag="ABY07_gp072"
/db_xref="GeneID:24418114"
CDS 6916..8385
/gene="atpB"
/locus_tag="ABY07_gp072"
..................................... (Many lines omitted here)
122821 aaaagcttcg ggtaaatcac gaaagctacc gtaacagctg caacaggagt ctattataaa
122881 ttattttctc ttttttgttt taatagattc atgggcgaac gacgggaatt gaacccgcgc
122941 atggtggatt cacaatccac tgccttg
//
My script so far:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::GenBank;

# Accession of the Lens culinaris chloroplast genome
my $acc = "NC_027152.1";

# Fetch the annotated record from GenBank (needs network access)
my $gb  = Bio::DB::GenBank->new;
my $seq = $gb->get_Seq_by_acc($acc)
    or die "Could not retrieve $acc\n";

# Print the complete nucleotide sequence
print "\nComplete sequence:\n", $seq->seq, "\n";

# code for intergenic sequence needed
# code for intron sequence needed
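One way the two "code ... needed" placeholders in the script might be filled in is sketched below using the feature objects that BioPerl attaches to the fetched record. It is a starting point under assumptions, not a tested solution for every record: the sub names intergenic_regions and intron_regions and the file name regions.pl are made up for this example, every "gene" feature is treated as a simple start..end span, and any feature whose location has two or more parts is treated as exons separated by introns. Sequences are reported as stored (forward strand), trans-spliced genes and features that cross the origin of the circular genome are not handled, and a gene annotated by several multi-part features would be reported more than once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: report regions not covered by any "gene" feature.
# Coordinates and sequences are on the forward strand as stored.
sub intergenic_regions {
    my ($seq) = @_;
    my @genes = sort { $a->[0] <=> $b->[0] }
                map  { [ $_->start, $_->end ] }
                grep { $_->primary_tag eq 'gene' } $seq->get_SeqFeatures;
    my ($next_free, @regions) = (1);
    for my $gene (@genes) {
        my ($start, $end) = @$gene;
        push @regions, [ $next_free, $start - 1,
                         $seq->subseq($next_free, $start - 1) ]
            if $start > $next_free;
        $next_free = $end + 1 if $end >= $next_free;   # genes may overlap
    }
    push @regions, [ $next_free, $seq->length,
                     $seq->subseq($next_free, $seq->length) ]
        if $next_free <= $seq->length;
    return @regions;
}

# Sketch only: for every feature with a multi-part location (e.g. a tRNA
# or CDS written as join(...)), report the gaps between its parts.
sub intron_regions {
    my ($seq) = @_;
    my @introns;
    for my $feat ($seq->get_SeqFeatures) {
        next if $feat->primary_tag eq 'gene' || $feat->primary_tag eq 'source';
        my @exons = sort { $a->start <=> $b->start }
                    $feat->location->each_Location;
        next if @exons < 2;
        my ($name) = $feat->has_tag('gene')
                   ? $feat->get_tag_values('gene') : ('unnamed');
        for my $i (1 .. $#exons) {
            my ($from, $to) = ($exons[$i - 1]->end + 1, $exons[$i]->start - 1);
            push @introns, [ $name, $from, $to, $seq->subseq($from, $to) ]
                if $from <= $to;
        }
    }
    return @introns;
}

# Runs only when an accession is given on the command line, e.g.
#   perl regions.pl NC_027152.1      (regions.pl is a hypothetical name)
if (my $acc = shift @ARGV) {
    require Bio::DB::GenBank;        # needs BioPerl and network access
    my $seq = Bio::DB::GenBank->new->get_Seq_by_acc($acc)
        or die "Could not retrieve $acc\n";
    printf ">intergenic_%d..%d\n%s\n", @$_ for intergenic_regions($seq);
    printf ">intron_%s_%d..%d\n%s\n",  @$_ for intron_regions($seq);
}
```

Both subs take the sequence object returned by get_Seq_by_acc and print FASTA-style records. For genes on the complementary strand, the extracted pieces would still need to be reverse-complemented if they are wanted in the orientation of the gene.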