supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:
I would like to retrieve the intergenic sequences (sequences between genes) and the intron sequences (noncoding regions within genes) of an organism separately, using an accession number from the GenBank page (Nucleotide database) of NCBI. I searched Google for a Perl script to do this but could not find one, although I came across a few Perl modules that can retrieve the complete nucleotide sequence for an accession number, e.g. NC_027152.1, Lens culinaris cultivar Northfield chloroplast, complete genome. This record has a sequence of 122967 bases with detailed annotations and underlined links for genes indicating their positions, e.g. complement(313..1374) for gene psbA and complement(join(1691..1719,4200..4236)) for gene trnK-UUU. The word "complement" stands for the complementary sequence. The bases 1..312 and 1375..1690 in the complementary sequence are intergenic sequences, whereas the bases 1720..4199 are the intron (intervening sequence) of the trnK-UUU gene.
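The coordinate arithmetic described above can be sketched in plain Perl. This is only an illustration under stated assumptions: the gene spans are hard-coded from the excerpt quoted in the question (the real record has many more genes), the sub names `intergenic_regions` and `introns` are hypothetical, and the stretch that wraps around the end of the circular genome is ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Gene spans (1-based, inclusive) copied from the excerpt in the question.
# Assumption: only these four genes are listed; the full record has more.
my @genes = (
    { name => 'psbA',     start => 313,  end => 1374 },
    { name => 'trnK-UUU', start => 1691, end => 4236 },
    { name => 'matK',     start => 1967, end => 3490 },    # lies inside trnK-UUU
    { name => 'rbcL',     start => 4722, end => 6149 },
);

# Return [start, end] pairs for stretches not covered by any gene span,
# from position 1 up to the end of the last gene. Nested or overlapping
# genes are handled by tracking the highest position covered so far.
sub intergenic_regions {
    my @sorted = sort { $a->{start} <=> $b->{start} } @_;
    my @gaps;
    my $covered_to = 0;
    for my $g (@sorted) {
        push @gaps, [ $covered_to + 1, $g->{start} - 1 ]
            if $g->{start} > $covered_to + 1;
        $covered_to = $g->{end} if $g->{end} > $covered_to;
    }
    return @gaps;
}

# Return [start, end] pairs for the gaps between consecutive exon segments,
# i.e. the pieces of a join(...) location.
sub introns {
    my @exons = sort { $a->[0] <=> $b->[0] } @_;
    my @introns;
    for my $i ( 1 .. $#exons ) {
        push @introns, [ $exons[ $i - 1 ][1] + 1, $exons[$i][0] - 1 ];
    }
    return @introns;
}

print "intergenic: $_->[0]..$_->[1]\n" for intergenic_regions(@genes);
print "intron:     $_->[0]..$_->[1]\n"
    for introns( [ 1691, 1719 ], [ 4200, 4236 ] );
```

For the four genes listed, this reports the intergenic regions 1..312, 1375..1690 and 4237..4721, and the trnK-UUU intron 1720..4199, which matches the positions worked out by hand above.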
Extracting a specific region using "Change region shown" on the right panel of the GenBank page is a very tedious and time-consuming process. A Perl script that extracts the intergenic sequences and the intron sequence(s) of genes would certainly save time in data collection. I welcome suggestions and guidance from Perl experts on retrieving the intergenic sequences and the intron sequences separately.
I have written a script that can retrieve the complete sequence of 122967 nucleotides. The code is given below, after the GenBank excerpt.
The GenBank page partly looks like this (it is not the complete GenBank information for Acc. No. NC_027152.1):
    Lens culinaris cultivar Northfield chloroplast, complete genome
    NCBI Reference Sequence: NC_027152.1
    FASTA Graphics

    LOCUS       NC_027152             122967 bp    DNA     circular PLN 03-JUN-2015
    DEFINITION  Lens culinaris cultivar Northfield chloroplast, complete genome.
    ACCESSION   NC_027152
    VERSION     NC_027152.1
    DBLINK      BioProject: PRJNA285561
    KEYWORDS    RefSeq.
    SOURCE      chloroplast Lens culinaris (lentil)
      ORGANISM  Lens culinaris
                Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
                Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
                Pentapetalae; rosids; fabids; Fabales; Fabaceae; Papilionoideae;
                Fabeae; Lens.
    REFERENCE   1  (bases 1 to 122967)
      AUTHORS   Sveinsson,S. and Cronk,Q.
      TITLE     Delimitation of conserved gene clusters in the scrambled plastomes
                of the IRLC legumes (Fabaceae: Trifolieae, Fabeae)
      JOURNAL   Unpublished
    REFERENCE   2  (bases 1 to 122967)
      CONSRTM   NCBI Genome Project
      TITLE     Direct Submission
      JOURNAL   Submitted (02-JUN-2015) National Center for Biotechnology
                Information, NIH, Bethesda, MD 20894, USA
    REFERENCE   3  (bases 1 to 122967)
      AUTHORS   Sveinsson,S. and Cronk,Q.
      TITLE     Direct Submission
      JOURNAL   Submitted (16-MAY-2014) Botany, University of British Columbia,
                3529-6270 University Blvd, Vancouver, British Columbia V6T 1Z4,
                Canada
    COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
                NCBI review. The reference sequence is identical to KJ850239.
                COMPLETENESS: full length.
    FEATURES             Location/Qualifiers
         source          1..122967
                         /organism="Lens culinaris"
                         /organelle="plastid:chloroplast"
                         /mol_type="genomic DNA"
                         /cultivar="Northfield"
                         /db_xref="taxon:3864"
         gene            complement(313..1374)
                         /gene="psbA"
                         /locus_tag="ABY07_gp001"
                         /db_xref="GeneID:24418176"
         CDS             complement(313..1374)
                         /gene="psbA"
                         /locus_tag="ABY07_gp001"
                         /codon_start=1
                         /transl_table=11
                         /product="photosystem II protein D1"
                         /protein_id="YP_009141518.1"
                         /db_xref="GeneID:24418176"
                         /translation="MTAILERRDSENLWGRFCNWITSTENRLYIGWFGVLMIPTLLTA
                         TSVFIIAFIAAPPVDIDGIREPVSGSLLYGNNIISGAIIPTSAAIGLHFYPIWEAASV
                         DEWLYNGGPYELIVLHFLLGVACYMGREWELSFRLGMRPWIAVAYSAPVAAATAVFLI
                         YPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHMLGVAGVFGGSLFSAMHGSL
                         VTSSLIRETTENESANEGYRFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLA
                         AWPVVGIWFTALGISTMAFNLNGFNFNQSVVDSQGRVINTWADIINRANLGMEVMHER
                         NAHNFPLDLAAVEAPSING"
         gene            complement(1691..4236)
                         /gene="trnK-UUU"
                         /locus_tag="ABY07_gt001"
                         /db_xref="GeneID:24418184"
         tRNA            complement(join(1691..1719,4200..4236))
                         /gene="trnK-UUU"
                         /locus_tag="ABY07_gt001"
                         /product="tRNA-Lys"
                         /note="anticodon:UUU"
                         /db_xref="GeneID:24418184"
         gene            complement(1967..3490)
                         /gene="matK"
                         /locus_tag="ABY07_gp074"
                         /db_xref="GeneID:24418113"
         CDS             complement(1967..3490)
                         /gene="matK"
                         /locus_tag="ABY07_gp074"
                         /codon_start=1
                         /transl_table=11
                         /product="maturase K"
                         /protein_id="YP_009141519.1"
                         /db_xref="GeneID:24418113"
                         /translation="MKESQVYLERARSRQQHFLYSLIFREYIYGLAYSHNLNRSLFVE
                         NVGYDNKYSLLIVKRLITRMYQQNHLIISANDSNKNSFWGYNNNYYSQIISEGFSIVV
                         EIPFFLQLSSSLEEAEIIKYYKNFRSIHSIFPFLEDKFTYLNYVSDIRIPYPIHLEIL
                         VQILRYWVKDAPFFHLLRLFLCNWNSFITTKNKKSISTFSKINPRFFLFLYNFYVCEY
                         ESIFVFLRNQSSHLPLKSFRVFFERIFFYAKREHLVKLFAKDFLYTLTLTFFKDPNIH
                         YVRYQGKCILASKNAPFLMDKWKHYFIHLWQCFFDVWSQPRTININPLSEHSFKLLGY
                         FSNVRLNRSVVRSQMLQNTFLIEIVIKKIDIIVPILPLIRSLAKAKFCNVLGQPISKP
                         VWADSSDFDIIDRFLRISRNLSHYYKGSSKKKSLYRIKYILRLSCIKTLACKHKSTVR
                         AFLKRSGSEEFLQEFFTEEEEILSLIFPRDSSTLERLSRNRIWYLDILFSNDLVHDE"
         gene            complement(4722..6149)
                         /gene="rbcL"
                         /locus_tag="ABY07_gp073"
                         /db_xref="GeneID:24418112"
         CDS             complement(4722..6149)
                         /gene="rbcL"
                         /locus_tag="ABY07_gp073"
                         /codon_start=1
                         /transl_table=11
                         /product="ribulose 1,5-bisphosphate carboxylase/oxygenase
                         large subunit"
                         /protein_id="YP_009141520.1"
                         /db_xref="GeneID:24418112"
                         /translation="MSPQTETKAKVGFQAGVKDYKLTYYTPEYQTKDTDILAAFRVTP
                         QPGVPPEEAGAAVAAESSTGTWTTVWTDGLTSLDRYKGRCYEIEPVPGEDNQFIAYVA
                         YPLDLFEEGSVTNMFTSIVGNVFGFKALRALRLEDLRIPNAYVKTFQGPPHGIQVERD
                         KLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENVNSQPFMRWRDRF
                         LFCAEAIYKSQAETGEIKGHYLNATAGTCEEMLKRAIFARELGVPIVMHDYLTGGFTA
                         NTTLSHYCRDNGLLLHIHRAMHAVIDRQKNHGMHFRVLAKALRLSGGDHIHAGTVVGK
                         LEGEREITLGFVDLLRDDYIEKDRSRGIYFTQDWVSLPGVIPVASGGIHVWHMPALTE
                         IFGDDSVLQFGGGTLGHPWGNAPGAVANRVALEACVQARNEGRDLAREGNAIIREAGK
                         WSPELAAACEVWKEIKFEFPAMDTL"
         gene            6916..8385
                         /gene="atpB"
                         /locus_tag="ABY07_gp072"
                         /db_xref="GeneID:24418114"
         CDS             6916..8385
                         /gene="atpB"
                         /locus_tag="ABY07_gp072"
    ..................................... (Many lines omitted here)
       122821 aaaagcttcg ggtaaatcac gaaagctacc gtaacagctg caacaggagt ctattataaa
       122881 ttattttctc ttttttgttt taatagattc atgggcgaac gacgggaatt gaacccgcgc
       122941 atggtggatt cacaatccac tgccttg
    //

My script goes like:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Bio::DB::GenBank;
    use Bio::SeqIO;
    use Text::Wrap;

    my $acc="NC_027152.1";
    my $gb= new Bio::DB::GenBank;
    my $seq1 = $gb->get_Seq_by_acc($acc);
    my $sequence = $seq1->seq;
    print "\n Complete sequence: $sequence\n";
    # code for intergenic sequence needed
    # code for intron sequence needed
    exit;
    #################
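One possible way to fill the two "code needed" placeholders is to parse each feature's location string and cut the matching stretch out of `$sequence`. The sketch below is a minimal illustration, not a complete solution. The sub names `parse_location` and `region_seq` are hypothetical, and the 30-base sequence and its location are made up so the example runs without a network connection. Only the simple location forms shown in the question are handled. With BioPerl, the features of the record fetched above should also be reachable through `$seq1->get_SeqFeatures`, which would avoid parsing location strings by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a GenBank location such as "complement(join(1691..1719,4200..4236))"
# into a strand (+1 or -1) and a list of [start, end] segments sorted by start.
# Assumption: only plain start..end ranges inside join()/complement() occur;
# partial markers (< >), single-base and remote locations are not handled.
sub parse_location {
    my ($loc) = @_;
    my $strand = $loc =~ /complement/ ? -1 : 1;
    my @segments;
    while ( $loc =~ /(\d+)\.\.(\d+)/g ) {
        push @segments, [ $1, $2 ];
    }
    @segments = sort { $a->[0] <=> $b->[0] } @segments;
    return ( $strand, @segments );
}

# Extract bases start..end (1-based, inclusive) from a sequence string and
# reverse-complement them when the feature lies on the minus strand.
sub region_seq {
    my ( $sequence, $start, $end, $strand ) = @_;
    my $sub = substr( $sequence, $start - 1, $end - $start + 1 );
    if ( $strand < 0 ) {
        $sub = reverse $sub;
        $sub =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $sub;
}

# Toy data for illustration only, not taken from NC_027152.1.
my $sequence = 'AAAAACCCCCGGGGGTTTTTACGTACGTAC';
my ( $strand, @exons ) = parse_location('complement(join(6..10,21..25))');

# An intron is the stretch between two consecutive exon segments.
for my $i ( 1 .. $#exons ) {
    my ( $from, $to ) = ( $exons[ $i - 1 ][1] + 1, $exons[$i][0] - 1 );
    print "intron $from..$to: ",
        region_seq( $sequence, $from, $to, $strand ), "\n";
}
```

Intergenic sequences could be cut out the same way by passing the start and end of each gap between neighbouring `gene` features to `region_seq`. Whether to reverse-complement an intergenic stretch is a design choice, since it belongs to no gene and has no strand of its own.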
Re: How to retrieve the intergenic sequences and the introns from GenBank page of NCBI?
by BrowserUk (Patriarch) on Mar 24, 2018 at 06:56 UTC
by supriyoch_2008 (Monk) on Mar 24, 2018 at 10:08 UTC

Re: How to retrieve the intergenic sequences and the introns from GenBank page of NCBI?
by poj (Abbot) on Mar 24, 2018 at 10:07 UTC