Hi Perlmonks,

I would like to retrieve the coding sequence (CDS) of a gene from its accession number in a Perl script. I searched the web and found a script that retrieves the entire sequence of the record, not the CDS alone. On the NCBI GenBank page for the record, the word CDS is a hyperlink, and clicking it shows the CDS of the gene. Is it possible to get the CDS directly with a Perl script and an internet connection? I welcome suggestions from the Perlmonks. Here is the script:

#!/usr/bin/perl
#########################################################
# Perl code to get complete sequence in FASTA format from GenBank page
#########################################################
use warnings;
use strict;
use Bio::DB::GenBank;
use Bio::SeqIO;

my $gb   = Bio::DB::GenBank->new;
my $acc  = "NM_021817";
my $seq1 = $gb->get_Seq_by_acc($acc);
my $sequence    = $seq1->seq();
my $description = $seq1->desc();
print "\n $description $sequence\n";
exit;

The script produces the following output in the cmd window:

C:\Users\Desktop>seq.pl

 Homo sapiens hyaluronan and proteoglycan link protein 2 (HAPLN2), mRNA.
TTGCCATTACCCACAGAATAAAGAAAGGGGCCCTGTTATTCAACAATAGGGGAAAAGACAGAGACAATGGGAAATTGTGC
TTCCGATGGGGTGGGGACTGAGAAGGAAAGGACAGACAGACAGACAGACAGGGGGTTGTACAGAAGAGGTCCGGTTTCTT
GAAGCAGCTGGAAGTCCTGGATAGTTCCCACCTGAAAGTCTGTTTGCAAAGGCAATGCGCACTCAGGCACCAGAGGGCAG
AGGGGCTCAAGTTCCAGGGTTTTAAGGTGCTTGGAACTCCCAGGAGCCTGGCAAACCTTCATCCAGAACCTCTTCCTCAA
GCAAGACAAAAAGCTGCTAAGCACTGCTCCCTCCGTCTCTGTGAAGAGACCAGCTTCTAACAGACGGTGCCGGGCTGACC
CCCCATCATGCCAGGCTGGCTCACCCTCCCCACACTCTGCCGCTTCCTTCTTTGGGCCTTCACCATCTTCCACAAAGCCC
AAGGAGACCCAGCATCCCACCCGGGCCCCCACTACCTCCTGCCCCCCATCCACGAGGTCATTCACTCTCATCGTGGGGCC
ACGGCCACGCTGCCCTGCGTCCTGGGCACCACGCCTCCCAGCTACAAGGTGCGCTGGAGCAAGGTGGAGCCTGGGGAGCT
CCGGGAAACGCTGATCCTCATCACCAACGGACTGCACGCCCGGGGGTATGGGCCCCTGGGAGGGCGCGCCAGGATGCGGA
GGGGGCATCGACTAGACGCCTCCCTGGTCATCGCGGGCGTGCGCCTGGAGGACGAGGGCCGGTACCGCTGCGAGCTCATC
AACGGCATCGAGGACGAGAGCGTGGCGCTGACCTTGAGCTTGGAGGGTGTGGTGTTTCCGTACCAACCCAGCCGGGGCCG
GTACCAGTTCAATTACTACGAGGCGAAGCAGGCGTGCGAGGAGCAGGACGGACGCCTGGCCACCTACTCCCAGCTCTACC
AGGCTTGGACCGAGGGTCTGGACTGGTGTAACGCGGGCTGGCTGCTCGAGGGCTCCGTGCGCTACCCTGTGCTCACCGCA
CGCGCCCCGTGCGGCGGCCGAGGCCGGCCCGGGATCCGCAGCTACGGACCCCGCGACCGGATGCGCGACCGCTACGACGC
CTTCTGCTTCACCTCCGCGCTGGCGGGCCAAGTGTTCTTCGTGCCCGGGCGGCTGACGCTGTCTGAAGCCCACGCGGCGT
GCCGGCGACGCGGCGCCGTGGTGGCCAAGGTTGGGCACCTCTACGCCGCCTGGAAGTTTTCGGGGCTAGACCAGTGCGAC
GGCGGCTGGCTGGCTGACGGCAGTGTGCGCTTCCCAATCACCACGCCGAGGCCGCGCTGCGGGGGGCTCCCGGATCCCGG
AGTGCGCAGTTTCGGCTTCCCCAGGCCCCAACAGGCAGCCTATGGGACCTACTGCTACGCCGAGAATTAGGCGCCCACCG
TGTCCCCTCCAGCGCGCGCGAAGAAGCTTGGGAGTCGTGGCGGGGGTCTCTCGCCACCCCTTTCCGGAGAGCCTCCCCTC
CCTCCAGACCCGGAGCGGCCTCTCCAGACCTGCCTTCCCAGCCGGGGGCTGCGGGCCTCGGACCCCGGCTGGCCCGGCGG
CGGGGAGGGGAGGCGGGGGCGCCTCCGGCGGCGAGATGCAGAGGTGACCCTCGGACCCGCTGCCGTTCGCGAACCCTAGC
AGAGGACTCAGCCACCGCCGGGGGGAGGGTGAGGCGGCCGGGGGCATTAACTGACCTCTGAGTACAGCAATAAAATAACC
TGGGGATCTTTAAAAAAAAAAAAAAAAAAAAAAAA

C:\Users\Desktop>

I would like to get the CDS as given below:

>NM_021817.2:408-1430 Homo sapiens hyaluronan and proteoglycan link protein 2 (HAPLN2), mRNA
ATGCCAGGCTGGCTCACCCTCCCCACACTCTGCCGCTTCCTTCTTTGGGCCTTCACCATCTTCCACAAAG
CCCAAGGAGACCCAGCATCCCACCCGGGCCCCCACTACCTCCTGCCCCCCATCCACGAGGTCATTCACTC
TCATCGTGGGGCCACGGCCACGCTGCCCTGCGTCCTGGGCACCACGCCTCCCAGCTACAAGGTGCGCTGG
AGCAAGGTGGAGCCTGGGGAGCTCCGGGAAACGCTGATCCTCATCACCAACGGACTGCACGCCCGGGGGT
ATGGGCCCCTGGGAGGGCGCGCCAGGATGCGGAGGGGGCATCGACTAGACGCCTCCCTGGTCATCGCGGG
CGTGCGCCTGGAGGACGAGGGCCGGTACCGCTGCGAGCTCATCAACGGCATCGAGGACGAGAGCGTGGCG
CTGACCTTGAGCTTGGAGGGTGTGGTGTTTCCGTACCAACCCAGCCGGGGCCGGTACCAGTTCAATTACT
ACGAGGCGAAGCAGGCGTGCGAGGAGCAGGACGGACGCCTGGCCACCTACTCCCAGCTCTACCAGGCTTG
GACCGAGGGTCTGGACTGGTGTAACGCGGGCTGGCTGCTCGAGGGCTCCGTGCGCTACCCTGTGCTCACC
GCACGCGCCCCGTGCGGCGGCCGAGGCCGGCCCGGGATCCGCAGCTACGGACCCCGCGACCGGATGCGCG
ACCGCTACGACGCCTTCTGCTTCACCTCCGCGCTGGCGGGCCAAGTGTTCTTCGTGCCCGGGCGGCTGAC
GCTGTCTGAAGCCCACGCGGCGTGCCGGCGACGCGGCGCCGTGGTGGCCAAGGTTGGGCACCTCTACGCC
GCCTGGAAGTTTTCGGGGCTAGACCAGTGCGACGGCGGCTGGCTGGCTGACGGCAGTGTGCGCTTCCCAA
TCACCACGCCGAGGCCGCGCTGCGGGGGGCTCCCGGATCCCGGAGTGCGCAGTTTCGGCTTCCCCAGGCC
CCAACAGGCAGCCTATGGGACCTACTGCTACGCCGAGAATTAG
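One route that may work, offered only as an untested sketch: the record returned by get_Seq_by_acc should carry the GenBank feature table, so the CDS could be read from its CDS feature instead of from the whole sequence. The sketch assumes that the feature table is present on the returned object and that the record has at least one CDS feature. The helper names first_cds, fasta_record and main are hypothetical, and the header layout only imitates the one shown above (it uses the accession as typed, without the version suffix).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return (start, end, sequence) for the first CDS
# feature on a BioPerl sequence object, or an empty list if there is none.
sub first_cds {
    my ($seq_obj) = @_;
    for my $feat ( $seq_obj->get_SeqFeatures ) {
        next unless $feat->primary_tag eq 'CDS';
        # spliced_seq follows join(...) locations, so it should also cover
        # a CDS that is split into several segments.
        return ( $feat->start, $feat->end, $feat->spliced_seq->seq );
    }
    return;
}

# Hypothetical helper: build a FASTA record wrapped at $width columns.
sub fasta_record {
    my ( $header, $sequence, $width ) = @_;
    $width ||= 70;
    ( my $wrapped = $sequence ) =~ s/(.{1,$width})/$1\n/g;
    return ">$header\n$wrapped";
}

# Fetch one accession over the network and print its CDS as FASTA.
sub main {
    my ($acc) = @_;
    require Bio::DB::GenBank;    # loaded only when a lookup is requested
    my $gb      = Bio::DB::GenBank->new;
    my $seq_obj = $gb->get_Seq_by_acc($acc)
        or die "No record returned for $acc\n";
    my ( $start, $end, $cds ) = first_cds($seq_obj)
        or die "No CDS feature found for $acc\n";
    print fasta_record( "$acc:$start-$end " . $seq_obj->desc, $cds );
}

# Assumed usage: perl cds.pl NM_021817
main( $ARGV[0] ) if @ARGV;
```

Run with an accession on the command line, this should print a record in the shape shown above if the assumptions hold. Taking only the first CDS feature is a simplification; a genomic record with several genes would need a loop over all CDS features.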

In reply to Is it possible to retrieve the coding sequence of a gene from NCBI GenBank database using perl ? by supriyoch_2008
