Becky has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've been getting protein sequences from the NCBI for years using code like this:
use LWP;
use HTTP::Request::Common;

my $ua = new LWP::UserAgent;
my $result = $ua->request(GET "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&qty=1&start=1&list_uids=148261691&dopt=fasta");
if ($result->is_success) {
    $fasta = $result->content;
}
# then parse $fasta to get out just my protein sequence
However, this no longer works, as NCBI has started using JavaScript. If you paste the URL above into a browser and view the page source you'll see what I mean: the sequence is no longer visible in the page source. The sequence should start:

>gi|148261691|ref|YP_001235818.1| CheA signal transduction histidine kinase Acidiphilium cryptum JF-5
MTGGGSMDPMAEIRETFFQECEEQLAELESGLMRMEAGETDSETVNAVFRAVHSIKGGAGAFGLEDLVHF

Can anyone tell me how to get my sequences now? Thanks, Becky

Replies are listed 'Best First'.
Re: Getting data from NCBI
by derby (Abbot) on Apr 30, 2009 at 12:27 UTC

    Becky, NCBI switched over to a web service model years ago. Check out their eutils page for more info. If you do a lot of work with their data, I would recommend signing up on their mailing list.

    For this particular query I think you would need to make an esearch request and then an efetch (but the folks at NCBI would know better):

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=148261691&rettype=uilist&usehistory=y
    and then using the info from esearch (basically WebEnv and query_key):
    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&WebEnv=<xxx>&query_key=<yyy>&rettype=fasta&retmode=xml&sort=pub+date
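The two-step flow above could be sketched in Perl roughly like this. It is only a sketch under a few assumptions: the esearch reply is XML carrying <WebEnv> and <QueryKey> elements (which is what usehistory=y is for), the helper names esearch_url, parse_history and efetch_url are made up for illustration, retmode=text is swapped in so the result comes back as plain FASTA, and a regex stands in for a real XML parser to keep things short. The sample reply is invented, not real NCBI output.

```perl
use strict;
use warnings;

# Base URL copied from the post above.
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# Build the esearch URL for one protein id, asking NCBI to keep the
# result on its history server (usehistory=y).
sub esearch_url {
    my ($id) = @_;
    return "$BASE/esearch.fcgi?db=protein&term=$id&usehistory=y";
}

# Pull WebEnv and QueryKey out of the esearch XML reply.
# Returns undef for any value that is not present.
sub parse_history {
    my ($xml) = @_;
    my ($web_env)   = $xml =~ m{<WebEnv>([^<]+)</WebEnv>};
    my ($query_key) = $xml =~ m{<QueryKey>([^<]+)</QueryKey>};
    return ($web_env, $query_key);
}

# Build the efetch URL from the two values esearch handed back.
sub efetch_url {
    my ($web_env, $query_key) = @_;
    return "$BASE/efetch.fcgi?db=protein&WebEnv=$web_env"
         . "&query_key=$query_key&rettype=fasta&retmode=text";
}

# Made-up reply fragment, only to show the shape of the data.
my $sample = '<eSearchResult><Count>1</Count><QueryKey>1</QueryKey>'
           . '<WebEnv>NCID_EXAMPLE</WebEnv></eSearchResult>';
my ($web_env, $query_key) = parse_history($sample);
print efetch_url($web_env, $query_key), "\n";
```

In real use, the string passed to parse_history would be the content of a GET on esearch_url(148261691), and the efetch URL would then be fetched the same way.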

    -derby

    Update: The NCBI folks may have a better way of *directly* pulling the data based on ids -- I only query pubmed and the app normally has to do a search first so this search/fetch approach always worked well for me.

    Update: Well ... it looks like you can directly pull:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=148261691&rettype=fasta
    Like I said, I really like what the NCBI folks are doing.
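For the direct pull, something along these lines might replace the original viewer.fcgi request. This is a minimal sketch, not a tested client: efetch_id_url and fetch_fasta are hypothetical helper names, the URL is the one from the update above, and LWP::UserAgent is loaded lazily inside fetch_fasta so that the URL helper can be tried on its own.

```perl
use strict;
use warnings;

# Build an efetch URL that asks for one protein record as FASTA.
sub efetch_id_url {
    my ($id) = @_;
    return 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
         . "?db=protein&id=$id&rettype=fasta";
}

# Fetch the record over HTTP; dies with the status line on failure.
sub fetch_fasta {
    my ($id) = @_;
    require LWP::UserAgent;    # loaded only when a fetch is attempted
    my $ua     = LWP::UserAgent->new;
    my $result = $ua->get(efetch_id_url($id));
    die 'efetch failed: ' . $result->status_line . "\n"
        unless $result->is_success;
    return $result->content;
}

# Show the URL that would be requested for the id in the question.
print efetch_id_url(148261691), "\n";
```

A call such as my $fasta = fetch_fasta(148261691); would then stand in for the $ua->request(GET ...) block in the original question.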

      I actually had a Java program that did something very similar to this. It was very nice to just have the sequence as a string to read in, no parsing required! You can get the bare FASTA-formatted sequence pretty easily if you use something like this:

      my $giNum = 148261691;
      my $seq = "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&qty=1&c_start=1&list_uids=" . $giNum . "&uids=&dopt=fasta&dispmax=5&sendto=t&from=begin&to=end";


      and then do something similar to what you're doing above.

      good luck!
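Once the FASTA text is in a string, separating the header from the sequence takes only a few lines. The sketch below assumes a single record per string (one '>' header line followed by sequence lines); parse_fasta is a hypothetical helper name, and the sample record is just the start of the sequence quoted in the question.

```perl
use strict;
use warnings;

# Split one FASTA record into its header (without the leading '>')
# and its sequence with line breaks and other whitespace removed.
# Returns an empty list if the text does not start with a '>' line.
sub parse_fasta {
    my ($fasta) = @_;
    my ($header, @lines) = split /\r?\n/, $fasta;
    return unless defined $header && $header =~ s/^>//;
    my $seq = join '', @lines;
    $seq =~ s/\s+//g;
    return ($header, $seq);
}

# Only the beginning of the record, as quoted in the question.
my $record = ">gi|148261691|ref|YP_001235818.1| CheA signal transduction histidine kinase Acidiphilium cryptum JF-5\n"
           . "MTGGGSMDPMAEIRETFFQECEEQLAELESGLMRMEAGETDSETVNAVFRAVHSIKGGAGAFGLEDLVHF\n";

my ($header, $seq) = parse_fasta($record);
print "$header\n$seq\n";
```

For files holding several records, a dedicated parser such as Bio::SeqIO from BioPerl is probably the better choice.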
      You're a star, thanks! I never knew about that stuff before but will use it from now on! Becky
Re: Getting data from NCBI
by frieduck (Hermit) on Apr 30, 2009 at 14:29 UTC