I didn't think I'd be able to answer this one, but then I saw the site you were trying to crawl.
NCBI has a nice API for retrieving data - you don't have to walk through the pages.
The real solution here is to use modules specifically designed for getting data from NCBI. These modules use the API properly, instead of fudging through JavaScript. I suggest
Bio::Perl for genomic data (I think it can do PubMed articles too -- ah yes,
Bio::Biblio - there's even a sample script for PubMed queries included with the BioPerl distribution), or NCBI's own
Entrez Programming Utilities.
Personally, I use the Bio::Perl modules on a daily basis for a great deal of the work I do with NCBI data.
As for legal status, the data is freely available. They do specifically ask that you use their API rather than spidering the pages. ;-)
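
For what it's worth, here is a rough sketch of what "using the API" looks like at the HTTP level with the Entrez Programming Utilities. It's in Python rather than Perl, but since E-utilities is plain HTTP it should translate directly. The `esearch.fcgi` and `efetch.fcgi` endpoints and the `db`, `term`, `retmax`, `id`, `rettype` and `retmode` parameters come from E-utilities; the function names (`esearch_url`, `efetch_url`, `fetch`) and the default values are my own, purely for illustration. Check NCBI's usage guidelines before running anything in a loop.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the E-utilities endpoints.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, db="pubmed", retmax=20):
    """Build an ESearch URL that returns the IDs matching a query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def efetch_url(ids, db="pubmed", rettype="abstract", retmode="text"):
    """Build an EFetch URL that retrieves the records for a list of IDs."""
    params = {
        "db": db,
        "id": ",".join(str(i) for i in ids),  # IDs go in one comma-separated value
        "rettype": rettype,
        "retmode": retmode,
    }
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)


def fetch(url):
    """Hypothetical helper: perform the request and return the body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

The idea is two requests instead of a crawl: one ESearch call to get the matching IDs, then one EFetch call with those IDs to pull the records. Nothing above touches the network until you call `fetch()` on one of the generated URLs.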