in reply to Re: Advice on Efficient Large-scale Web Crawling
in thread Advice on Efficient Large-scale Web Crawling

"merlyn, I'm afraid I do have to hit this number of external URLs. :-) It's for a research project that does have many merits."
Then use the Google API and their database. Alternatively, use the newly announced Alexa API from Amazon.
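
As a rough sketch, querying Google's existing index from Perl might look something like this with the CPAN module Net::Google (which wraps the SOAP search API). The key, query, and result handling below are placeholders, not working values; check the module's docs and your license terms:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Net::Google;    # CPAN wrapper around the Google SOAP search API

    # Placeholder key -- you get your own when you register for the API.
    my $google = Net::Google->new( key => 'YOUR-GOOGLE-API-KEY' );

    my $search = $google->search();
    $search->query('large-scale web crawling');   # placeholder query
    $search->max_results(10);

    # Each result already carries what Google has on file for the page,
    # so there's no need to go fetch it yourself.
    for my $result ( @{ $search->results() } ) {
        print $result->title(), "\n";
    }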

There's no justified reason to re-crawl the world, unless you're also providing benefit to the world, and you haven't yet convinced me of your benefit ("research project" could mean anything).

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.


Re^3: Advice on Efficient Large-scale Web Crawling
by Scott7477 (Chaplain) on May 07, 2006 at 06:54 UTC
    According to the Google API documentation, a license key allows only 1,000 automated queries per day. This page, while somewhat dated, provides some data relevant to this discussion. A couple of key points from that data include:

    - Netcraft estimated that 42.8 million web servers existed. Assuming 50 URLs per web server gives over 2.1 billion URLs. If the OP is randomly selecting URLs, the chances of any particular server being significantly inconvenienced are small, in my estimation. A sketch of a polite, per-host-throttled fetch loop follows.
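
    Even if pages do have to be fetched directly, a per-host delay keeps any one server from being hammered. Here is a rough sketch using LWP::RobotUA, which honors robots.txt and enforces a minimum wait between requests to the same host; the bot name, contact address, URL list, and delay are made-up values:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::RobotUA;

        # RobotUA checks robots.txt and sleeps between hits to the same host.
        my $ua = LWP::RobotUA->new( 'ResearchCrawler/0.1',       # placeholder bot name
                                    'researcher@example.edu' );  # placeholder contact
        $ua->delay(1);      # at least 1 minute between requests to one server
        $ua->timeout(30);

        # @urls would come from the randomly selected sample discussed above.
        my @urls = ( 'http://example.com/', 'http://example.org/' );

        for my $url (@urls) {
            my $resp = $ua->get($url);
            if ( $resp->is_success ) {
                # process $resp->content here
                print "fetched $url\n";
            }
            else {
                warn "skipped $url: ", $resp->status_line, "\n";
            }
        }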