in reply to Get 10,000 web pages fast

First: your script should check each site's robots.txt to determine whether automated scraping is welcome there.
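A minimal sketch of that check, using Python's standard urllib.robotparser (the user-agent string below is just a placeholder for whatever your script actually identifies itself as):

    from urllib import robotparser
    from urllib.parse import urlparse

    def allowed_by_robots(url, user_agent="MyFetcher/1.0"):
        # Fetch and parse robots.txt from the root of the URL's host.
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = robotparser.RobotFileParser(root + "/robots.txt")
        rp.read()
        return rp.can_fetch(user_agent, url)

It's worth caching one parser per host so you only fetch each site's robots.txt once, not once per page.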

You should also make sure you don't hammer any of the sites you're hitting into the ground: spread your accesses to any single server out over time. Otherwise you'll unintentionally mount a denial-of-service attack on the site you're fetching and really tick people off. If the 10,000 pages are spread across different sites, this isn't such a big deal.

As a rule of thumb, requesting another URL on the same site less than one second after the last one may get you noticed, and possibly yelled at, by both the person you're scanning and the ISP you're using (or the IT guys, if you're doing this at work).
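A rough sketch of per-host throttling along those lines; fetch here is a stand-in for whatever downloader you're actually using, and the one-second gap is just the rule of thumb above:

    import time
    from urllib.parse import urlparse

    last_hit = {}  # host -> time of the most recent request to it

    def polite_fetch(url, fetch, min_gap=1.0):
        host = urlparse(url).netloc
        # Sleep until at least min_gap seconds have passed since the
        # last request to this host.
        wait = last_hit.get(host, 0) + min_gap - time.time()
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()
        return fetch(url)

Interleaving requests to different hosts keeps the whole run fast while still staying polite to each individual server.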

Some sites (I know Yahoo! does it from having worked there) will actually stop serving you real pages and just return an error page if you hit them too hard or too often.
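If you do start getting those error responses, back off instead of retrying right away. A sketch, assuming a hypothetical fetch_status helper that returns the HTTP status code and the page body:

    import time

    def fetch_with_backoff(url, fetch_status, tries=4, delay=5.0):
        for attempt in range(tries):
            status, body = fetch_status(url)
            if status not in (429, 503):
                return body
            # Throttled or overloaded: wait, doubling the delay each time.
            time.sleep(delay * (2 ** attempt))
        return None  # give up on this URL after a few tries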