in reply to Need to Improve Scraping Speed

If you know it takes 8 hours, then you're done. Who cares now how long it took? You saved the data, didn't you?

Now stick it all in a database and periodically fetch only more recent data.
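A minimal sketch of that idea, assuming a simple SQLite table and an `updated_at` timestamp per record (the schema and names here are illustrative, not from the original scraper):

```python
import sqlite3

# Sketch of the "database plus incremental fetch" approach: store every
# scraped record with its update timestamp, then on later runs ask the
# database for the newest timestamp and fetch only records newer than it.

def ensure_schema(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS records (
            id TEXT PRIMARY KEY,
            payload TEXT,
            updated_at TEXT
        )
    """)

def last_update(conn):
    # None on the first run, which signals "do a full scrape".
    row = conn.execute("SELECT MAX(updated_at) FROM records").fetchone()
    return row[0]

def store(conn, rows):
    # rows: iterable of (id, payload, updated_at) tuples from the scraper.
    conn.executemany(
        "INSERT OR REPLACE INTO records (id, payload, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
```

On subsequent runs, pass the value of `last_update()` to whatever fetch routine you use, so only newer data crosses the wire.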

Replies are listed 'Best First'.
Re^2: Need to Improve Scraping Speed
by cheech (Beadle) on Dec 03, 2009 at 20:02 UTC
    The code will be bundled with other scripts and sent to other machines for running from scratch. I'm looking for coding improvements that might speed it up a bit.

    Thanks

      Probably 99% or more of that 8 hours is spent waiting on the server. You can't speed it up by fiddling with the client. Of course you could parallelize some fetches - but that puts extra load on your good-will information provider.
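If you do parallelize, a small worker pool is the polite way to do it: the workers overlap the time spent waiting on the server, while the pool size caps the load on the provider. A sketch, where `fetch` is a placeholder for your actual page download:

```python
from concurrent.futures import ThreadPoolExecutor

# Polite parallel fetching: max_workers limits how many requests hit the
# server at once, while still overlapping network wait time.
def fetch_all(urls, fetch, max_workers=3):
    # pool.map preserves the input order of the results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Keep `max_workers` small (2-4); the point is to hide latency, not to hammer the site.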

      Probably your best bet would be to package up the already-downloaded text file (compress the heck out of it) and ship that off with your code. Then a bare machine will load that file first before updating from the server.

      I already gave you that answer after I gave you a program to scrape the data once and only once.

      The solution is to scrape the data once and only once, repackage it, compress it, and host it as a few compressed files on an ftp server.

      CGI is the wrong way to distribute this data, and you shouldn't distribute this scraper program; it's like distributing a denial-of-service tool.