in reply to Quicker way to batch grab images?

The fastest that script will ever run is bounded by the time it takes each file to download sequentially. If each image takes three seconds, then even in a world with no server latency or network bottlenecks you cannot finish in under 5.8 days (13 images for each of 13,000 listings, at 3 s per image, is 507,000 seconds). The reason is that you are making blocking requests: your script waits for wget to finish (in order to retrieve its output, which you never use) before moving on to the next request.
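The serial pattern being described looks like the sketch below. Each download blocks the loop, so the total time is the simple sum of all per-file times; the filenames are made up, and `echo` stands in for wget so the sketch runs without a network.

```shell
# Serial downloads: each fetch must finish before the next one starts.
for url in img1.jpg img2.jpg img3.jpg; do
  echo "fetching $url"    # real script: wget -q "$url"  (blocks here)
done
```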

However, if you can fetch several images at a time, say all thirteen from one listing before moving on to the next, you will be constrained more by network bandwidth and less by the round-trip time of each individual file. I cobbled together an example of parallel non-blocking requests in this response: Re: use LWP::Simple slows script down..

Apply those principles to your project and you will reduce the time needed considerably. Say you have sufficient bandwidth to handle 13 incoming files at a time, and that each file now takes 6 seconds instead of 3 because you've increased the load on the remote server. Because you are requesting batches of 13 and waiting for each batch to finish before moving on, the total drops from 3*13*13000 seconds (about 5.9 days) to 6*13000 seconds, or under 22 hours.
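One way to get that per-listing batch from the command line is `xargs -P`, which keeps up to N commands running at once. This is only a sketch: the URL scheme is invented, and `echo` stands in for wget so it runs without a network.

```shell
# Fetch one listing's 13 images concurrently: the batch takes roughly
# one download time instead of thirteen. The example.com URL layout is
# an assumption about the site's structure.
listing=12345
seq 1 13 \
  | xargs -P 13 -I{} echo "fetch http://example.com/$listing/img{}.jpg"
  # real run: | xargs -P 13 -I{} wget -q "http://example.com/$listing/img{}.jpg"
```

`-P 13` caps the number of simultaneous children at 13; an outer loop over listing IDs would then advance one listing per batch.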

Even more efficient would be to cap the total number of simultaneous requests at whatever your bandwidth and the remote server can handle, and not be concerned with finishing an entire listing before moving on to the next.
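With `xargs -P` that amounts to feeding the entire URL list through a single bounded pool: a new download starts the moment any slot frees up, with no per-listing barrier. The pool size of 6 and the URL list are assumptions; `echo` again stands in for wget.

```shell
# One global pool over all URLs: xargs keeps exactly 6 fetches in
# flight for the whole run, regardless of listing boundaries.
printf 'http://example.com/img%d.jpg\n' 1 2 3 4 5 6 7 8 \
  | xargs -P 6 -n 1 echo fetching
  # real run: xargs -P 6 -n 1 wget -q < urls.txt
```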


Dave

Re^2: Quicker way to batch grab images?
by Anonymous Monk on Feb 13, 2014 at 03:17 UTC

    Another way, similar to Re: use LWP::Simple slows script down., is to use Parallel::ForkManager with wget or curl (the same caveats apply as in that sub-thread).

    A (useless) data point: with that combination and 6 sub-processes, I was able to download 100+ files of around 1-10 MB each in ridiculously less time than a serial download -- I did not track actual numbers for comparison. A single download takes 2-5 s most of the time; some of the worst cases are ~15 s.
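A rough shell analogue of that Parallel::ForkManager setup is to background each download and pause after every 6 children. This is a crude batch throttle, not a true pool (all 6 must finish before the next 6 start); `sleep`/`echo` stand in for a real wget or curl call.

```shell
# Cap concurrency at 6 children, Parallel::ForkManager-style.
n=0
for url in img1 img2 img3 img4 img5 img6 img7 img8; do
  ( sleep 0.1; echo "done $url" ) &          # real child: wget -q "$url"
  n=$((n+1))
  if [ $((n % 6)) -eq 0 ]; then wait; fi     # crude throttle after each 6
done
wait   # let the final partial batch finish
```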