ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have an "import" script that I'm using, which "grabs" images from one of our other sites - and then inserts them into itself. The code is as follows:
$tmp[3] =~ s/^\[|\]$//g;                # strip the surrounding [ ] from the image list
foreach ( split /,/, $tmp[3] ) {
    my $ext   = (reverse split /\./, $_)[0];
    my $fname = "/home/user/public_html/cgi-bin/links/admin/IMPORT/tmp_images/"
              . CORE::time() . random_string() . ".$ext";
    `wget --quiet -O "$fname" "$_"`;    # one blocking request per image
    $hit->{"Image$images_count"} = GT::SQL::File->open($fname);
    $images_count++;
}


This is OK, but it's pretty slow. Is there a better way to grab all the images at once? Would that even make a difference? (We have potentially 13 images per listing, and over 13,000 listings.)

What I was thinking of is getting all 13 of those images at once... and THEN doing the loop (without the individual requests, which I guess are what's slowing it down).

Any suggestions?

TIA

Andy

Re: Quicker way to batch grab images?
by davido (Cardinal) on Feb 12, 2014 at 16:37 UTC

    The fastest that script will ever run is bounded by the time it takes to download each file sequentially. If each image takes three seconds, then even in a world with no server latency or network bottlenecks you cannot finish in under 5.8 days (13 images per each of 13,000 listings at 3 s per image is about 507,000 seconds). This is because you are doing blocking requests: your script waits for wget to finish (in order to retrieve its output, which you never use) before moving on, so each request must complete before the next one starts.

    However, if you can process several images at a time, say all thirteen from one listing before moving on to the next, you will be constrained more by network bandwidth and less by the raw throughput of an individual file. I cobbled together an example of parallel non-blocking requests in this response: Re: use LWP::Simple slows script down.
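    For illustration, here is a minimal sketch of that kind of parallel, non-blocking fetching, using Mojo::UserAgent (the toolkit choice, URL list, and /tmp target directory are placeholders, not necessarily what the linked response uses):

        use strict;
        use warnings;
        use Mojo::UserAgent;
        use Mojo::IOLoop;

        # Hypothetical list of image URLs for one listing.
        my @urls = ('http://example.com/img1.jpg', 'http://example.com/img2.jpg');

        my $ua      = Mojo::UserAgent->new;
        my $pending = scalar @urls;

        for my $url (@urls) {
            # Fire off a non-blocking GET; the callback runs when that download finishes.
            $ua->get($url => sub {
                my ($ua, $tx) = @_;
                my ($name) = $url =~ m{([^/]+)$};                 # last path component
                $tx->res->content->asset->move_to("/tmp/$name");  # write the body to disk
                Mojo::IOLoop->stop unless --$pending;             # stop after the last one
            });
        }

        # All of the requests proceed concurrently inside the event loop.
        Mojo::IOLoop->start unless Mojo::IOLoop->is_running;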

    Apply those principles to your project and you will reduce the time needed considerably. Let's say you have sufficient bandwidth to handle 13 incoming files at a time, and that instead of 3 seconds per file it now takes 6 because you've increased the load on the remote server. Because you are requesting batches of 13 and only waiting for each batch to finish before moving on, you are now looking at roughly 6 * 13,000 seconds instead of 3 * 13 * 13,000, which is less than a day to complete.

    Even more efficient would be to limit the number of simultaneous requests to whatever your bandwidth and the remote server can handle, and not be concerned with finishing an entire listing before moving on to the next.


    Dave

      Another way, similar to Re: use LWP::Simple slows script down., is to use Parallel::ForkManager with wget or curl (the same caveats apply as in that sub-thread).

      A (possibly useless) data point: with that combination and 6 sub-processes, I was able to download 100+ files of around 1-10 MB each in much less time than a serial download took -- I did not track actual numbers for comparison. A single download takes 2 s to 5 s most of the time; some of the worse cases are ~15 s.
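      A rough sketch of that combination, assuming a flat list of URLs on the command line (the paths and the 6-process cap are arbitrary):

          use strict;
          use warnings;
          use Parallel::ForkManager;

          my @urls = @ARGV;                          # hypothetical: image URLs passed as arguments
          my $pm   = Parallel::ForkManager->new(6);  # at most 6 downloads in flight

          for my $url (@urls) {
              $pm->start and next;                   # fork; the parent moves on to the next URL
              my ($name) = $url =~ m{([^/]+)$};
              system 'wget', '--quiet', '-O', "/tmp/$name", $url;  # the child does the blocking download
              $pm->finish;                           # child exits
          }
          $pm->wait_all_children;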

Re: Quicker way to batch grab images?
by Laurent_R (Canon) on Feb 12, 2014 at 18:37 UTC
    Hm, it may sound stupid, but just in case: are all the images different, or do some of them get reused? If some are reused, then you should probably cache them locally.
Re: Quicker way to batch grab images?
by stonecolddevin (Parson) on Feb 17, 2014 at 22:03 UTC

    2 things:

    1. Is there any way you could use rsync for this, or are you effectively scraping another site for the images? If you are just connecting to a server and grabbing the image files, you should look into rsync (see the sketch after this list).
    2. Are you doing any other operations whilst procuring said images? If you're updating database records in tandem, etc., then you're obviously going to want to move those outside of your image-downloading loop.
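    If you do have rsync access to the other box, something along these lines would pull everything over in one pass; the host and directories here are made up for illustration, and it could be called from the import script:

        use strict;
        use warnings;

        # -a preserves attributes and recurses, -z compresses in transit.
        # Hypothetical host and paths.
        system 'rsync', '-az',
               'user@othersite.example.com:/home/user/public_html/images/',
               '/home/user/public_html/cgi-bin/links/admin/IMPORT/tmp_images/';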

    I've done a ton of image moving through S3, and I've found a lot of success with something like Parallel::Runner. I think davido's response is probably your best bet. Parallelize, and if at all possible, do some horizontal scaling so you can have multiple worker machines chipping away at your queue.
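    For reference, a bare-bones Parallel::Runner sketch of that idea (the URL list, worker count, and wget fetch are placeholders):

        use strict;
        use warnings;
        use Parallel::Runner;

        my @urls   = @ARGV;                     # hypothetical list of image URLs
        my $runner = Parallel::Runner->new(8);  # up to 8 forked workers at a time

        for my $url (@urls) {
            $runner->run(sub {
                my ($name) = $url =~ m{([^/]+)$};
                system 'wget', '--quiet', '-O', "/tmp/$name", $url;
            });
        }
        $runner->finish;    # wait for all workers to complete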

    Three thousand years of beautiful tradition, from Moses to Sandy Koufax, you're god damn right I'm living in the fucking past