Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I'm looking for the *fastest* way to download 100 URLs in parallel, ideally using Perl.

I'm currently using fork() and LWP::Simple, but I'd prefer not to spawn 100+ sub-processes. I've looked at Perl threads, but I want to steer clear of them until they're stable.

I'm on a Linux machine. Does anyone know of a low-level C program that will do the I/O in parallel, ideally one with a Perl wrapper?

Is LWP::Parallel the fastest Perl way to do this?

Nige


Re: Downloading URL's in Parallel with Perl
by mr.nick (Chaplain) on Sep 11, 2001 at 16:14 UTC
    Look for the module LWP::Parallel. It does what you want without the overhead of fork() or Thread.
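    Here's a rough sketch of the non-blocking approach, loosely following the module's documented synopsis (the @urls list and the particular limits are made up for illustration):

        use strict;
        use warnings;
        use LWP::Parallel::UserAgent;
        use HTTP::Request;

        my @urls = map { "http://www.example.com/page$_.html" } 1 .. 100;   # hypothetical list

        my $pua = LWP::Parallel::UserAgent->new;
        $pua->max_req(10);      # how many connections to keep open at once
        $pua->timeout(30);      # per-connection timeout in seconds

        # register() queues each request instead of fetching it immediately
        for my $url (@urls) {
            if ( my $err = $pua->register( HTTP::Request->new( GET => $url ) ) ) {
                warn $err->error_as_HTML;
            }
        }

        # wait() services all registered requests in parallel, then returns
        my $entries = $pua->wait;
        for my $entry ( values %$entries ) {
            my $res = $entry->response;
            printf "%s => %s\n", $res->request->uri, $res->code;
        }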

    mr.nick ...

Re: Downloading URL's in Parallel with Perl
by MZSanford (Curate) on Sep 11, 2001 at 16:11 UTC
    If speed is the key but process spawning is too slow, threads are what you need. As you said, Perl threads are experimental. To use native C threading, I believe you would find it easier to write the program in C or C++. You might want to look into using fork() to spawn 50 processes and have each fetch two files, or some similar configuration (a sketch follows below).
    I would think of fork() as the most 'Perl' way of doing it in the end. I have never used LWP::Parallel, but I would assume it is based on I/O multiplexing (there is a good section on this in Network Programming with Perl), which is complex, and eventually the disk I/O will cause slowness with large amounts of data. I would benchmark to find out whether it is any quicker than the fork version.
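    A rough sketch of that fork() approach, using LWP::Simple (the URL list and filename scheme are invented for illustration):

        use strict;
        use warnings;
        use LWP::Simple qw(getstore);

        my @urls    = map { "http://www.example.com/page$_.html" } 1 .. 100;   # hypothetical
        my $workers = 50;

        # deal the URLs out round-robin, two per worker in this case
        my @buckets;
        push @{ $buckets[ $_ % $workers ] }, $urls[$_] for 0 .. $#urls;

        for my $bucket (@buckets) {
            defined( my $pid = fork ) or die "fork failed: $!";
            next if $pid;                      # parent: start the next child
            for my $url (@$bucket) {           # child: fetch its share, then exit
                ( my $file = $url ) =~ s{\W+}{_}g;
                getstore( $url, $file );
            }
            exit 0;
        }

        1 while wait != -1;                    # parent reaps all the children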
    speling champ of tha claz uf 1997
    -- MZSanford
Re: Downloading URL's in Parallel with Perl
by eduardo (Curate) on Sep 11, 2001 at 17:01 UTC
    I just wanted to make a quick comment. Make sure you are not falling into the fallacious logic of believing that calling forth the gods of "explicit parallelism" guarantees you a speedup. Remember, in situations like this, where you are *pulling* on the dataflow and, more importantly, your data set rests on a node to which you do not have a guaranteed transfer rate, it is possible that parallelizing your GETs will not increase your actual throughput or reduce the wall-clock time for the entire transaction.

    Remember your von Neumann bottleneck: it is doubtful that what is slowing down your task is the overhead of processing the data; it is much more likely that the bottleneck is the actual data pipe (in other words, not processing but bandwidth!). Attempting to stuff 10k/sec of data down a 1k/sec pipe won't make the pipe bigger... it may actually slow down your overall wall-clock time due to TCP retransmissions and other assorted baddies. I'm a big fan of parallelization throughout... just make sure it makes *sense* in your particular configuration.

Re: Downloading URL's in Parallel with Perl
by larsen (Parson) on Sep 11, 2001 at 23:49 UTC
Re: Downloading URL's in Parallel with Perl
by Ntav (Sexton) on Sep 11, 2001 at 22:39 UTC
    I'd *really* recommend LWP::Parallel::UserAgent. As well as retrieving in parallel, it's easy to set a timeout, deal with redirects, claim to be using such-and-such a browser, and so on. Check the docs for skeleton code for the problem you describe.
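    For example, those knobs look roughly like this (the particular settings are just placeholders):

        use LWP::Parallel::UserAgent;

        my $pua = LWP::Parallel::UserAgent->new;
        $pua->timeout(20);                          # give up on slow servers
        $pua->redirect(1);                          # follow redirects automatically
        $pua->agent('Mozilla/4.0 (compatible)');    # claim to be such and such a browser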
    Ntav.
    NAPH
Re: Downloading URL's in Parallel with Perl
by perrin (Chancellor) on Sep 11, 2001 at 17:24 UTC
    Use wget. It's very fast and has a simple command-line interface you can use.
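    For instance, something along these lines (wget's -i flag reads URLs from a file; the fork-one-wget-per-URL variant is only a sketch):

        use strict;
        use warnings;

        # one URL per line in urls.txt; wget fetches them one after another
        system 'wget', '-q', '-i', 'urls.txt';

        # or, for parallelism, launch one wget per URL and reap them all
        my @urls = map { "http://www.example.com/page$_.html" } 1 .. 100;   # hypothetical
        for my $url (@urls) {
            defined( my $pid = fork ) or die "fork failed: $!";
            unless ($pid) {                      # child
                exec 'wget', '-q', $url;
                die "exec failed: $!";
            }
        }
        1 while wait != -1;                      # wait for every wget to finish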

    If you can't use that, stick with multi-process Perl. Forking is faster than LWP::Parallel. Use HTTP::GHTTP, or HTTP::Lite if you can't compile C extensions.
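    A bare-bones fetch with HTTP::GHTTP, per its synopsis (the URL is a placeholder); drop something like this into the per-child loop of a forking script:

        use strict;
        use warnings;
        use HTTP::GHTTP;

        my $r = HTTP::GHTTP->new;
        $r->set_uri('http://www.example.com/index.html');   # placeholder URL
        $r->process_request;                                 # performs the (blocking) fetch
        my $body = $r->get_body;                             # response body as a string
        print length($body), " bytes fetched\n";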

Re: Downloading URL's in Parallel with Perl
by tachyon (Chancellor) on Sep 11, 2001 at 16:44 UTC

    What's wrong with 100 processes? On Unix that's fine; any decent OS will handle it, which of course excludes Windows. As for raw speed, suck it and see is the traditional approach.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print