Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

So I have a script that is harvesting and collating HTML from more than one site, as in:

use LWP::Simple;
$contentsA = get $urlA;
$contentsB = get $urlB;
and so on. Now as I add N sites, the program slows down linearly with N. What I'd really like to do is launch requests for sites A..N in parallel.

fork seems the wrong way to do this -- should I drop LWP and open a read pipe from lynx, like

open(CONTENTSA, "lynx -dump $urlA |");

or is there some other (non-threaded) way of launching several LWP requests in parallel?

-clay


Re: Getting html from more than one site simultaneously?
by merlyn (Sage) on Sep 18, 2001 at 19:23 UTC
Re: Getting html from more than one site simultaneously? (boo)
by boo_radley (Parson) on Sep 18, 2001 at 19:29 UTC
    LWP has a parallel module -- you probably want to take a look at LWP::Parallel before you start resorting to piping through lynx.
    Update : Teach me to answer the phone before submitting :)
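
    A minimal sketch of the LWP::Parallel approach, following the shape of the module's documented synopsis (the URLs here are placeholders): all requests are registered up front, then fetched concurrently by a single wait call.

        use LWP::Parallel::UserAgent;
        use HTTP::Request;

        my @urls = ('http://siteA.example/', 'http://siteB.example/');

        my $pua = LWP::Parallel::UserAgent->new;
        $pua->timeout(30);     # per-connection timeout, in seconds
        $pua->redirect(1);     # follow redirects

        # Register every request first; nothing is fetched yet.
        foreach my $url (@urls) {
            if (my $res = $pua->register(HTTP::Request->new(GET => $url))) {
                print STDERR $res->error_as_HTML;    # could not register
            }
        }

        # wait() blocks until all registered requests have completed,
        # fetching them in parallel, and returns the collected entries.
        my $entries = $pua->wait;
        foreach my $key (keys %$entries) {
            my $response = $entries->{$key}->response;
            print $response->request->url, ": ", $response->code, "\n";
        }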
Re: Getting html from more than one site simultaneously?
by perrin (Chancellor) on Sep 18, 2001 at 19:47 UTC
    If you don't mind using an external program, you could look at PUF. Also, POE has a component for this (POE::Component::Client::HTTP). But Randal's forking approach is simpler and more robust; a bare sketch of that idea follows.
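
    This is not Randal's actual code, just a bare-bones illustration of the forking idea with made-up URLs and filenames: each child fetches one URL and saves it to a file (a child cannot simply hand a Perl variable back to its parent), and the parent waits for all the children to finish.

        use strict;
        use LWP::Simple;

        my %urls = (
            siteA => 'http://siteA.example/',
            siteB => 'http://siteB.example/',
        );

        my @pids;
        while (my ($name, $url) = each %urls) {
            my $pid = fork;
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {                 # child: fetch one URL and exit
                getstore($url, "$name.html");
                exit 0;
            }
            push @pids, $pid;                # parent: remember the child
        }
        waitpid($_, 0) for @pids;            # reap all the children
        # The pages are now in siteA.html, siteB.html, ...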
Re: Getting html from more than one site simultaneously?
by Zaxo (Archbishop) on Sep 19, 2001 at 05:27 UTC

    As an alternative to the other suggestions, you can use Parallel::ForkManager with LWP::Simple.

    This is example code from 'man Parallel::ForkManager':

    use LWP::Simple;
    use Parallel::ForkManager;
    ...
    @links = (
        ["http://www.foo.bar/rulez.data", "rulez_data.txt"],
        ["http://new.host/more_data.doc", "more_data.doc"],
        ...
    );
    ...
    # Max 30 processes for parallel download
    my $pm = new Parallel::ForkManager(30);

    foreach my $linkarray (@links) {
        $pm->start and next;           # do the fork

        my ($link, $fn) = @$linkarray;
        warn "Cannot get $fn from $link"
            if getstore($link, $fn) != RC_OK;

        $pm->finish;                   # do the exit in the child process
    }
    $pm->wait_all_children;

    First you need to instantiate the ForkManager with the "new" constructor. You must specify the maximum number of processes to be created. If you specify 0, then NO fork will be done; this is good for debugging purposes.

    Next, use $pm->start to do the fork. $pm->start returns 0 in the child process and the child's pid in the parent process (see fork in perlfunc(1p)). The "and next" makes the parent skip the rest of the loop body and move on to the next link. NOTE: $pm->start dies if the fork fails.

    $pm->finish terminates the child process (assuming a fork was done in the "start").

    Hope that helps.

    After Compline,
    Zaxo