hacker has asked for the wisdom of the Perl Monks concerning the following question:

Is it possible to get LWP::Protocol::collect to bite off larger amounts of data? With LWP::Debug qw(+); turned on, I get responses like:

    LWP::Protocol::collect: read 1418 bytes
    LWP::Protocol::collect: read 1418 bytes
    LWP::Protocol::collect: read 1418 bytes

It's very, very slow to fetch a page in such small 'nibbles'. Is this server-side dependent? Can I increase that on my end, programmatically, to speed things up? (Changing the MTU didn't help, but it was worth a shot =). This isn't terribly important, but the next issue is..

I'm also using Parallel::ForkManager to spawn fetchers, and I've got 10 running concurrently:

my $pm = Parallel::ForkManager->new(10);

I notice that 90% or more of the script's execution time is spent inside the wait() portion of ForkManager. Why does it sit there so long, blocking on forked children?

My code looks roughly similar to this, and works, but now seems horribly slow, blocking on children. I recently moved from using arrays to store the data and links, to hashes (thanks Corion and jeffa), and at that point, I noticed things slowing down considerably. I see about 2,000 wait() events for every 1 fetch event from the children:

    fetch_content(@urls);
    $pm->wait_all_children;

    ## Run when children are forked
    $pm->run_on_start(
        sub {
            my ($pid, $link) = @_;
            $link =~ s/\s+$//;
            ## Count the pages fetched thus far
            $pagecount++;
        }
    );

    ## Run when child processes complete
    $pm->run_on_finish(
        sub {
            my ($pid, $exit_code, $ident) = @_;
            print "\n** $ident out of the pool "
                . "with PID $pid and exit code: $exit_code\n";
        }
    );

    ## Run when blocking/waiting for children
    $pm->run_on_wait(
        sub {
            print "-" x 74, "\n";
            print "\n** Waiting for child ...\n";
            print "-" x 74, "\n";
        },
        0.1
    );

    ## Fetch the actual page and links
    sub fetch_content {
        my @urls = @_;
        for my $link (@urls) {
            my $pid = $pm->start($link) and next;
            # fetch the page, extract the links
            # (all of the fetching/extraction works)
            $pm->finish;
        }
    }

Replies are listed 'Best First'.
Re: Biting off more with LWP and problems with blocking forks()?
by perlplexer (Hermit) on May 25, 2003 at 15:54 UTC
    Can I increase that on my end, programatically, to speed things up?

    Try playing with the $size parameter in LWP::UserAgent::request().
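    For reference, LWP::UserAgent::request() takes an optional content callback and a read-size hint as its second and third arguments; the hint is passed down to the protocol layer. A minimal sketch (the URL and the 8K figure are just examples; whether larger reads actually arrive still depends on the network):

    ```perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);

    my $req     = HTTP::Request->new(GET => 'http://www.example.com/');
    my $content = '';

    # Second arg: callback invoked per chunk; third arg: read-size hint.
    my $res = $ua->request(
        $req,
        sub {
            my ($chunk, $response, $protocol) = @_;
            $content .= $chunk;    # collect the page ourselves
        },
        8192,                      # hint: read up to 8K at a time
    );
    ```

    Note that request() always hands back an HTTP::Response, even on failure, so check $res->is_success before trusting $content.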

    I notice that 90% or more of the script's execution time is spent inside the wait() portion of ForkManager. Why does it sit there so long, blocking on forked children?

    Well, yes. Why? Normally, the purpose of the parent process is to manage "worker" threads (processes). So, the parent process usually doesn't perform any useful tasks other than dispatching new processes to handle requests. If you need your parent process to do something while children are doing their stuff, you may want to investigate the use of waitpid() in its non-blocking mode -- waitpid(-1, WNOHANG). I'm not familiar with Parallel::ForkManager though and I'm not sure if there's a method that supports this functionality.
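    Not tested against Parallel::ForkManager, but a bare-bones non-blocking reap loop with core fork() and waitpid() might look like this (the sleeps stand in for real fetching and real parent-side work):

    ```perl
    use strict;
    use warnings;
    use POSIX ":sys_wait_h";    # for WNOHANG

    my %kids;
    for my $n (1 .. 3) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {        # child: pretend to fetch a page
            sleep 1;
            exit 0;
        }
        $kids{$pid} = 1;        # parent: remember the child
    }

    while (%kids) {
        # Reap any children that have finished, without blocking
        while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
            delete $kids{$pid};
            print "reaped $pid\n";
        }
        # Parent is free to do useful work here instead of spinning
        select(undef, undef, undef, 0.25);    # nap for 250ms
    }
    print "all children reaped\n";
    ```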

    --perlplexer
Re: Biting off more with LWP and problems with blocking forks()?
by no_slogan (Deacon) on May 25, 2003 at 17:37 UTC
    It's very, very slow to fetch a page in such small 'nibbles'. Is this server-side dependent?

    It actually depends on every network and router your traffic has to pass through. Taking small nibbles is built into the nature of the Internet. Normally, TCP tries to keep several packets "in the air" at once, so that the full capacity of the link is always being used. Some sites (notably Yahoo) seem to disable this, though, so you have to wait two complete round-trip times for each packet. There's nothing you can do about that.

    BTW, what kind of link are you trying to download all this stuff through? Maybe it's getting congested -- have you tried using fewer processes?