    LWP::Protocol::collect: read 1418 bytes
    LWP::Protocol::collect: read 1418 bytes
    LWP::Protocol::collect: read 1418 bytes
It's very, very slow when fetching a page to take such a small 'nibble' each time. Is this server-side dependent? Can I increase that on my end, programmatically, to speed things up? (Changing the MTU didn't help, but it was worth a shot =). This isn't terribly important, but the next issue is..
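In case it's relevant: LWP::UserAgent's request() takes an optional third argument, a read-size hint, which suggests a larger chunk size to the protocol layer. A minimal sketch (the URL is just a placeholder; the hint is only advisory, the socket may still return less per read):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new( GET => 'http://example.com/' );

my $content = '';
# Third argument is a read-size hint: ask LWP to collect the
# response in larger chunks instead of small nibbles.
my $res = $ua->request(
    $req,
    sub { my ($chunk) = @_; $content .= $chunk },
    65536,
);
```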
I'm also using Parallel::ForkManager to spawn fetchers, and I've got 10 running concurrently:
my $pm = Parallel::ForkManager->new(10);
I notice that 90% or more of the script's execution time is spent inside the wait() portion of ForkManager. Why does it sit there so long, blocking on forked children?
My code looks roughly like this, and it works, but it now seems horribly slow, blocking on children. I recently moved from arrays to hashes for storing the data and links (thanks Corion and jeffa), and at that point I noticed things slowing down considerably. I see about 2,000 wait() events for every 1 fetch event from the children:
    ## Run when children are forked
    $pm->run_on_start(
        sub {
            my ($pid, $link) = @_;
            $link =~ s/\s+$//;
            ## Count the pages fetched thus far
            $pagecount++;
        }
    );

    ## Run when child processes complete
    $pm->run_on_finish(
        sub {
            my ($pid, $exit_code, $ident) = @_;
            print "\n** $ident out of the pool "
                . "with PID $pid and exit code: $exit_code\n";
        }
    );

    ## Run when blocking/waiting for children
    $pm->run_on_wait(
        sub {
            print "-" x 74, "\n";
            print "\n** Waiting for child ...\n";
            print "-" x 74, "\n";
        },
        0.1
    );

    fetch_content(@urls);
    $pm->wait_all_children;

    ## Fetch the actual page and links
    sub fetch_content {
        my @urls = @_;
        for my $link (@urls) {
            my $pid = $pm->start($link) and next;
            # fetch the page, extract the links
            # (all of the fetching/extraction works)
            $pm->finish;
        }
    }
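One thing worth noting about the wait() count: the second argument to run_on_wait is the period in seconds between callback invocations while the parent is blocked, so at 0.1 the "Waiting for child" banner fires ten times a second for as long as any child is still fetching. That alone can account for thousands of wait events per fetch. A minimal sketch with a longer period (the sleep is a stand-in for a page fetch):

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(10);

# Second argument is the callback period in seconds; raising it
# from 0.1 to 1.0 cuts the number of wait callbacks ~10x without
# changing how long the children actually take.
$pm->run_on_wait(
    sub { print "** still waiting for children ...\n" },
    1.0,
);

for my $n (1 .. 20) {
    $pm->start and next;      # parent continues the loop
    sleep 1 + int rand 3;     # child: pretend to fetch a page
    $pm->finish;
}
$pm->wait_all_children;
```

The callback frequency only affects how chatty the parent is while blocked; the blocking itself lasts until the slowest outstanding child calls finish.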