    LWP::Protocol::collect: read 1418 bytes
    LWP::Protocol::collect: read 1418 bytes
    LWP::Protocol::collect: read 1418 bytes
It's very, very slow when fetching a page to take such a small 'nibble' each time. Is this server-side dependent? Can I increase that on my end, programmatically, to speed things up? (Changing the MTU didn't help, but it was worth a shot =). This isn't terribly important, but the next issue is..
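In case it's relevant: LWP::UserAgent's request() takes an optional third argument, a read-size hint, which suggests a larger chunk size to the protocol layer. A minimal sketch (the URL is just a placeholder; the hint is only advisory, the socket may still return less per read):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new( GET => 'http://example.com/' );

my $content = '';
# Third argument is a read-size hint: ask LWP to collect the
# response in larger chunks instead of small nibbles.
my $res = $ua->request(
    $req,
    sub { my ($chunk) = @_; $content .= $chunk },
    65536,
);
```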
I'm also using Parallel::ForkManager to spawn fetchers, and I've got 10 running concurrently:
my $pm = Parallel::ForkManager->new(10);
I notice that 90% or more of the script's execution time is spent inside the wait() portion of ForkManager. Why does it sit there so long, blocking on forked children?
My code looks roughly like this, and it works, but it now seems horribly slow, blocking on children. I recently moved from arrays to hashes for storing the data and links (thanks Corion and jeffa), and at that point I noticed things slowing down considerably. I see about 2,000 wait() events for every 1 fetch event from the children:
    ## Run when children are forked
    $pm->run_on_start(
        sub {
            my ($pid, $link) = @_;
            $link =~ s/\s+$//;
            ## Count the pages fetched thus far
            $pagecount++;
        }
    );

    ## Run when child processes complete
    $pm->run_on_finish(
        sub {
            my ($pid, $exit_code, $ident) = @_;
            print "\n** $ident out of the pool "
                . "with PID $pid and exit code: $exit_code\n";
        }
    );

    ## Run when blocking/waiting for children
    $pm->run_on_wait(
        sub {
            print "-" x 74, "\n";
            print "\n** Waiting for child ...\n";
            print "-" x 74, "\n";
        },
        0.1
    );

    fetch_content(@urls);
    $pm->wait_all_children;

    ## Fetch the actual page and links
    sub fetch_content {
        my @urls = @_;
        for my $link (@urls) {
            my $pid = $pm->start($link) and next;
            # fetch the page, extract the links
            # (all of the fetching/extraction works)
            $pm->finish;
        }
    }
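One thing worth noting about the wait() count: the second argument to run_on_wait is the period in seconds between callback invocations while the parent is blocked, so at 0.1 the "Waiting for child" banner fires ten times a second for as long as any child is still fetching. That alone can account for thousands of wait events per fetch. A minimal sketch with a longer period (the sleep is a stand-in for a page fetch):

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(10);

# Second argument is the callback period in seconds; raising it
# from 0.1 to 1.0 cuts the number of wait callbacks ~10x without
# changing how long the children actually take.
$pm->run_on_wait(
    sub { print "** still waiting for children ...\n" },
    1.0,
);

for my $n (1 .. 20) {
    $pm->start and next;      # parent continues the loop
    sleep 1 + int rand 3;     # child: pretend to fetch a page
    $pm->finish;
}
$pm->wait_all_children;
```

The callback frequency only affects how chatty the parent is while blocked; the blocking itself lasts until the slowest outstanding child calls finish.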