in reply to IPC::Open, Parallel::ForkManager, or Sockets::IO for parallelizing?

What's easiest will depend on your existing code but it could well be that LWP::Parallel could just slot in nicely. Fetch in batches of "a dozen or two" in parallel, then process the results before moving on to the next batch. It won't be the absolute most efficient but it will get you from serial to parallel for what I expect to be the slow part of the process with minimal fuss.


🦛

Re^2: IPC::Open, Parallel::ForkManager, or Sockets::IO for parallelizing?
by mldvx4 (Hermit) on Sep 05, 2023 at 15:19 UTC

    Thanks. I took a closer look at LWP::Parallel and it may come in handy later¹. From the documentation, I see how to fetch batches of links, and I wonder if there is any way to parallelize the processing of the results in the same move. That way, fetch+process runs continuously, rather than fetch in parallel, wait, and then process in parallel.

    I'll be digging into the options mentioned in the other threads, too.

    #!/usr/bin/perl
    use LWP::Parallel::UserAgent;
    use HTTP::Request;
    use strict;
    use warnings;

    my @feeds = (
        'http://localhost/feed1.xml', # rss
        'http://localhost/feed2.xml', # atom
        'http://localhost/foo',       # 404
    );

    my $requests = &prepare_requests(@feeds);
    my $entries  = &fetch_feeds($requests);

    foreach my $k (keys %$entries) {
        my $res = $entries->{$k}->response;
        print "Answer for '", $res->request->url, "' was \t",
            $res->code, ": ", $res->message, "\n"; # $res->content,"\n";
    }
    exit(0);

    sub prepare_requests {
        my (@feeds) = (@_);
        my $requests;
        foreach my $url (@feeds) {
            push(@$requests, HTTP::Request->new('GET', $url));
        }
        return($requests);
    }

    sub fetch_feeds {
        my ($requests) = (@_);
        my $pua = LWP::Parallel::UserAgent->new();
        $pua->in_order  (0); # do NOT wait to handle requests in order of registration
        $pua->duplicates(1); # ignore duplicates
        $pua->timeout   (9); # in seconds
        $pua->redirect  (1); # follow redirects
        $pua->max_hosts (3); # max locations accessed in parallel
        foreach my $req (@$requests) {
            print "Registering '" . $req->url . "'\n";
            if (my $res = $pua->register($req, \&handle_answer, 8192, 1)) {
                # print STDERR $res->error_as_HTML;
                print $res->error_as_HTML;
            }
            else {
                print qq(ok\n);
            }
        }
        my $entries = $pua->wait();
        return($entries);
    }

    sub handle_answer {
        my ($content, $response, $protocol, $entry) = @_;
        if (length($content)) {
            $response->add_content($content);
        }
        else {
            1;
        }
        return(undef);
    }

    ¹ That's the thing about CPAN, there are so many useful modules with great accompanying documentation that discovery can be a challenge. So I am very appreciative of everyone's input here.

      wonder if there is any way to parallelize the processing of the results in the same move

      Have you tried just putting your processing in the handle_answer callback? That may be all you need. But I would still profile it first, because it would be a big surprise if the feed-processing step weren't dwarfed by the fetch times.


      🦛

        Have you tried just putting your processing in the handle_answer callback?

        Yes, I had started to look at that. As far as I can tell, handle_answer keeps getting called until the HTTP response is complete. I suppose there is a way to identify when the response is finally complete? For now, I will try building out the script as-is, with LWP in parallel and the rest serial. Some of the preparatory parts seem much faster than expected, based on trials in Perl compared to another scripting language, so it may very well not be an issue. Although I have lots of RAM on this computer, I'd like to find a way to process the responses as they come in, so that all 700 to 800 of them don't sit around whole at the same time.

        Resuming more reading...
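        One way to sketch the "process each response as it completes" idea is to subclass the user agent and override on_return, which the LWP::Parallel::UserAgent documentation describes as a hook called once a request has finished. This is an untested sketch, not a verified recipe, and process_feed is a hypothetical stand-in for real RSS/Atom parsing:

        package My::FeedAgent;
        use strict;
        use warnings;
        use parent 'LWP::Parallel::UserAgent';

        # on_return is one of the hooks the LWP::Parallel::UserAgent POD
        # suggests overriding in a subclass; it is invoked once per request,
        # after the whole response has arrived -- a natural place to process
        # a feed and then let it go out of scope, keeping memory use flat.
        sub on_return {
            my ($self, $request, $response, $entry) = @_;
            if ($response->is_success) {
                process_feed($request->url, $response->content);
            }
            else {
                warn "Failed: ", $request->url, " (", $response->code, ")\n";
            }
            return;
        }

        # Hypothetical per-feed processing -- replace with real parsing.
        sub process_feed {
            my ($url, $xml) = @_;
            print "Processing ", length($xml), " bytes from $url\n";
        }

        1;

        With a subclass like this, the main loop would construct My::FeedAgent->new() instead of LWP::Parallel::UserAgent->new() and otherwise register requests as before.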

Re^2: IPC::Open, Parallel::ForkManager, or Sockets::IO for parallelizing?
by mldvx4 (Hermit) on Oct 02, 2023 at 23:33 UTC

    Thanks. I've taken a closer look at LWP::Parallel now and have some questions about how it handles many (most?) HTTPS sites. For now, it seems to return HTTP status "503 Service Unavailable" for sites that exist and are accessible via other agents. Here is one example:

    #!/usr/bin/perl
    use LWP::Parallel::UserAgent;
    use LWP::Debug qw(+);
    use HTTP::Headers;
    use HTTP::Request;
    use strict;
    use warnings;

    my $headers = HTTP::Headers->new(
        'User-Agent' => "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36",
    );

    my @requests;
    foreach my $url ('https://blog.arduino.cc/feed/') {
        push(@requests, HTTP::Request->new('GET', $url, $headers));
    }

    # new parallel agent
    my $pua = LWP::Parallel::UserAgent->new();
    $pua->in_order  (0);
    $pua->duplicates(1);
    $pua->timeout   (9);
    $pua->redirect  (0);
    $pua->max_hosts (5);
    $pua->nonblock  (0);

    foreach my $req (@requests) {
        if (my $res = $pua->register($req, \&handle_answer, 8192)) {
            print $res->error_as_HTML;
        }
        else {
            print qq(ok\n);
        }
    }

    my $entries = $pua->wait();
    foreach my $k (keys %$entries) {
        my $res = $entries->{$k}->response;
        my $url = $res->request->url;
        print $res->code, qq(\t $url\n);
    }
    exit(0);

    sub handle_answer {
        my ($content, $response, $protocol, $entry) = @_;
        if (length($content)) {
            $response->add_content($content);
        }
        return(undef);
    }

    As one can see with various browsers, the feed in question is there, yet it is one of the feeds that LWP::Parallel chokes on.

      have some questions about how it should handle many (most?) HTTPS sites.

      Yeah, it seems to be pretty much all of them, which is a real shame. I guess it must have been about 6 or 7 years ago that I last used LWP::Parallel for anything serious, and back then this wasn't really an issue. In the meantime, the heavy hand of Google has de facto forced most of the web over onto HTTPS, and now this is a major consideration.

      Having tested this briefly against one of my own sites it does actually appear to be downloading the content in that the server receives, accepts and serves the request OK. It's just that the user agent has some sort of internal problem with the response.

      It might be worth raising a ticket, although there are plenty open already. Still, it would alert other users to the problem.


      🦛
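      One quick way to narrow the fault down is to fetch the same URL serially with plain LWP::UserAgent, which uses the same underlying HTTPS support (LWP::Protocol::https). This is a diagnostic sketch, not a fix: if the serial fetch succeeds while LWP::Parallel returns 503 for the identical URL, that points at LWP::Parallel itself rather than the local TLS stack or the remote server:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use LWP::UserAgent;

      # The same URL that LWP::Parallel reported as 503 above.
      my $url = 'https://blog.arduino.cc/feed/';

      my $ua  = LWP::UserAgent->new(timeout => 9);
      my $res = $ua->get($url);

      # A 200 here, alongside a 503 from LWP::Parallel for the same URL,
      # suggests the problem is inside LWP::Parallel's HTTPS handling.
      print $res->code, " ", $res->message, "\n";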

        I'm willing to try making a bug report. Is there an alternate approach to raising a ticket, other than that link?

        The self-censored comment: I ask because the link offered goes not to a web page or web site but to a "web app", and a broken "web app" at that. After 10 minutes of faffing about with the broken "web app", I was able to create an account and log in. However, after an additional 15 minutes I was not able to make any headway in actually getting a complete web form in order to report a bug, let alone actually report a bug. I enjoy Perl a lot and am really grateful for all the knowledge here, but I have zero tolerance for javascript, especially when it is abused to block what used to be a simple activity. :(