Using LWP::Parallel

wilstephens has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
use POE; by mugwumpjism (Hermit) on Mar 08, 2002 at 13:32 UTC
You might want to check out POE::Component::Client::UserAgent. One of the test scripts (`t/02-multi.t`) does pretty much what you want.

Update 11/03/02: Here is a much shorter script that at least demonstrates fetching a list of documents, using 10 concurrent POE sessions. Liberal adding of "print" statements to assist understanding of the code is left as an exercise for the reader.

```perl
#!/usr/bin/perl -w
use strict;
use POE;
use POE::Component::Client::UserAgent;
use HTTP::Request;

# Master list of documents; every session takes its next URL from here.
my @urls = map { "http://sam.vilain.net/$_" }
    qw(index.html hobbies.html CV foo bar);

# Give each session its own user agent component, then ask for a first URL.
sub _start {
    $_[HEAP]->{alias} = "useragent" . $_[ARG0];
    POE::Component::Client::UserAgent->new(alias => $_[HEAP]->{alias});
    $_[KERNEL]->yield("next");
}

# Request the next URL, or shut the user agent down once the list is empty.
sub next {
    if (my $url = pop @urls) {
        $_[KERNEL]->post(
            $_[HEAP]->{alias} => "request",
            {
                request  => HTTP::Request->new(GET => $url),
                # The postback re-enters "next" when the response arrives.
                response => $_[SESSION]->postback('next'),
            }
        );
    }
    else {
        $_[KERNEL]->post($_[HEAP]->{alias} => "shutdown");
    }
}

# Ten sessions, so at most ten requests are in flight at any time.
for (1 .. 10) {
    POE::Session->create(
        inline_states => {
            _start => \&_start,
            next   => \&next,
        },
        args => [$_],    # arguments, passed to _start as ARG0
    );
}

$poe_kernel->run();
```

PS: LWP::Parallel::UserAgent 2.51 has a bug: line 182 of LWP::Parallel::UserAgent.pm needs to be `my $self = new LWP::UserAgent;` instead of `my $self = new LWP::UserAgent $init;`. You may also need to insert the line `use HTML::HeadParser;` somewhere near the top of `LWP::Parallel::Protocol.pm`.
Re: Using LWP::Parallel by Juerd (Abbot) on Mar 08, 2002 at 10:54 UTC
```perl
...
while (my $ref = $sth->fetchrow_hashref) {
    $pua->register( HTTP::Request->new(GET => $ref->{'url_en'}) );
}

# wait() returns a reference to a hash of entries, not a list.
my $entries = $pua->wait();
for my $entry (values %$entries) {
    my $res = $entry->response;
    ...
}
...
```

Good luck!

++ vs lbh qrpbqrq guvf hfvat n ge va Crey :) Nabgure bar vs lbh qvq fb jvgubhg ernqvat n znahny svefg. -- vs lbh hfrq OFQ pnrfne ;) - Whreq
Re: Using LWP::Parallel by hossman (Prior) on Mar 09, 2002 at 00:47 UTC
Maybe I'm missing something, but in neither of the two replies posted so far do I see anything that addresses wilstephens's specific question about limiting the amount of parallelism to a specific number (his example was 10).

I've never used LWP::Parallel, but after a quick glance at the docs (the concept intrigues me) I notice this method for LWP::Parallel::RobotUA ... `$pua->max_req ( 2); # max parallel requests per server` but that's not quite what we're looking for. I can't for the life of me see anything obvious here.

I would guess you could do this by only registering 10 URLs at a time, getting them in parallel, then registering another 10, etc. But this is wasteful of cycles: some URLs will finish faster than others, but you're waiting for all 10 before proceeding to #11. (I would guess there is a better way.)

UPDATE: LWP::Parallel::UserAgent also seems to have a `max_req` method, as well as a `max_hosts` method, so if you knew that all of your URLs were for different hosts, OR were all for the same host, you could use one of them to get your result. But I still don't see any way to optimally fetch no more than 10 URLs at a time without any previous knowledge of your URLs.
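For readers who want to see the "10 at a time" batching idea spelled out, here is a minimal sketch. It is an illustration only: `in_batches` and its callback are hypothetical names, not part of LWP::Parallel, and the URLs are placeholders. The callback marks the spot where a real script would register the batch with the parallel user agent and wait for all of it to finish.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP::Parallel): hand the items to
# $callback in groups of at most $size, one group at a time.
sub in_batches {
    my ($size, $callback, @items) = @_;
    die "batch size must be positive\n" unless $size > 0;
    my @results;
    while (my @batch = splice(@items, 0, $size)) {
        push @results, $callback->(@batch);
    }
    return @results;
}

# Placeholder URLs, for illustration only.
my @urls = map { "http://example.com/page$_.html" } 1 .. 25;

# This callback only reports the size of each batch; a real one would
# register the batch's requests and wait for every response.
my @sizes = in_batches(10, sub { scalar @_ }, @urls);
print "batch sizes: @sizes\n";
```

The weakness pointed out in the post remains: each batch takes as long as its slowest member, so the other slots sit idle until that one request returns.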
Re: Re: Using LWP::Parallel by mugwumpjism (Hermit) on Mar 11, 2002 at 19:44 UTC
Actually, it's quite trivial to get the POE solution to fetch 10 documents at a time; you just create 10 sessions, each of which repeatedly takes a URL off a master list and then fetches that document.
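The difference between fixed batches and a pool of workers sharing one master list can be illustrated with a small simulation that touches neither POE nor the network. This is a sketch under stated assumptions: the fetch times are made-up numbers, and `batch_time` and `pool_time` are hypothetical helper names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# Total time when jobs run in fixed batches: each batch takes as long as
# its slowest job, and the next batch starts only once it has finished.
sub batch_time {
    my ($workers, @durations) = @_;
    my $total = 0;
    while (my @batch = splice(@durations, 0, $workers)) {
        $total += max(@batch);
    }
    return $total;
}

# Total time when $workers workers each take the next job from a shared
# list as soon as they become free.
sub pool_time {
    my ($workers, @durations) = @_;
    my @free_at = (0) x $workers;    # time at which each worker is idle
    for my $d (@durations) {
        # The worker that becomes free first takes the next job.
        my ($first) = sort { $free_at[$a] <=> $free_at[$b] } 0 .. $#free_at;
        $free_at[$first] += $d;
    }
    return max(@free_at);
}

# Made-up fetch times in seconds.
my @durations = (5, 1, 1, 5, 1, 1);

printf "batches of 2: %d s\n", batch_time(2, @durations);
printf "pool of 2:    %d s\n", pool_time(2, @durations);
```

With these numbers the pool finishes sooner, because a worker that draws a quick document moves straight on to the next one instead of waiting for a slow neighbour.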
Re: Re: Re: Using LWP::Parallel by hossman (Prior) on Mar 11, 2002 at 20:33 UTC
I wouldn't call it trivial if people have no experience with POE. I for one didn't understand from your example how you were suggesting it would help him towards his goal of retrieving no more than 10 at a time.

After reading the docs for POE, POE::Session, and POE::Component::Client::UserAgent, I think what you're suggesting is that the poster modify `02-multi.t` so that this...

```perl
POE::Session->create(
    inline_states => {
        _start   => \&_start,
        _stop    => \&_stop,
        response => \&response,
        _signal  => \&_signal,
    },
);
```

becomes this...

```perl
for (my $i = 0; $i < 10; $i++) {
    POE::Session->create(
        inline_states => {
            _start   => \&_start,
            _stop    => \&_stop,
            response => \&response,
            _signal  => \&_signal,
        },
    );
}
```

and change `_start` to loop over only 1/10 of `@urls`. Does that sound about right?

Frankly, I'm still not clear on why this can't be done in a more straightforward manner with LWP::Parallel directly. It has the functionality to limit the number of parallel requests to an individual server, OR to limit the number of different servers it sends requests to at the same time ... why isn't there a more general way to limit the TOTAL number of parallel requests?
Re: Re: Re: Re: Using LWP::Parallel by mugwumpjism (Hermit) on Mar 12, 2002 at 20:13 UTC
Re (tilly) 1: Using LWP::Parallel by tilly (Archbishop) on Mar 09, 2002 at 03:27 UTC
This is not an answer to your question. Rather it is a comment for everyone who answered it without noticing this very important detail.

What you have written is a robot. It is very bad netiquette to not look for and respect robots.txt.

Any time anyone asks a question where it is clear from their code that they have written a robot that doesn't do this, please make sure to bring up robots.txt, and point to WWW::RobotRules. (Which comes with LWP.)
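As a rough sketch of what respecting robots.txt looks like with WWW::RobotRules, the example below parses a robots.txt body and asks whether two URLs may be fetched. The robots.txt content and the example.com URLs are made up and inlined so that nothing touches the network; a real script would fetch `/robots.txt` from each host first.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::RobotRules;

# The agent name is the one used elsewhere in this thread.
my $rules = WWW::RobotRules->new('OpticDB LinkCheck/0.1');

# Made-up robots.txt content, inlined instead of fetched from the site.
my $robots_url = 'http://example.com/robots.txt';
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

$rules->parse($robots_url, $robots_txt);

for my $url ('http://example.com/index.html',
             'http://example.com/private/report.html') {
    my $verdict = $rules->allowed($url) ? 'allowed' : 'disallowed';
    print "$url: $verdict\n";
}
```

Only URLs for which `allowed` returns true would then be passed on to the user agent.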
Re: Re (tilly) 1: Using LWP::Parallel by hossman (Prior) on Mar 09, 2002 at 04:04 UTC
Given that:

* The user agent is "OpticDB LinkCheck/0.1"
* The list of links pinged is stored in a database
* The pages aren't being scraped for more links

The assumption that it's a robot seems misplaced. It seems more like a site analyzer to me (i.e. checking that 'important' URLs are working).
Re: Re: Re (tilly) 1: Using LWP::Parallel by wilstephens (Acolyte) on Mar 09, 2002 at 18:22 UTC
Thank you for all your replies!

After some more help from c.l.p.m, I've managed to reach thus far (code below). However, one question remains unanswered, i.e. the need to limit the number of parallel connections.

And for the record, this is just a script to check whether the links I have in a database are correct, i.e. I'm checking for a return status code of 200, and if not, grabbing the status code so I can delete them from the database.

```perl
sub check_links_results {
    print $query->header;

    use LWP::Parallel::UserAgent qw(:CALLBACK);
    my $ua = LWP::Parallel::UserAgent->new;
    $ua->nonblock(1);
    $ua->agent("OpticDB LinkCheck/0.1");

    connect_to_db();
    my $clock_start = time();

    $sth = $dbh->prepare("SELECT url_en,id FROM $DB_MYSQL_NAME");
    $sth->execute();

    # Remember which record each URL belongs to and queue a request for it.
    my %ids;
    while ( my ($url, $id) = $sth->fetchrow_array ) {
        $ids{$url} = $id;
        $ua->register(HTTP::Request->new(GET => $url));
    }
    $sth->finish;
    $dbh->disconnect;

    my $responses = $ua->wait;

    my $clock_finish = time - $clock_start;        # end timer and compare
    $time_taken = sprintf("%.2f", $clock_finish);  # trim time to 2 decimal points

    my ($count, $htmlout) = (0, "");
    while ( (undef, my $entry) = each %$responses ) {
        my $req = $entry->request;
        my $res = $entry->response;
        my $id  = $ids{$req->url};
        next if $res->code == 200;
        ++$count;

        # Status code and message shown in the report row.
        my ($res_code, $res_msg) = ($res->code, $res->message);

        # NOTE: $ref->{'name_en'} and $data_status are not set anywhere in
        # this sub; the query above does not select those fields.
        $tmpl_show_record .= qq|
<table width="95%" border="0" cellspacing="0" cellpadding="2">
  <tr>
    <td width="2%" align="middle"> </td>
    <td width="6%" bgcolor="#EEEECC" align="right" valign="top"><font face="Arial, Helvetica, sans-serif" size="2">$id</font> </td>
    <td width="58%" bgcolor="#E9EBEF"> <font face="Arial, Helvetica, sans-serif" size="2">$ref->{'name_en'}</font></td>
    <td width="20%" bgcolor="#FFDDDD"> <font face="Arial, Helvetica, sans-serif" size="2">$res_code : $res_msg</font></td>
    <td width="14%" bgcolor="#EEEECC" valign="top" align="center"><a href="odb.cgi?action=edit_record&id=$id"><img src="/images/icons/edit.gif" width="15" height="15" alt="[ edit ]" border="0"></a>
      <a href="odb.cgi?action=del_record&id=$id" onClick="return confirm('Delete record $id?')"><img src="/images/icons/delete.gif" width="15" height="15" alt="[ delete ]" border="0"></a>
      <a href="odb.cgi?action=toggle_live&id=$id">
|;
        if ($data_status eq "Live") {
            $tmpl_show_record .= "<img src=\"/images/icons/liveyes.gif\" border=\"0\">";
        }
        else {
            $tmpl_show_record .= "<img src=\"/images/icons/liveno.gif\" border=\"0\">";
        }
        $tmpl_show_record .= qq|
      </a>
    </td>
  </tr>
</table>
<BR>
|;
    }

    $num_dead = $count;
    if ( $count == 0 ) {
        &error_html("No dead links found!");
        exit;
    }

    # The database handle was already disconnected above.
    &parse_template("$PATH_TEMPLATE/check_links_results.tmpl");
}
```

-- Wiliam Stephens <wil@stephens.org>
Re: Re: Re: Re (tilly) 1: Using LWP::Parallel by Juerd (Abbot) on Mar 09, 2002 at 19:13 UTC