wilstephens has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm trying to convert the code below to use LWP::Parallel. This should be fairly simple, right? Well, I'm not sure where to start. How do I limit the number of parallel requests to, say, 10, instead of just slurping the entire database at once?

    use LWP::UserAgent;

    $ua = new LWP::UserAgent;
    $ua->agent("OpticDB LinkCheck/0.1");

    &connect_to_db;

    my $clock_start = time;    # start timer

    $sth = $dbh->prepare("SELECT * FROM $DB_MYSQL_NAME");
    $sth->execute ();

    my $count = 0;

    while (my $ref = $sth->fetchrow_hashref ()) {
        my $req = new HTTP::Request GET => $ref->{'url_en'};
        my $res = $ua->request($req);

        $res_id   = $ref->{id};
        $res_code = $res->code;
        $res_msg  = $res->message;

        unless ($res_code eq "200") {
            $count ++;
            $tmpl_show_record .= qq| .. html to show erroneous records goes here ... |;
        }
    }

    $num_dead = $count;

    if ($count == 0) {
        &error_html("No dead links found!");
        exit;
    }

    $sth->finish();

    my $clock_finish = time - $clock_start;
    $time_taken = sprintf ("%.2f", $clock_finish);

    $dbh->disconnect;


Can anyone please help me modify the above code? Thanks!

--
Wiliam Stephens <wil@stephens.org>

Replies are listed 'Best First'.
use POE;
by mugwumpjism (Hermit) on Mar 08, 2002 at 13:32 UTC
    You might want to check out POE::Component::Client::UserAgent. One of the test scripts (t/02-multi.t) does pretty much what you want.

    Update 11/03/02: Here is a much shorter script that at least demonstrates fetching a list of documents, and uses 10 sessions (POE's cooperative stand-in for threads). Liberal adding of "print" statements to assist understanding of the code is left as an exercise for the reader.

        #!/usr/bin/perl -w
        use strict;
        use POE;
        use POE::Component::Client::UserAgent;

        my @urls = map { "http://sam.vilain.net/$_" }
            qw(index.html hobbies.html CV foo bar);

        sub _start {
            $_[HEAP]->{alias} = "useragent".$_[ARG0];
            POE::Component::Client::UserAgent->new (alias => $_[HEAP]->{alias});
            $_[KERNEL]->yield("next");
        }

        sub next {
            if (my $url = pop @urls) {
                $_[KERNEL]->post (
                    $_[HEAP]->{alias} => "request",
                    {
                        request  => HTTP::Request->new(GET => $url),
                        response => $_[SESSION]->postback('next')
                    }
                );
            } else {
                $_[KERNEL]->post (
                    $_[HEAP]->{alias} => "shutdown",
                );
            }
        }

        for (1..10) {
            POE::Session -> create (
                inline_states => {
                    _start => \&_start,
                    next   => \&next,
                },
                args => [ $_ ], # arguments
            );
        }

        $poe_kernel->run();

    PS: LWP::Parallel::UserAgent 2.51 has a bug:
    1. Line 182 of LWP::Parallel::UserAgent.pm needs to be "my $self = new LWP::UserAgent;" instead of "my $self = new LWP::UserAgent $init;".
    2. You may also need to insert the line "use HTML::HeadParser;" somewhere near the top of LWP::Parallel::Protocol.pm.

Re: Using LWP::Parallel
by Juerd (Abbot) on Mar 08, 2002 at 10:54 UTC
        ...
        while (my $ref = $sth->fetchrow_hashref) {
            $pua->register( HTTP::Request->new(GET => $ref->{'url_en'}) );
        }

        # wait() returns a reference to a hash of entries, not a list
        my $entries = $pua->wait();
        for my $entry (values %$entries) {
            my $res = $entry->response;
            ...
        }
        ...

    Good luck!

    ++ vs lbh qrpbqrq guvf hfvat n ge va Crey :)
    Nabgure bar vs lbh qvq fb jvgubhg ernqvat n znahny svefg.
    -- vs lbh hfrq OFQ pnrfne ;)
        - Whreq
    

Re: Using LWP::Parallel
by hossman (Prior) on Mar 09, 2002 at 00:47 UTC
    Maybe I'm missing something, but in neither of the two replies posted so far do I see anything that addresses wilstephens's specific question about limiting the amount of parallelism to a specific number (his example was 10).

    I've never used LWP::Parallel, but after a quick glance at the docs (the concept intrigues me) I notice this method for LWP::Parallel::RobotUA ...

    $pua->max_req ( 2); # max parallel requests per server
    but that's not quite what we're looking for.

    I can't for the life of me see anything obvious here. I would guess you could do this by only registering 10 URLs at a time, getting them in parallel, then registering another 10, etc. But this is wasteful of cycles: some URLs will finish faster than others, but you're waiting for all 10 before proceeding to #11.
    (I would guess there is a better way)
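
    The batching idea can be sketched in plain Perl as follows. This is only an illustration, not LWP::Parallel code: batches() and fetch_batch() are hypothetical helpers, fetch_batch() is a stand-in for "register these URLs, then wait()", and the example.com URLs are made up.

```perl
use strict;
use warnings;

# Split a list into consecutive batches of at most $size items.
sub batches {
    my ($size, @items) = @_;
    my @out;
    push @out, [ splice @items, 0, $size ] while @items;
    return @out;
}

# Hypothetical stand-in for "register these URLs, then wait()".
# A real version would hand the batch to a parallel user agent.
sub fetch_batch {
    my (@urls) = @_;
    return map { "$_ => fetched" } @urls;
}

my @urls    = map { "http://example.com/page$_" } 1 .. 25;   # made-up URLs
my @batches = batches(10, @urls);

my @results;
for my $batch (@batches) {
    # Nothing from the next batch starts until this call returns,
    # so fast URLs sit idle behind the slowest one in their batch.
    push @results, fetch_batch(@$batch);
}

printf "%d URLs fetched in %d batches\n", scalar @results, scalar @batches;
```

    The cap of 10 is respected, but each batch costs as much as its slowest member, which is the waste described above.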

    UPDATE: LWP::Parallel::UserAgent also seems to have a max_req method, as well as a max_hosts method -- so if you know that all of your URLs were for different hosts, OR were all for the same host, you could use one of them to get your result, but I still don't see any way to optimally fetch no more than 10 URLs at a time without any previous knowledge of your URLs.

      Actually, it's quite trivial to get the POE solution to fetch 10 documents at a time; you just create 10 sessions which continually (get a URL off a master list, then fetch that document).
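
      To see why a pool of workers pulling from one master list differs from fetching in fixed groups of 10, here is a small plain-Perl simulation. It does not use POE or the network; the per-URL costs are made-up numbers and batch_time() and pool_time() are hypothetical helpers, so treat it as a sketch of the scheduling idea only.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Fixed groups: each group of up to $limit jobs takes as long as its slowest job.
sub batch_time {
    my ($limit, @jobs) = @_;
    my $total = 0;
    $total += max(splice @jobs, 0, $limit) while @jobs;
    return $total;
}

# Worker pool: whichever of the $limit workers is free first takes the next job.
sub pool_time {
    my ($limit, @jobs) = @_;
    my @free_at = (0) x $limit;    # time at which each worker becomes idle
    for my $job (@jobs) {
        my ($w) = sort { $free_at[$a] <=> $free_at[$b] } 0 .. $#free_at;
        $free_at[$w] += $job;      # earliest-free worker picks up the job
    }
    return max(@free_at);
}

# Made-up costs in seconds: one slow URL followed by eleven fast ones.
my @cost = (5, (1) x 11);

printf "groups of 10: %d s, pool of 10: %d s\n",
    batch_time(10, @cost), pool_time(10, @cost);
```

      With these numbers the fixed groups need 6 simulated seconds and the pool needs 5, because the two leftover URLs start as soon as any worker is free instead of waiting for the slow one.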
        I wouldn't call it trivial if people have no experience with POE. I for one didn't understand from your example how you were suggesting it would help him towards his goal of retrieving no more than 10 at a time.

        After reading the POE, POE::Session, & POE::Component::Client::UserAgent docs, I *think* what you're suggesting is that the poster modify 02multi.t so that this...

            POE::Session -> create (
                inline_states => {
                    _start   => \&_start,
                    _stop    => \&_stop,
                    response => \&response,
                    _signal  => \&_signal
                },
            );

        becomes this...

            for (my $i = 0; $i < 10; $i++) {
                POE::Session -> create (
                    inline_states => {
                        _start   => \&_start,
                        _stop    => \&_stop,
                        response => \&response,
                        _signal  => \&_signal
                    },
                );
            }

        and change _start to only loop over 1/10 of @urls.

        Does that sound about right?

        Frankly, I'm still not clear on why this can't be done in a more straightforward manner with LWP::Parallel directly. It has the functionality to limit the number of parallel requests to an individual server -- OR to limit the number of different servers it sends requests to at the same time ... so why isn't there a more general way to limit the TOTAL number of parallel requests?

Re (tilly) 1: Using LWP::Parallel
by tilly (Archbishop) on Mar 09, 2002 at 03:27 UTC
    This is not an answer to your question. Rather it is a comment for everyone who answered it without noticing this very important detail.

    What you have written is a robot. It is very bad netiquette not to look for and respect robots.txt. Any time anyone asks a question where it is clear from their code that they have written a robot that doesn't do this, please make sure to bring up robots.txt, and point to WWW::RobotRules. (Which comes with LWP.)

      Given that:
      • The User agent is "OpticDB LinkCheck/0.1"
      • The list of links pinged is stored in a Database
      • The pages aren't being scraped for more links
      The assumption that it's a robot seems misplaced. It seems more like a site analyzer to me (i.e., a check that 'important' URLs are working).
        Thank you for all your replies! After some more help from c.l.p.m, I've got this far (code below). However, one question remains unanswered: how to limit the number of parallel connections.

        And for the record, this is just a script to check if the links I have in a database are correct, ie, I'm checking for a return status code of 200, and if not, grab the status code so I can delete them from the database.

            sub check_links_results {

                print $query->header;

                use LWP::Parallel::UserAgent qw(:CALLBACK);

                my $ua = LWP::Parallel::UserAgent->new;
                $ua->nonblock(1);
                $ua->agent("OpticDB LinkCheck/0.1");

                connect_to_db();

                my $clock_start = time();

                # name_en is selected as well because the HTML below displays it
                $sth = $dbh->prepare("SELECT url_en,id,name_en FROM $DB_MYSQL_NAME");
                $sth->execute ();

                my (%ids, %names);
                while( my ($url, $id, $name) = $sth->fetchrow_array ) {
                    $ids{$url}   = $id;
                    $names{$url} = $name;
                    $ua->register(HTTP::Request->new(GET => $url));
                }
                $sth->finish;
                $dbh->disconnect;

                my $responses = $ua->wait;

                my $clock_finish = time - $clock_start;          # end timer and compare
                $time_taken = sprintf ("%.2f", $clock_finish);   # trim time to 2 decimal points

                my ($count, $htmlout) = (0, "");
                while( (undef, my $entry) = each %$responses ) {
                    my $req  = $entry->request;
                    my $res  = $entry->response;
                    my $id   = $ids{$req->url};
                    my $name = $names{$req->url};
                    next if $res->code == 200;
                    ++$count;

                    # status shown next to each dead link
                    my $res_code = $res->code;
                    my $res_msg  = $res->message;

                    $tmpl_show_record .= qq|
                    <table width="95%" border="0" cellspacing="0" cellpadding="2">
                    <tr>
                    <td width="2%" align="middle">&nbsp;</td>
                    <td width="6%" bgcolor="#EEEECC" align="right" valign="top"><font face="Arial, Helvetica, sans-serif" size="2">$id</font>&nbsp;</td>
                    <td width="58%" bgcolor="#E9EBEF">&nbsp;<font face="Arial, Helvetica, sans-serif" size="2">$name</font></td>
                    <td width="20%" bgcolor="#FFDDDD">&nbsp;<font face="Arial, Helvetica, sans-serif" size="2">$res_code : $res_msg</font></td>
                    <td width="14%" bgcolor="#EEEECC" valign="top" align="center"><a href="odb.cgi?action=edit_record&id=$id"><img src="/images/icons/edit.gif" width="15" height="15" alt="[ edit ]" border="0"></a>
                    &nbsp;
                    <a href="odb.cgi?action=del_record&id=$id" onClick="return confirm('Delete record $id?')"><img src="/images/icons/delete.gif" width="15" height="15" alt="[ delete ]" border="0"></a>
                    &nbsp;
                    <a href="odb.cgi?action=toggle_live&id=$id">
                    |;

                    # $data_status is not set anywhere in this snippet
                    if ($data_status eq "Live") {
                        $tmpl_show_record .= "<img src=\"/images/icons/liveyes.gif\" border=\"0\">";
                    } else {
                        $tmpl_show_record .= "<img src=\"/images/icons/liveno.gif\" border=\"0\">";
                    }

                    $tmpl_show_record .= qq|
                    </a>
                    </td>
                    </tr>
                    </table>
                    <BR>
                    |;
                }

                $num_dead = $count;

                if( $count == 0 ) {
                    &error_html("No dead links found!");
                    exit;
                }

                # the database handle was already disconnected above
                &parse_template("$PATH_TEMPLATE/check_links_results.tmpl");
            }


        --
        Wiliam Stephens <wil@stephens.org>