use POE;
by mugwumpjism (Hermit) on Mar 08, 2002 at 13:32 UTC
You might want to check out POE::Component::Client::UserAgent. One of the test scripts (t/02-multi.t) does pretty much what you want.
Update 11/03/02: Here is a much shorter script that at least demonstrates fetching a list of documents, using 10 concurrent sessions. Liberal addition of "print" statements to assist understanding of the code is left as an exercise for the reader.
#!/usr/bin/perl -w
use strict;
use POE;
use POE::Component::Client::UserAgent;
use HTTP::Request;

my @urls = map { "http://sam.vilain.net/$_" }
           qw(index.html hobbies.html CV foo bar);

sub _start {
    # Give each session its own UserAgent component with a unique alias.
    $_[HEAP]->{alias} = "useragent" . $_[ARG0];
    POE::Component::Client::UserAgent->new(alias => $_[HEAP]->{alias});
    $_[KERNEL]->yield("next");
}

sub next {
    if (my $url = pop @urls) {
        # Fetch one URL; the postback fires "next" again when the
        # response arrives, so each session works through @urls serially.
        $_[KERNEL]->post(
            $_[HEAP]->{alias} => "request",
            {
                request  => HTTP::Request->new(GET => $url),
                response => $_[SESSION]->postback('next'),
            }
        );
    }
    else {
        # Nothing left to fetch: shut down this session's component.
        $_[KERNEL]->post($_[HEAP]->{alias} => "shutdown");
    }
}

# Ten sessions all pull from the one @urls list, giving ten
# documents in flight at a time.
for (1 .. 10) {
    POE::Session->create(
        inline_states => {
            _start => \&_start,
            next   => \&next,
        },
        args => [ $_ ],
    );
}

$poe_kernel->run();
PS: LWP::Parallel::UserAgent 2.51 has a bug:
- Line 182 of LWP::Parallel::UserAgent.pm needs to be
my $self = new LWP::UserAgent;
instead of
my $self = new LWP::UserAgent $init;
- You may also need to insert the line
use HTML::HeadParser;
near the top of LWP::Parallel::Protocol.pm.
Re: Using LWP::Parallel
by Juerd (Abbot) on Mar 08, 2002 at 10:54 UTC
...
while (my $ref = $sth->fetchrow_hashref) {
    $pua->register( HTTP::Request->new(GET => $ref->{'url_en'}) );
}

# Note: wait() returns a reference to a hash of entry objects,
# not a list, so iterate over its values.
my $entries = $pua->wait();
for (values %$entries) {
    my $res = $_->response;
    ...
}
...
Good luck!
++ vs lbh qrpbqrq guvf hfvat n ge va Crey :)
Nabgure bar vs lbh qvq fb jvgubhg ernqvat n znahny svefg.
-- vs lbh hfrq OFQ pnrfne ;)
- Whreq
Re: Using LWP::Parallel
by hossman (Prior) on Mar 09, 2002 at 00:47 UTC
Maybe I'm missing something, but in neither of the two replies posted so far do I see anything that addresses wilstephens's specific question about limiting the amount of parallelism to a specific number (his example was 10).
I've never used LWP::Parallel, but after a quick glance at the docs (the concept intrigues me) I notice this method for LWP::Parallel::RobotUA ...
$pua->max_req(2);   # max parallel requests per server
but that's not quite what we're looking for.
I can't for the life of me see anything obvious here. I would guess you could do this by only registering 10 URLs at a time, getting them in parallel, then registering another 10, etc. But this is wasteful of cycles: some URLs will finish faster than others, yet you're waiting for all 10 before proceeding to #11. (I would guess there is a better way.)
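The batch-of-10 idea can be sketched as follows. This is only a sketch: the @urls list is a hypothetical stand-in, and the comment marks where a real script would register() each URL in the batch and then wait() on the whole batch before starting the next one.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical URL list standing in for the real one from the database.
my @urls = map { "http://example.com/page$_" } 1 .. 23;

my $batch_size = 10;
my @batch_counts;
while (my @batch = splice @urls, 0, $batch_size) {
    # In the real script: $pua->register(HTTP::Request->new(GET => $_))
    # for each URL in @batch, then $pua->wait() before the next batch.
    push @batch_counts, scalar @batch;
}
print "@batch_counts\n";    # prints "10 10 3"
```

The drawback is exactly the one described above: every batch runs at the speed of its slowest URL.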
UPDATE: LWP::Parallel::UserAgent also seems to have a max_req method, as well as a max_hosts method -- so if you know that all of your URLs were for different hosts, OR were all for the same host, you could use one of them to get your result. But I still don't see any way to optimally fetch no more than 10 URLs at a time without prior knowledge of your URLs.
Actually, it's quite trivial to get the POE solution to fetch 10 documents at a time: you just create 10 sessions, each of which repeatedly pulls a URL off a master list and fetches that document.
POE::Session->create(
    inline_states => {
        _start   => \&_start,
        _stop    => \&_stop,
        response => \&response,
        _signal  => \&_signal,
    },
);
becomes this...
for my $i (1 .. 10) {
    POE::Session->create(
        inline_states => {
            _start   => \&_start,
            _stop    => \&_stop,
            response => \&response,
            _signal  => \&_signal,
        },
    );
}
and change _start to loop over only a tenth of @urls. Does that sound about right?
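That split could be sketched like this (a hypothetical @urls list; the URLs are dealt round-robin so each of the ten sessions gets a roughly equal share):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical master list; the real one comes from the database.
my @urls = map { "http://example.com/doc$_" } 1 .. 25;

# Deal the URLs round-robin into 10 per-session work lists.
my @work;
for my $i (0 .. $#urls) {
    push @{ $work[$i % 10] }, $urls[$i];
}

# With 25 URLs, sessions 0..4 get 3 each and sessions 5..9 get 2 each.
printf "session %d: %d urls\n", $_, scalar @{ $work[$_] } for 0 .. 9;
```

Note that a fixed split still has the same weakness as batching: a session that draws slow URLs finishes late while the others sit idle, which is why the shared-master-list approach above is preferable.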
Frankly, I'm still not clear on why this can't be done in a more straightforward manner with LWP::Parallel directly. It has the functionality to limit the number of parallel requests to an individual server -- OR to limit the number of different servers it sends requests to at the same time -- so why isn't there a more general way to limit the TOTAL number of parallel requests?
Re (tilly) 1: Using LWP::Parallel
by tilly (Archbishop) on Mar 09, 2002 at 03:27 UTC
This is not an answer to your question. Rather, it is a comment for everyone who answered it without noticing a very important detail.
What you have written is a robot. It is very bad netiquette not to look for and respect robots.txt. Any time anyone asks a question where it is clear from their code that they have written a robot that doesn't do this, please make sure to bring up robots.txt, and point to WWW::RobotRules (which comes with LWP).
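For completeness, here is a minimal sketch of checking a robots.txt with WWW::RobotRules before fetching. The robots.txt content is supplied inline rather than fetched, and the example.com URLs are hypothetical; a real robot would fetch the file from each host first.

```perl
#!/usr/bin/perl -w
use strict;
use WWW::RobotRules;

# The agent name is matched against User-agent lines in robots.txt.
my $rules = WWW::RobotRules->new('OpticDB LinkCheck/0.1');

# Normally you would GET http://example.com/robots.txt;
# here its content is supplied inline for illustration.
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END
$rules->parse('http://example.com/robots.txt', $robots_txt);

# Consult the rules before registering each request.
print $rules->allowed('http://example.com/index.html')
    ? "allowed\n" : "disallowed\n";
print $rules->allowed('http://example.com/private/x.html')
    ? "allowed\n" : "disallowed\n";
```

(LWP::Parallel also ships an LWP::Parallel::RobotUA, which handles this bookkeeping for you.)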
Given that:
- The User agent is "OpticDB LinkCheck/0.1"
- The list of links pinged is stored in a Database
- The pages aren't being scraped for more links
The assumption that it's a robot seems misplaced. It seems more like a site analyzer to me (i.e. checking that 'important' URLs are working).
Thank you for all your replies! After some more help from c.l.p.m, I've managed to get this far (code below). However, one question remains unanswered, namely the need to limit the number of parallel connections.
And for the record, this is just a script to check that the links I have in a database are correct, i.e. I'm checking for a return status code of 200; if a link fails, I grab the status code so I can delete it from the database.
sub check_links_results {
    print $query->header;

    use LWP::Parallel::UserAgent qw(:CALLBACK);
    my $ua = LWP::Parallel::UserAgent->new;
    $ua->nonblock(1);
    $ua->agent("OpticDB LinkCheck/0.1");

    connect_to_db();
    my $clock_start = time();

    # Fetch name_en too, so it is available when building the HTML below.
    $sth = $dbh->prepare("SELECT url_en,id,name_en FROM $DB_MYSQL_NAME");
    $sth->execute();

    my (%ids, %names);
    while (my ($url, $id, $name) = $sth->fetchrow_array) {
        $ids{$url}   = $id;
        $names{$url} = $name;
        $ua->register(HTTP::Request->new(GET => $url));
    }
    $sth->finish;
    $dbh->disconnect;

    my $responses = $ua->wait;

    my $clock_finish = time - $clock_start;           # end timer and compare
    $time_taken = sprintf("%.2f", $clock_finish);     # trim time to 2 decimal points

    my $count = 0;
    while ((undef, my $entry) = each %$responses) {
        my $req  = $entry->request;
        my $res  = $entry->response;
        my $id   = $ids{$req->url};
        my $name = $names{$req->url};
        next if $res->code == 200;
        ++$count;
        my $res_code = $res->code;
        my $res_msg  = $res->message;
        $tmpl_show_record .= qq|
<table width="95%" border="0" cellspacing="0" cellpadding="2">
  <tr>
    <td width="2%" align="middle"> </td>
    <td width="6%" bgcolor="#EEEECC" align="right" valign="top"><font face="Arial, Helvetica, sans-serif" size="2">$id</font> </td>
    <td width="58%" bgcolor="#E9EBEF"> <font face="Arial, Helvetica, sans-serif" size="2">$name</font></td>
    <td width="20%" bgcolor="#FFDDDD"> <font face="Arial, Helvetica, sans-serif" size="2">$res_code : $res_msg</font></td>
    <td width="14%" bgcolor="#EEEECC" valign="top" align="center"><a href="odb.cgi?action=edit_record&id=$id"><img src="/images/icons/edit.gif" width="15" height="15" alt="[ edit ]" border="0"></a>
      <a href="odb.cgi?action=del_record&id=$id" onClick="return confirm('Delete record $id?')"><img src="/images/icons/delete.gif" width="15" height="15" alt="[ delete ]" border="0"></a>
      <a href="odb.cgi?action=toggle_live&id=$id">
|;
        if ($data_status eq "Live") {   # $data_status is set elsewhere in the script
            $tmpl_show_record .= qq|<img src="/images/icons/liveyes.gif" border="0">|;
        }
        else {
            $tmpl_show_record .= qq|<img src="/images/icons/liveno.gif" border="0">|;
        }
        $tmpl_show_record .= qq|
      </a>
    </td>
  </tr>
</table>
<BR>
|;
    }

    $num_dead = $count;
    if ($count == 0) {
        &error_html("No dead links found!");
        exit;
    }
    &parse_template("$PATH_TEMPLATE/check_links_results.tmpl");
}
--
Wiliam Stephens <wil@stephens.org>