
Re^2: Speeding things up -- LWP::Parallel

by mr_ron (Chaplain)
on Apr 30, 2018 at 00:48 UTC ( #1213776=note )

in reply to Re: Speeding things up -- LWP::Parallel
in thread Speeding things up

I don't have much experience with Mojo::UserAgent, but the Mojo Cookbook has examples of both non-blocking and blocking concurrency. For a simple test I got blocking concurrency with promises working. To avoid any webmaster complaints I built a small test server, and I was able to fetch and lightly parse about 100 URLs per second.

I put a 1 second delay into page delivery, and performance seemed to depend on web server configuration. To get good performance I configured 100 workers restricted to 1 connection each, so I started the server with:

./ prefork -w 100 -c 1
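A quick back-of-envelope check of why that configuration tops out around 100 fetches per second (the worker, connection, and delay numbers are the ones used in this test):

```perl
#!/usr/bin/env perl
# Back-of-envelope throughput for the prefork settings above: each of
# the 100 workers handles 1 connection at a time, and every request
# holds its slot for the full 1 second artificial delay.
use strict;
use warnings;

my $workers     = 100;    # prefork -w 100
my $conns_each  = 1;      # prefork -c 1
my $delay_s     = 1;      # sleep 1 in the page handler

my $slots       = $workers * $conns_each;    # 100 concurrent slots
my $req_per_sec = $slots / $delay_s;
print "$req_per_sec\n";                      # prints 100
```

So the observed ~100 URLs/second is simply the server saturating all of its connection slots, not a property of the client.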

Test server:

#!/usr/bin/env perl
use Modern::Perl;
use Mojolicious::Lite;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# cache page to echo
my $res = $ua->get('')->result;
if    ($res->is_error)        { say $res->message }
elsif (not $res->is_success)  { die "Unknown response from Mojo::UserAgent" }
$res->dom->at('head')->append_content('<base href="">');

get '/' => sub {
    my ($c) = @_;
    sleep 1;
    $c->render(text => $res->dom->to_string, format => 'html');
};

app->start;

Test client with blocking concurrency from promises:

#!/usr/bin/env perl
use Modern::Perl;
use Mojo::UserAgent;
use Mojo::Promise;

my @all_url         = ('') x 200;
my $concurrent_load = 100;
my $ua              = Mojo::UserAgent->new;

while (@all_url) {
    my @concurrent_read = map {
        $ua->get_p($_)->then(sub {
            my $tx     = shift;
            my $result = $tx->result;
            if ($result->is_success) {
                say $result->dom->at('title')->text;
            }
            else {
                say $result->is_error
                    ? $result->message
                    : "Unknown response from Mojo::UserAgent";
            }
        })    # end ->then sub
    } splice @all_url, 0, $concurrent_load;
    Mojo::Promise->all(@concurrent_read)->wait;
}
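One limitation of batching with splice is that each batch waits for its slowest request before the next batch starts. As a sketch only (assuming a Mojolicious recent enough, roughly 8.0+, to provide Mojo::Promise->map, and reusing the empty placeholder URLs from the example above), the same loop can instead cap in-flight requests and refill as each one finishes:

```perl
#!/usr/bin/env perl
# Sketch: Mojo::Promise->map keeps at most 100 requests in flight and
# starts a new one as soon as any finishes, instead of waiting for a
# whole batch of 100 to complete.
use Modern::Perl;
use Mojo::UserAgent;
use Mojo::Promise;

my @all_url = ('') x 200;    # placeholder URLs, as in the post
my $ua      = Mojo::UserAgent->new;

Mojo::Promise->map(
    { concurrency => 100 },
    sub { $ua->get_p($_) },           # current URL is in $_
    @all_url
)->then(sub {
    # each argument is an arrayref holding one resolved transaction
    for my $r (@_) {
        my $result = $r->[0]->result;
        say $result->is_success
            ? $result->dom->at('title')->text
            : $result->message;
    }
})->catch(sub { warn shift })->wait;
```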

Re^3: Speeding things up -- LWP::Parallel
by AlwaysSurprised (Novice) on Jun 16, 2018 at 23:54 UTC

    Although 100/sec sounds fun, I rather think it would look like some sort of feeble DDoS attack and get my IP blocked. I've read that 5/sec is considered high by some spider writers. Apparently you can register your site with Google and set a parameter to limit its crawl rate, though sometimes the Google spider just ignores it.

    I don't run a web server, but I bet the logs are just stuffed full of bots gathering pages.

    I do wonder what a polite rate is though; fast enough so that old results are still timely but slow enough to not be annoying.

      I don't run a web server, but I bet the logs are just stuffed full of bots gathering pages.

      I do and they are.

      Some web hosts don't care, so YMMV. At $WORK we take a more heavy-handed approach. Large sources of traffic get noticed and investigated. Sources of bad traffic (of any size) get investigated. If the results of the investigations warrant it, the bad actors get banned from the network in its entirety.

      As for a polite rate, stick to 1 request per second at most and you won't even show up on anyone's radar. If you absolutely need more than that, contact the site in advance; they probably have an API for whatever purpose you have. Do as you would be done by.
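On the mechanics of "1 request per second at most": a minimal core-Perl throttle sketch (make_throttle is a hypothetical helper, not anything from the thread) that sleeps only as long as needed to hold a target rate, regardless of how long each fetch itself takes:

```perl
#!/usr/bin/env perl
# Client-side rate limiter using only core Perl. Each call to the
# returned closure blocks until the next scheduled slot, so fetches
# never exceed the requested rate even if some return instantly.
use strict;
use warnings;
use Time::HiRes qw(sleep time);

sub make_throttle {
    my ($per_second) = @_;
    my $interval = 1 / $per_second;
    my $next     = time;
    return sub {
        my $now = time;
        sleep($next - $now) if $next > $now;
        # schedule the following slot from whichever is later
        $next = ($next > $now ? $next : $now) + $interval;
    };
}

my $tick = make_throttle(1);    # at most 1 request per second
for my $url (1 .. 3) {
    $tick->();
    # fetch $url here with your HTTP client of choice
}
```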

      Oh, and a single-IP DDoS is called a DoS. ;-)
