in reply to Speeding things up

Hello AlwaysSurprised and welcome to the monastery and (back!) to the wonderful world of perl!

Do you want speed? Parallelize your program! While I invite you to take a look at MCE, unfortunately it is not useful in this case because the LWP::* modules are not thread safe.

But there is LWP::Parallel (and its LWP::Parallel::UserAgent); if your connection is fast you should notice a big improvement.
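A minimal sketch of the LWP::Parallel::UserAgent approach, adapted from the module's documented synopsis; the URL list is only a placeholder:

use strict;
use warnings;
use LWP::Parallel::UserAgent;
use HTTP::Request;

# Placeholder URLs; substitute the pages you actually want to fetch.
my @urls = map { "http://example.com/page$_" } 1 .. 10;

my $pua = LWP::Parallel::UserAgent->new;
$pua->in_order(1);     # return results in registration order
$pua->timeout(10);     # per-connection timeout in seconds
$pua->redirect(1);     # follow redirects

# register() queues a request; it returns a response only on failure.
for my $url (@urls) {
    if ( my $res = $pua->register( HTTP::Request->new( GET => $url ) ) ) {
        print STDERR $res->error_as_HTML;
    }
}

# wait() fires all queued requests in parallel and collects the entries.
my $entries = $pua->wait;
for my $key ( keys %$entries ) {
    my $res = $entries->{$key}->response;
    print $res->request->url, " => ", $res->code, "\n";
}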

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; maybe one day you reinvent one of THE WHEELS.

Replies are listed 'Best First'.
Re^2: Speeding things up -- LWP::Parallel
by marioroy (Prior) on Apr 29, 2018 at 10:43 UTC

    Hi Discipulus,

    ... while i invite you to take a look to MCE unfortunately is not usefeul in this case because LWP::* modules are not thread safe ...

    An event-type module is typically preferred for this use-case. However, I took a look and tracked this down. See Re: Crash with ForkManager on Windows. When running parallel using LWP::*, it is essential to load IO::Handle and a couple of Net::* modules before spawning workers. The latest MCE and MCE::Shared (MCE::Hobo) releases do this automatically if LWP::UserAgent is present.

    use LWP::Simple;

    # Pre-load essential modules for extra stability.
    if ( $INC{'LWP/UserAgent.pm'} && !$INC{'Net/HTTP.pm'} ) {
        require IO::Handle;
        require Net::HTTP;
        require Net::HTTPS;
    }
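    For context, a hedged sketch of how that pre-load fits into a parallel fetch with MCE::Loop (not Mario's exact code); the worker count and URL list are placeholders:

    use strict;
    use warnings;
    use LWP::Simple;
    use MCE::Loop;

    # Pre-load before workers spawn, per the snippet above
    # (recent MCE releases do this automatically).
    require IO::Handle;
    require Net::HTTP;
    require Net::HTTPS;

    MCE::Loop::init { max_workers => 8, chunk_size => 1 };

    # Placeholder URLs; substitute the real list.
    my @urls = map { "http://example.com/page$_" } 1 .. 100;

    mce_loop {
        my $html = get($_);    # LWP::Simple::get inside each worker
        MCE->say( defined $html ? "fetched $_" : "failed $_" );
    } @urls;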

    Regards, Mario

Re^2: Speeding things up -- LWP::Parallel
by AlwaysSurprised (Novice) on Apr 15, 2018 at 21:59 UTC

    Some fiddling has been done. LWP & HTTP replaced with Mojo. I've now gone from ~50 sec to examine 100 pages down to 16 sec, i.e. from 2 URL/s to 6 URL/s.

    I wonder how fast you have to hit Argos before it starts to think it's a DDoS? Anyone got any ideas how I can find out? And I don't mean bash it hard until it squeaks. That's just rude.

Re^2: Speeding things up -- LWP::Parallel
by mr_ron (Deacon) on Apr 30, 2018 at 00:48 UTC

    I don't have much experience with Mojo::UserAgent, but the Mojo Cookbook has examples for both non-blocking and blocking concurrency. For a simple test I got blocking concurrency with promises working. I built a small test server to avoid any webmaster complaints and easily read and did a little parsing on about 100 fetched URLs per second.

    I put a 1-second delay into page delivery, and performance seemed to depend on web server configuration. To get good performance I configured for 100 workers restricted to 1 connection each. So with a server named dos-serve-sp.pl I ran:

    ./dos-serve-sp.pl prefork -w 100 -c 1

    Test server:

    #!/usr/bin/env perl
    use Modern::Perl;
    use Mojolicious::Lite;
    use Mojo::UserAgent;

    my $ua = Mojo::UserAgent->new;

    # cache page to echo
    my $res = $ua->get('www.software-path.com')->result;
    if    ( $res->is_error )       { say $res->message }
    elsif ( not $res->is_success ) { die "Unknown response from Mojo::UserAgent" }

    $res->dom->at('head')->append_content(
        '<base href="http://www.software-path.com">'
    );

    get '/' => sub {
        my ($c) = @_;
        sleep 1;
        $c->render( text => $res->dom->to_string, format => 'html' );
    };

    app->start;

    Test client with blocking concurrency from promises:

    #!/usr/bin/env perl
    use Modern::Perl;
    use Mojo::UserAgent;

    my @all_url         = ('http://127.0.0.1:3000/') x 200;
    my $concurrent_load = 100;
    my $ua              = Mojo::UserAgent->new;

    while (@all_url) {
        my @concurrent_read = map {
            $ua->get_p($_)->then( sub {
                my $tx     = shift;
                my $result = $tx->result;
                if ( $result->is_success ) {
                    say $result->dom->at('title')->text;
                }
                else {
                    say $result->is_error
                        ? $result->message
                        : "Unknown response from Mojo::UserAgent";
                }
            } )    # end ->then sub
        } splice @all_url, 0, $concurrent_load;
        Mojo::Promise->all(@concurrent_read)->wait;
    }
    Ron

      Although 100/sec sounds fun, I rather think it would look like some sort of feeble DDoS attack and get my IP blocked. I've read that 5/sec is considered high by some spider writers. Apparently you can register your site with Google and set a parameter to limit its strike rate, though sometimes the Google spider just ignores it.

      I don't run a web server, but I bet the logs are just stuffed full of bots gathering pages.

      I do wonder what a polite rate is though; fast enough so that old results are still timely but slow enough to not be annoying.

        I don't run a web server, but I bet the logs are just stuffed full of bots gathering pages.

        I do and they are.

        Some web hosts don't care, so YMMV. At $WORK we take a more heavy-handed approach. Large sources of traffic get noticed and investigated. Sources of bad traffic (of any size) get investigated. If the results of the investigations warrant it, the bad actors get banned from the network in its entirety.

        As for a polite rate, stick to 1 request per second at most and you won't even show up on anyone's radar. If you absolutely need more than that, contact the site in advance, as they probably have an API for whatever purpose you have in mind. Do as you would be done by.
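        A minimal sketch of that advice using LWP::RobotUA, which honours robots.txt and enforces a minimum delay between requests to the same host; the identity strings and URLs are placeholders:

        use strict;
        use warnings;
        use LWP::RobotUA;

        # Placeholder identity; use your real bot name and contact address.
        my $ua = LWP::RobotUA->new(
            agent => 'my-polite-bot/0.1',
            from  => 'me@example.com',
        );
        $ua->delay( 1 / 60 );    # delay is in minutes: 1/60 min = 1 second

        # Placeholder URLs; substitute the pages you want to fetch.
        my @urls = map { "http://example.com/page$_" } 1 .. 10;

        for my $url (@urls) {
            my $res = $ua->get($url);    # sleeps as needed between requests
            print $res->code, " $url\n";
        }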

        Oh, and a single-IP DDoS is called a DoS. ;-)