hacker has asked for the wisdom of the Perl Monks concerning the following question:

This is more of a theory question, but I'm curious which of the three would be fastest to integrate into a recursive screen-scraping engine: LWP::Parallel, HTTP::GHTTP, or raw IO::Socket code wrapped around some regexen and URI/URI::URL code.

I'm doing some benchmarking of LWP::Parallel::UserAgent right now, and noticing that successive tests against the same site yield some very different results. Could this be site-related? Local bandwidth-related? Module-related?
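
Here's the kind of back-to-back timing I mean, stripped down to plain LWP (the URL and the five-run count are just placeholders):

use strict;
use Time::HiRes qw(gettimeofday tv_interval);
use LWP::UserAgent;

my $url = shift || 'http://www.example.com/';    # placeholder URL
my $ua  = LWP::UserAgent->new;

# fetch the same page several times in a row and print each elapsed time
for my $run (1 .. 5) {
    my $t0  = [gettimeofday];
    my $res = $ua->get($url);
    printf "run %d: %.3fs  %s\n", $run, tv_interval($t0), $res->status_line;
}

(Time::HiRes is only there for sub-second resolution; the stock Benchmark module would do as well.)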

Along these lines, what would be the fastest way to fetch content that comes back with a 200 or 302 response code (assuming the site is up, responding, and the content is valid)?
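
To be concrete, by "comes back with" I mean a check along these lines (a simplified, plain-LWP sketch rather than the parallel code; the URL is a placeholder):

use strict;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://www.example.com/');   # placeholder URL

# 200 and 302 are the only codes I care about here. Note that LWP follows
# GET redirects by default, so a 302 may already have been chased to its
# final target by the time code() is checked.
if ($res->code == 200 || $res->code == 302) {
    my $html = $res->content;
    # ... hand $html to the link extractor ...
}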

Here's what I have so far to test this:

use strict;
use Data::Dumper;
use LWP::Parallel::UserAgent;
use HTTP::Request;
use HTML::SimpleLinkExtor;
use LWP::Parallel::Protocol::http;
use URI;

# alias LWP::UserAgent's _new_response into LWP::Parallel::UserAgent
*LWP::Parallel::UserAgent::_new_response = \&LWP::UserAgent::_new_response;

my $pagecount = 1;
my $url       = $ARGV[0];
my $request   = HTTP::Request->new(GET => $url);
my $browser   = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4b) Gecko/20030514';

my $ua = LWP::UserAgent->new;
$ua->agent($browser);

# debugging messages. See 'perldoc LWP::Debug'
# use LWP::Debug qw(+);

my $response    = $ua->request($request);
my $status_line = $response->status_line;
my $html        = $response->content;

my $extor = HTML::SimpleLinkExtor->new();
$extor->parse($html);

my @img_srcs   = $extor->img;
my @a_hrefs    = $extor->a;
my @base_hrefs = $extor->base;

# weed out duplicate links
my %saw;
my @out = grep { !$saw{$_}++ } @a_hrefs;

my @urls;
my $uri      = URI->new($url)->canonical;
my $host     = $uri->host;
my $g_scheme = $uri->scheme;

foreach my $site (@out) {
    my $p_uri    = URI->new($site)->canonical;
    my $p_scheme = $p_uri->scheme;
    if (!defined $p_scheme || $p_scheme !~ /http/) {
        # turn protocol-relative and relative links into absolute ones
        $site =~ s,^//,http://,;
        $site = "$g_scheme://$host/$site";
    }
    push @urls, $site;
}

my $reqs = [ map { HTTP::Request->new('GET', $_) } @urls ];

my $pua = LWP::Parallel::UserAgent->new();
$pua->in_order  (0);
$pua->duplicates(1);
$pua->timeout   (2);
$pua->max_req   (100);
$pua->max_hosts (100);
$pua->redirect  (1);

my $urlcount = 0;
foreach my $req (@$reqs) {
    if ( my $res = $pua->register($req) ) {
        print STDERR $res->error_as_HTML;
    }
    $urlcount++;
}
print "Total valid (unique) urls found: $urlcount\n\n";

my $entries = $pua->wait();
foreach (keys %$entries) {
    my $res  = $entries->{$_}->response;
    my $html = $res->content;
    print "Fetching link $pagecount\n\n";
    open FILE, ">$pagecount.html" or die $!;
    print FILE $html;
    close FILE;
    $pagecount++;
}

This code snippet works as-is (there's much more code not included here, but it's irrelevant for this node), but running it multiple times against the same host, one run after another, seems to yield very different fetch times. The log also periodically reports some warnings (errors?) about being out of bandwidth. I have plenty of bandwidth, and the sites I'm hitting are very small, with very small Content-Length values.

LWP::Parallel::UserAgent::_check_bandwith: No open request-slots available
LWP::Parallel::UserAgent::_make_connections_unordered: Not enough bandwidth for request

Thoughts? Comments? Suggestions?

Re: LWP::Parallel vs. HTTP::GHTTP vs. IO::Socket
by ajt (Prior) on May 16, 2003 at 07:55 UTC

    hacker

    I can't comment on LWP::Parallel directly, but I did some crude benchmarks on LWP, HTTP::GHTTP, HTTP::Lite and HTTP::MHTTP. What I found was hardly surprising:

    • LWP is a big module that's slow to load, and once it's loaded it's still pretty slow. It can do just about anything, but it's not a speed demon.
    • Lite is quicker than LWP to load, and quicker in use, but it's still not what you would call fast.
    • GHTTP, as expected, was fast to load and fast in use, much faster than either of the pure-Perl modules. I can't get it to work under mod_perl on Windows, but that's my only complaint (a bare-bones fetch is sketched after this list).
    • MHTTP was the only surprise. It has the most basic API and isn't object-oriented like the others, but it's even faster than GHTTP, in both module load time and in actual use.
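
    For reference, a bare-bones GHTTP fetch looks roughly like this (written from memory, so treat it as a sketch; the URL is a placeholder):

    use HTTP::GHTTP;

    my $r = HTTP::GHTTP->new;
    $r->set_uri('http://www.example.com/');   # placeholder URL
    $r->process_request;                      # blocking fetch

    my ($code, $reason) = $r->get_status;
    if ($code == 200) {
        my $body = $r->get_body;              # raw response body
        # ... scrape away ...
    }

    As far as I recall there's no automatic redirect chasing or cookie jar here, which is part of why it's so light.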

    UPDATE: It should be possible to compile the two C-based modules, GHTTP and MHTTP, on Windows. I believe that currently only GHTTP has a precompiled PPM available. Building the module on Windows is just a case of asking a nice person with a compiler to do the work for you (CrazyPPM repository, interested?). I've recently spoken with Piers, and if you have any bugs to submit for MHTTP, let him know and he'll have a look at them for you.


    --
    "It's not magic, it's work..."
    ajt
      Along these lines, would it be faster to use LWP::Parallel, even though it is a bit heavier and slower, to fetch requests in parallel, or to use something like HTTP::MHTTP with fork() or threads and grab the requests from @urls one at a time?

      My concern here is that I'll have an array and some hashes tracking urls that are seen, unseen, down, bad, and so on, and I need to make sure that urls the parent puts into those hashes and arrays (as links are yanked from the pages in %seen) can be picked up by the processes already fetching in fork() or registered in parallel. Would this require some sort of shared memory to get working properly? Can a forked process read and write an array or hash created by the parent of the fork?

      I've got a lot of this code "functioning", but now is the time to refactor and get the performance up to speed (pun intended) for a production distribution of the tool.

        You can use Parallel::ForkManager to parallelize HTTP::MHTTP or HTTP::GHTTP calls easily and apply a limit to the maximum number of child processes.

        There are a number of ways to get the retrieved data back to the parent (or to another process) that don't require shared memory.
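
        The simplest is to have each child write its result to a file (or a pipe, or a database row) and let the parent collect everything once wait_all_children returns. A rough sketch of that, with the ten-child cap, the GHTTP fetch, and the numbered-file naming all placeholders of my own:

        use strict;
        use Parallel::ForkManager;
        use HTTP::GHTTP;

        my @urls = @ARGV;                          # URLs to fetch
        my $pm   = Parallel::ForkManager->new(10); # at most 10 children at once

        my $n = 0;
        foreach my $url (@urls) {
            my $file = ++$n . ".html";             # pick the name before forking
            $pm->start and next;                   # parent: queue the next URL

            # child: one fetch, one file, then exit
            my $r = HTTP::GHTTP->new;
            $r->set_uri($url);
            $r->process_request;
            my ($code) = $r->get_status;
            if ($code == 200) {
                open my $fh, '>', $file or die $!;
                print $fh $r->get_body;
                close $fh;
            }
            $pm->finish;
        }
        $pm->wait_all_children;                    # parent blocks until the last child exits
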

Re: LWP::Parallel vs. HTTP::GHTTP vs. IO::Socket
by perrin (Chancellor) on May 16, 2003 at 01:41 UTC
    Most likely HTTP::GHTTP. It is much faster than standard LWP, and I doubt you will beat it with any custom socket code of your own in perl. In my experience LWP::Parallel is slow and hard to use.