Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have to write a daemon that will fetch URLs from different servers (each document will be no more than 100kb) in parallel, with at least 20 requests running concurrently (the more, the better).

I have to choose which modules to use. I've heard that LWP is slow and too CPU-intensive for crawlers (at least that's what the WWW::Curl::Multi documentation says when recommending itself for crawlers), while WWW::Curl::Multi itself is broken (I've reported bugs in RT).

What options do I have besides LWP? I'm considering using threads and WWW::Curl::Easy to run a downloader inside each thread.
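
A minimal sketch of that threads-plus-WWW::Curl::Easy idea is below. The worker count, the 30-second timeout, taking URLs from @ARGV, and the Thread::Queue feed are illustrative assumptions, and whether WWW::Curl behaves well under ithreads is something that would still need checking:

#!/usr/bin/perl
# Sketch only: a fixed pool of worker threads drains a shared queue of URLs,
# each fetch using its own WWW::Curl::Easy handle.
use strict;
use warnings;
use threads;
use Thread::Queue;
use WWW::Curl::Easy;

my @urls  = @ARGV;                 # URLs to fetch (assumption: passed on the command line)
my $queue = Thread::Queue->new;    # shared work queue
$queue->enqueue(@urls);

my $workers = 20;                  # concurrency level (assumption)
my @threads = map { threads->create( \&worker ) } 1 .. $workers;
$_->join for @threads;

sub worker {
    # dequeue_nb returns undef once the queue is empty, ending the thread
    while ( defined( my $url = $queue->dequeue_nb ) ) {
        my $curl = WWW::Curl::Easy->new;    # one handle per request, nothing shared
        my $body = '';
        $curl->setopt( CURLOPT_URL,       $url );
        $curl->setopt( CURLOPT_WRITEDATA, \$body );
        $curl->setopt( CURLOPT_TIMEOUT,   30 );    # assumption: 30s per request
        my $rc = $curl->perform;
        if ( $rc == 0 ) {
            printf "%s: %d bytes\n", $url, length($body);
        }
        else {
            warn "$url: " . $curl->strerror($rc) . "\n";
        }
    }
}

Each thread builds its own WWW::Curl::Easy handle, so no libcurl state is shared between threads.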

This has to run on Linux. Ideally it should run on a Virtual Private Server (the hosting provider permits running spiders there), if possible (so please don't answer "use LWP and buy a server with an 8-core Intel CPU").

Thanks in advance for your answers!


Replies are listed 'Best First'.
Re: what modules you recommend for downloading hundreds of URLs per second in parallel?
by zentara (Cardinal) on Jun 13, 2008 at 12:06 UTC
    You could use pure sockets; see Fetching HTML Pages with Sockets for the basic idea. Here is a working bit of code that you can strip the extra fluff from. It probably has some code that needs improving, too.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Socket;

    # doesn't work well for images, but you can fix that
    my $url    = "http://zentara.net/index.html";
    my $infile = $url;
    $infile =~ tr#\/#-#;    # turn the URL into a flat filename
    print $infile;
    my $host = "zentara.net";
    $| = 1;

    my $start = times;

    my ( $iaddr, $paddr, $proto );
    $iaddr = inet_aton($host);
    #$iaddr = ( gethostbyname($host) )[4];
    $paddr = sockaddr_in( 80, $iaddr );
    $proto = getprotobyname('tcp');

    unless ( socket( SOCK, PF_INET, SOCK_STREAM, $proto ) ) {
        die "ERROR Dude: getUrl socket: $!";
    }
    unless ( connect( SOCK, $paddr ) ) {
        die "getUrl connect: $!\n";
    }

    my @head = (
        "GET $url HTTP/1.0",    # maybe better to use 1.0 instead of 1.1 for "no keep-alive" ??
        "User-Agent: Mozilla/4.78 [en] (X11; U; Safemode Linux i386)",
        "Pragma: no-cache",
        "Host: $host",
        "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
        "Accept-Language: en"
    );
    push( @head, "", "" );    # blank line ends the request header

    # Build header and print to socket
    my $header = join( "\015\012", @head );
    print "sending request\n$header\n\n";

    select SOCK;
    $| = 1;
    binmode SOCK;
    print SOCK $header;

    my $body = '';
    open( FH, "> $infile" ) or warn "$!\n";

    while (<SOCK>) {
        my $data = $_;
        $data =~ s/[\r\n\t]+$//s;
        $data =~ s/^[\r\n\t]+//s;
        last if $data =~ /^0$/s;    # bare "0" line (end of a chunked response)
        my $len = length($data);
        print STDOUT "len:$len\n";
        $body .= $data;
        last if $data =~ /\<\/html\>$/is;
        if ( $data =~ /\<\/body\>$/is ) {
            $body .= qq|</html>|;
            last;
        }
        print FH $data;
    }

    unless ( close(SOCK) ) {
        warn "getUrl close: $!";
    }
    select STDOUT;
    close FH;

    my $end  = times;
    my $diff = $end - $start;    # CPU seconds used, not wall-clock time
    print "Took $diff to access page\n";

    I'm not really a human, but I play one on earth. CandyGram for Mongo
Re: what modules you recommend for downloading hundreds of URLs per second in parallel?
by moritz (Cardinal) on Jun 13, 2008 at 12:45 UTC
    I've heard LWP is slow and too CPU-intensive for crawlers

    Then the first step should be to test that. Maybe it was too slow for somebody on their own, weak machine, but it might be no issue for you.
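
    A rough timing sketch along these lines (the 30-second timeout and taking URLs from @ARGV are assumptions) will tell you how long plain LWP actually takes on your machine:

    # Fetch a batch of URLs serially with LWP and time it, so you can judge
    # whether LWP itself is really the bottleneck before replacing it.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use Time::HiRes qw(time);

    my @urls = @ARGV;
    my $ua   = LWP::UserAgent->new( timeout => 30 );

    my $t0 = time;
    for my $url (@urls) {
        my $res = $ua->get($url);
        warn "$url: ", $res->status_line, "\n" unless $res->is_success;
    }
    printf "Fetched %d URLs in %.2f wall-clock seconds\n", scalar @urls, time - $t0;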

    If it's too slow for you, try searching CPAN for "spider" and "crawler"; maybe some of the results will help you.

    If you care very much about CPU time, consider using curl or wget, which are written in C and are probably less CPU-intensive.
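
    One way to keep such external downloads bounded and parallel is a forked worker pool. The sketch below assumes Parallel::ForkManager from CPAN and the stock curl binary, with a pool size of 20 picked arbitrarily:

    # Sketch: a bounded pool of forked workers, each shelling out to curl.
    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @urls = @ARGV;                          # assumption: URLs on the command line
    my $pm   = Parallel::ForkManager->new(20); # at most 20 downloads at once

    for my $url (@urls) {
        $pm->start and next;                   # parent gets the child PID and moves on
        ( my $file = $url ) =~ s{[^\w.-]+}{_}g;    # crude local filename from the URL
        system( 'curl', '--silent', '--max-time', '30', '-o', $file, $url );
        $pm->finish;                           # child exits here
    }
    $pm->wait_all_children;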

Re: what modules you recommend for downloading hundreds of URLs per second in parallel?
by Arunbear (Prior) on Jun 13, 2008 at 15:18 UTC
    Try Gungho. Or you could invoke an external crawler like Httrack (written in C, very fast).