in reply to Re^2: Fast fetching of HTML response code
in thread Fast fetching of HTML response code
"the time it takes to get the HEAD is about the same time it takes for the regular GET request"
That suggests that the time taken isn't the time required to transmit the page from the server to you, but rather the time it takes the server to prepare the page.
It's a fact of life that with the preponderance of dynamically generated content being served these days, the difference between HEAD and GET requests is minimal. For the most part, servers treat HEAD requests as GET requests but then throw away the generated page and only return the headers.
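To make that concrete: a HEAD request is byte-for-byte identical to a GET apart from the method line, so a server generating the page dynamically does the same work either way. A minimal sketch using HTTP::Request (the URL is a placeholder, not from the original thread):

```perl
use strict;
use warnings;
use HTTP::Request;

my $url  = 'http://example.com/';   # placeholder URL -- substitute your own
my $head = HTTP::Request->new( HEAD => $url );
my $get  = HTTP::Request->new( GET  => $url );

# The two requests differ only in the method line; the server typically
# generates the full page for both and merely omits the body for HEAD.
print $head->as_string;
print $get->as_string;
```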
That means there is no way to do what you asked for -- speed up the acquisition of the status code -- for individual pages.
As Corion pointed out elsewhere, your best bet for reducing the overall runtime is to issue multiple concurrent GETs, and so overlap the server and transmission times of those multiple GETs with your local processing of the responses.
There are several ways of doing that. Corion suggested (one flavour of) the event-driven state machine method.
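For illustration, here is a minimal sketch of the event-driven style using AnyEvent::HTTP -- my choice of module, not necessarily what Corion's example used, and the URLs are placeholders:

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

my @urls = map "http://$_", qw( example.com example.org );  # placeholder URLs
my $cv   = AnyEvent->condvar;

for my $url ( @urls ) {
    $cv->begin;                       # count one outstanding request
    http_head $url, timeout => 10, sub {
        my( $data, $headers ) = @_;
        # $headers->{Status} holds the HTTP status (or a 59x pseudo-status
        # for connection-level errors)
        print "$url => $headers->{Status}\n";
        $cv->end;                     # this request is done
    };
}

$cv->recv;   # run the event loop until every begin has a matching end
```

All requests are in flight at once; the callbacks fire as responses arrive, in whatever order the servers answer.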
Personally, I first tried LWP::Parallel::UserAgent:

```perl
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use LWP::Parallel::UserAgent;
use HTTP::Request;

my $start = time;
my $pua = LWP::Parallel::UserAgent->new();
$pua->timeout( 10 );

while( <> ) {
    chomp;
    $pua->register( HTTP::Request->new( HEAD => "http://$_" ) );
}
my $entries = $pua->wait;

printf "Took %.6f seconds\n", time - $start;

__END__
c:\test>pua-head-urls urls.list
Took 1333.616000 seconds
```
Here's a thread-pool implementation for reference:
```perl
#! perl -slw
use threads stack_size => 4096;
use threads::shared;
use Thread::Queue;
$|++;

our $THREADS //= 10;

my $count :shared = 0;
my %log :shared;
my $Q = new Thread::Queue;

my @threads = map async( sub {
    our $ua;
    require 'LWP/Simple.pm';
    LWP::Simple->import( '$ua', 'head' );
    $ua->timeout( 10 );

    while( my $url = $Q->dequeue() ) {
        my $start = time;
        my @info = head( 'http://' . $url );
        my $stop = time;
        lock %log;
        $log{ $url } = $stop - $start;
        lock $count;
        ++$count;
    }
} ), 1 .. $THREADS;

require 'Time/HiRes.pm';
Time::HiRes->import( qw[ time ] );

my $start = time;
while( <> ) {
    chomp;
    Win32::Sleep 100 if $Q->pending > $THREADS;
    $Q->enqueue( $_ );
    printf STDERR "\rProcessed $count urls";
}

$Q->enqueue( (undef) x $THREADS );

printf STDERR "\rProcessed $count urls"
    while $Q->pending and Win32::Sleep 100;

printf STDERR "\nTook %.6f with $THREADS threads\n", time() - $start;

$_->join for @threads;

my( @times, $url, $time );
push @times, [ $url, $time ] while ( $url, $time ) = each %log;
@times = sort{ $b->[1] <=> $a->[1] } @times;
print join ' ', @$_ for @times[ 0 .. 9 ];

__END__
c:\test>t-head-urls -THREADS=30 urls.list
Processed 2596 Took 43.670000 with 30 threads
```
It's more complex, but once you've reduced the overall runtime by overlapping the requests to the point where you saturate your connection bandwidth, the time spent processing the responses locally starts to dominate.
Then the threads solution starts to come into its own because it efficiently and automatically utilises however many cores and CPU cycles are available, dynamically and transparently adjusting itself to fluctuations in the availability of those resources.
No other solution scales so easily, nor so effectively.
But you'll have to make up your own mind which approach suits your application and environment best.