in reply to Re^2: Fast fetching of HTML response code
in thread Fast fetching of HTML response code
"the time it takes to get the HEAD is about the same time it takes for the regular GET request"
That suggests that the time taken isn't the time required to transmit the page from the server to you, but rather the time it takes the server to prepare the page.
It's a fact of life that with the preponderance of dynamically generated content being served these days, the difference between HEAD and GET requests is minimal. For the most part, servers treat HEAD requests as GET requests but then throw away the generated page and only return the headers.
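To make that concrete: a HEAD request is byte-for-byte identical to a GET apart from the method line, so a server generating the page dynamically does the same work either way. A minimal sketch using HTTP::Request (the URL is a placeholder, not from the original thread):

```perl
use strict;
use warnings;
use HTTP::Request;

my $url  = 'http://example.com/';   # placeholder URL -- substitute your own
my $head = HTTP::Request->new( HEAD => $url );
my $get  = HTTP::Request->new( GET  => $url );

# The two requests differ only in the method line; the server typically
# generates the full page for both and merely omits the body for HEAD.
print $head->as_string;
print $get->as_string;
```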
That means there is no way to do what you asked for -- speed up the acquisition of the status code -- for individual pages.
As Corion pointed out elsewhere, your best bet for reducing the overall runtime is to issue multiple concurrent GETs, and so overlap the server and transmission times of those multiple GETs with your local processing of the responses.
There are several ways of doing that. Corion suggested (one flavour of) the event-driven state machine method.
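For illustration, here is a minimal sketch of the event-driven style using AnyEvent::HTTP -- my choice of module, not necessarily what Corion's example used, and the URLs are placeholders:

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

my @urls = map "http://$_", qw( example.com example.org );  # placeholder URLs
my $cv   = AnyEvent->condvar;

for my $url ( @urls ) {
    $cv->begin;                       # count one outstanding request
    http_head $url, timeout => 10, sub {
        my( $data, $headers ) = @_;
        # $headers->{Status} holds the HTTP status (or a 59x pseudo-status
        # for connection-level errors)
        print "$url => $headers->{Status}\n";
        $cv->end;                     # this request is done
    };
}

$cv->recv;   # run the event loop until every begin has a matching end
```

All requests are in flight at once; the callbacks fire as responses arrive, in whatever order the servers answer.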
Personally, I first tried LWP::Parallel::UserAgent:

```perl
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use LWP::Parallel::UserAgent;
use HTTP::Request;

my $start = time;
my $pua = LWP::Parallel::UserAgent->new();
$pua->timeout( 10 );

while( <> ) {
    chomp;
    $pua->register( HTTP::Request->new( HEAD => "http://$_" ) );
}
my $entries = $pua->wait;

printf "Took %.6f seconds\n", time - $start;

__END__
c:\test>pua-head-urls urls.list
Took 1333.616000 seconds
```
Here's a thread-pool implementation for reference:
```perl
#! perl -slw
use threads stack_size => 4096;
use threads::shared;
use Thread::Queue;
$|++;

our $THREADS //= 10;

my $count :shared = 0;
my %log :shared;
my $Q = new Thread::Queue;

my @threads = map async( sub {
    our $ua;
    require 'LWP/Simple.pm';
    LWP::Simple->import( '$ua', 'head' );
    $ua->timeout( 10 );

    while( my $url = $Q->dequeue() ) {
        my $start = time;
        my @info = head( 'http://' . $url );
        my $stop = time;
        lock %log;
        $log{ $url } = $stop - $start;
        lock $count;
        ++$count;
    }
} ), 1 .. $THREADS;

require 'Time/HiRes.pm';
Time::HiRes->import( qw[ time ] );

my $start = time;
while( <> ) {
    chomp;
    Win32::Sleep 100 if $Q->pending > $THREADS;
    $Q->enqueue( $_ );
    printf STDERR "\rProcessed $count urls";
}

$Q->enqueue( (undef) x $THREADS );

printf STDERR "\rProcessed $count urls"
    while $Q->pending and Win32::Sleep 100;

printf STDERR "\nTook %.6f with $THREADS threads\n", time() - $start;

$_->join for @threads;

my( @times, $url, $time );
push @times, [ $url, $time ] while ( $url, $time ) = each %log;
@times = sort{ $b->[1] <=> $a->[1] } @times;
print join ' ', @$_ for @times[ 0 .. 9 ];

__END__
c:\test>t-head-urls -THREADS=30 urls.list
Processed 2596 Took 43.670000 with 30 threads
```
It's more complex, but once you've reduced the overall runtime by overlapping the requests to the point where you saturate your connection bandwidth, the time spent processing the responses locally starts to dominate.
Then the threads solution starts to come into its own because it efficiently and automatically utilises however many cores and CPU cycles are available, dynamically and transparently adjusting itself to fluctuations in the availability of those resources.
No other solution scales so easily, nor so effectively.
But you'll have to make up your own mind which approach suits your application and environment best.