Greetings monks!
I am building a small crawler here to index and check our intranet. As with so many HTML resources these days, a *lot* of the URIs lack the trailing .html, .htm, or .asp extension, so in order not to fetch our binary data (which lacks proper extensions as well...) I first do a HEAD request and then, if the content type is text/*, a GET.
But this puts double stress on the server and is a hack.
Is there a way to detect the content type without downloading loads of binary data first?
The code without all the details...
my $head_request  = HTTP::Request->new( HEAD => $url );
my $head_response = $agent->request($head_request);
...
# fetch only when the HEAD reports a text/* content type
if ( $head_response->is_success && $head_response->content_type =~ m{^text/} ) {
    my $response = $agent->get($url);
}
I am not so much into LWP callbacks, but are callbacks the proper way to go? Or is there a way to tell LWP to download just the first few bytes and then (how?) check whether this is binary data or text/* (but what about utf8, which is binary!)?
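To show what I mean: from skimming the LWP::UserAgent docs, I imagine the callback route might look something like the sketch below. This is untested, the URL is made up, and I am assuming that dying inside the :content_cb handler is the right way to abort the transfer once the headers show a non-text type.

use strict;
use warnings;
use LWP::UserAgent;

my $agent = LWP::UserAgent->new;
my $url   = 'http://intranet.example/some/resource';   # made-up URL

# Single GET; the callback sees the response headers along with the first
# chunk of data, so we can bail out before pulling down megabytes of binary.
my @chunks;
my $response = $agent->get(
    $url,
    ':content_cb' => sub {
        my ( $data, $res, $proto ) = @_;
        die "not text\n" unless $res->content_type =~ m{^text/};
        push @chunks, $data;              # collect the body for indexing
    },
    ':read_size_hint' => 1024,            # ask LWP for small chunks
);

if ( $response->header('X-Died') ) {
    # transfer was aborted from the callback: skip this URL
    print "skipping $url (", $response->content_type, ")\n";
}
else {
    my $body = join '', @chunks;
    # ... index $body ...
}

Or would setting $agent->max_size(1024) and checking the Client-Aborted header be the saner way, at the cost of always pulling the first kilobyte?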
How does Mozilla handle this...?