LWP: Alternative to using HEAD 1st and then GET? (content sensing?)

isync has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks!

I am building a small crawler to index and check our intranet. As with so many HTML resources these days, a *lot* of the URIs lack a trailing .html, .htm, or .asp extension. So, in order not to fetch our binary data (which also lacks proper extensions), I issue a HEAD request first and then, if the content type is text/*, a GET.

But this puts double stress on the server and is a hack. Is there a way to detect or sense content type without downloading loads of binary data first?

The code without all the details...

my $head_request  = HTTP::Request->new(HEAD => $url);
my $head_response = $agent->request($head_request);
...
my $response = $agent->get($url);

I am not much into LWP callbacks, but are callbacks the proper way to go? Or is there a way to tell LWP to download just the first few bytes and then (how?) check whether the data is binary or text/* (but what about UTF-8, which is binary!)?
How does Mozilla handle this...?


Replies are listed 'Best First'.
Re: LWP: Alternative to using HEAD 1st and then GET? (content sensing?) by BrowserUk (Patriarch) on Aug 14, 2007 at 12:10 UTC
Add an Accept header to the request, e.g. `Accept: text/*`. See the RFC.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^2: LWP: Alternative to using HEAD 1st and then GET? (content sensing?) by isync (Hermit) on Aug 14, 2007 at 12:25 UTC
It's already in my code: `$agent->default_headers->header('Accept' => 'text/*');` But the server (not configured by me) occasionally spits out binary data from some gory scripts which print false headers... To make a long story short: I don't want to rely on headers; I want to make it bulletproof. Is there any way to do that? And how does Mozilla do it? (They know how to deal with incorrectly flagged content...)
Re^3: LWP: Alternative to using HEAD 1st and then GET? (content sensing?) by BrowserUk (Patriarch) on Aug 14, 2007 at 12:40 UTC
Maybe use a `:content_cb` callback, inspect the first chunk using tr or a regex looking for 'binary values', and abandon the request if you find them?
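A minimal sketch of the "inspect the first chunk" idea, using tr to count suspicious bytes. The helper name `looks_binary` and the 10% threshold are assumptions made up for illustration; neither comes from LWP. Bytes of 0x80 and above are deliberately not counted, so UTF-8 text passes. UTF-16 text would be flagged, because it contains NUL bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: guess whether a chunk of raw bytes is binary.
# Assumption: any NUL byte, or more than 10% control bytes (other than
# tab, LF, VT, FF, CR), means "binary". High bytes (>= 0x80) are ignored
# so that UTF-8 encoded text is not mistaken for binary data.
sub looks_binary {
    my ($chunk) = @_;
    return 0 unless length $chunk;

    # tr with an empty replacement list only counts; it does not modify.
    return 1 if $chunk =~ tr/\x00//;

    my $control = ( $chunk =~ tr/\x01-\x08\x0E-\x1F// );
    return $control / length($chunk) > 0.10 ? 1 : 0;
}
```

This is only a heuristic: it will misjudge some inputs, so the threshold is worth tuning against the data the intranet actually serves.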
Re^4: LWP: Alternative to using HEAD 1st and then GET? (content sensing?) by isync (Hermit) on Aug 14, 2007 at 20:37 UTC
Re^5: LWP: Alternative to using HEAD 1st and then GET? (content sensing?) by BrowserUk (Patriarch) on Aug 14, 2007 at 21:41 UTC
Re^5: LWP: Alternative to using HEAD 1st and then GET? (content sensing?) by daxim (Curate) on Aug 17, 2007 at 17:20 UTC
Re^3: LWP: Alternative to using HEAD 1st and then GET? (content sensing?) by jbert (Priest) on Aug 14, 2007 at 13:04 UTC
If you don't want to rely on header information, you're going to have to start fetching the content. So just close the connection in the (rare) case you see binary data; the server will get an EPIPE/SIGPIPE/whatever Windows does in this case and get on with its life.
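A rough sketch of how that could look with LWP, assuming a NUL byte in the first chunk is a good enough sign of binary data. The sub name `fetch_text_only` is hypothetical. Calling die inside a `:content_cb` callback makes LWP abort the request and record the reason in the `X-Died` response header; `:read_size_hint` is only a hint, so the first chunk may be larger or smaller than requested.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: GET $url, but give up as soon as the first chunk
# of the body looks binary. Returns the body, or nothing if the request
# failed or was aborted by the callback.
sub fetch_text_only {
    my ($ua, $url) = @_;
    my $body       = '';
    my $seen_first = 0;

    my $response = $ua->get(
        $url,
        ':read_size_hint' => 1024,    # ask for small chunks (a hint only)
        ':content_cb'     => sub {
            my ($chunk) = @_;
            # Only the first chunk is inspected. Assumption: a NUL byte
            # means binary data.
            if ( !$seen_first++ && $chunk =~ /\x00/ ) {
                die "binary content\n";    # aborts the transfer
            }
            $body .= $chunk;
        },
    );

    return if $response->header('X-Died');    # aborted by the callback
    return unless $response->is_success;
    return $body;
}
```

Note that the status code stays whatever the server sent (typically 200) even when the callback aborts, which is why the sketch checks `X-Died` rather than relying on `is_success` alone.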
Re: LWP: Alternative to using HEAD 1st and then GET? (content sensing?) by ForgotPasswordAgain (Vicar) on Aug 14, 2007 at 11:56 UTC
Double stress, hah. Show us your benchmarks.
Re^2: LWP: Alternative to using HEAD 1st and then GET? (content sensing?) by isync (Hermit) on Aug 14, 2007 at 12:05 UTC
OK, this has no heavy impact on the server, but it's a hack anyway. Any better tips, other than your short remark?