isync has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks!

I am building a small crawler to index and check our intranet. As with so many HTML resources these days, a *lot* of the URIs lack a trailing .html, .htm, or .asp extension. So, in order not to fetch our binary data (which also lacks proper extensions...), I first send a HEAD request and then, if the content type is text/*, a GET.

But this puts double stress on the server and is a hack. Is there a way to detect or sense content type without downloading loads of binary data first?

The code without all the details...
my $head_request = HTTP::Request->new(HEAD => $url);
my $head_response = $agent->request($head_request);
...
my $response = $agent->get($url);

I am not so much into LWP callbacks, but are callbacks the proper way to go? Or is there a way to tell LWP to download only the first few bytes and then (how?) check whether this is binary data or text/* (but what about UTF-8, which is binary!)?
How does Mozilla handle this...?

Replies are listed 'Best First'.
Re: LWP: Alternative to using HEAD 1st and then GET? (content sensing?)
by BrowserUk (Patriarch) on Aug 14, 2007 at 12:10 UTC
      It's already in my code:
      $agent->default_headers->header('Accept' => 'text/*');
      But the server (not configured by me) spits out some binary data (rarely) from some gory scripts which print false headers... To make a long story short: I didn't want to rely on headers, and I want to make it bulletproof. Any way to do it?

      And, how does Mozilla do it? (they know how to deal with incorrectly flagged content...)
        If you don't want to rely on header information, you're going to have to start fetching the content.

        So just close the connection in the (rare) case you see binary data. The server will get an EPIPE/SIGPIPE/whatever Windows does in this case and get on with its life.
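
A minimal sketch of that idea, assuming LWP's `:content_cb` option to `get` and its documented behaviour of aborting the request when the callback dies (the die message then shows up in the `X-Died` response header). The names `looks_binary` and `fetch_if_text` and the 10% threshold are hypothetical, chosen only for illustration. The heuristic deliberately ignores bytes above 0x7F so that UTF-8 text passes; the price is that UTF-16 text, which contains NUL bytes, would be rejected.

```perl
use strict;
use warnings;

# Hypothetical heuristic: a chunk counts as binary if it contains a NUL byte,
# or if more than 10% of its bytes are control characters other than
# tab, LF, FF and CR. Bytes >= 0x80 are not counted, so UTF-8 text passes.
sub looks_binary {
    my ($bytes) = @_;
    return 0 unless length $bytes;
    return 1 if $bytes =~ /\x00/;
    my $control = () = $bytes =~ /[\x01-\x08\x0B\x0E-\x1F]/g;
    return $control / length($bytes) > 0.10 ? 1 : 0;
}

# Fetch $url with a single GET. The first chunk is inspected; if it looks
# binary, the callback dies, which makes LWP abort the transfer.
# $agent is expected to be an LWP::UserAgent object supplied by the caller.
sub fetch_if_text {
    my ($agent, $url) = @_;
    my $body    = '';
    my $checked = 0;
    my $response = $agent->get(
        $url,
        ':content_cb' => sub {
            my ($chunk) = @_;
            if (!$checked) {
                $checked = 1;
                die "binary content\n" if looks_binary($chunk);
            }
            $body .= $chunk;
        },
    );
    return if $response->header('X-Died');    # callback aborted the request
    return unless $response->is_success;
    return $body;
}

# Usage (needs network access, so it is left commented out):
#   use LWP::UserAgent;
#   my $agent = LWP::UserAgent->new;
#   my $text  = fetch_if_text($agent, $url);
#   print defined $text ? "text, " . length($text) . " bytes\n" : "skipped\n";
```

This replaces the HEAD/GET pair with one GET per URL, and only the first chunk of a mislabelled binary resource is transferred before the connection is dropped.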

Re: LWP: Alternative to using HEAD 1st and then GET? (content sensing?)
by ForgotPasswordAgain (Vicar) on Aug 14, 2007 at 11:56 UTC

    Double stress, hah.

    Show us your benchmarks.

      OK, this has no heavy impact on the server, but it is a hack anyway. Any better tips, other than your short remark?