http://qs1969.pair.com?node_id=910527

sarvan has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone,

I am a newbie to this forum, and I hope I have found a good place for Q&A. I have a question: is it possible to check whether a URL points to a document such as a PDF, or whether it is just a web page link?

I used LWP::Simple to check that the URL exists, but I am not sure how to do this part.

Any help is highly appreciated. Thanks..

Replies are listed 'Best First'.
Re: LWP::Simple to judge the url
by moritz (Cardinal) on Jun 20, 2011 at 09:22 UTC
    If you're not fixed on LWP::Simple, here's an example with Mojolicious:
    use Mojo::UserAgent;
    my $url = "http://de.arxiv.org/pdf/1106.3541";
    print Mojo::UserAgent->new->head($url)->res->headers->content_type;

    This prints application/pdf, indicating that the document returned from this URL is a PDF file.

      (and if you don't have Mojolicious installed, LWP::UserAgent does it as well:)

      use LWP::UserAgent;
      my $url = "http://de.arxiv.org/pdf/1106.3541";
      print LWP::UserAgent->new->head($url)->headers->content_type();
Re: LWP::Simple to judge the url
by Corion (Patriarch) on Jun 20, 2011 at 09:18 UTC

    See the ->head method of LWP::UserAgent. The Content-Type header of the response should tell you what content the page sends back.

      LWP::Simple does head just as well. And it might be simpler to use.
      head($url)
      Get document headers. Returns the following 5 values if successful: ($content_type, $document_length, $modified_time, $expires, $server)

      Returns an empty list if it fails. In scalar context returns TRUE if successful.
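
      To make the head() approach concrete, here is a minimal sketch, not a definitive implementation. The helper name classify_content_type and its labels ('pdf', 'webpage', 'other', 'unknown') are made up for illustration, and the URLs are assumed to come from the command line. LWP::Simple is loaded only when there is something to fetch, so the helper can be tried without any network access.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): reduce a raw Content-Type
# header value to a coarse label.
sub classify_content_type {
    my ($content_type) = @_;
    return 'unknown' unless defined $content_type;

    # Drop parameters such as "; charset=UTF-8" and normalise case.
    my ($media_type) = split /\s*;\s*/, lc $content_type;
    return 'unknown' unless defined $media_type && length $media_type;

    return 'pdf'     if $media_type eq 'application/pdf';
    return 'webpage' if $media_type eq 'text/html'
                     || $media_type eq 'application/xhtml+xml';
    return "other ($media_type)";
}

# Check each URL given on the command line; without arguments nothing is fetched.
if (@ARGV) {
    require LWP::Simple;    # loaded lazily, only when a request is needed
    for my $url (@ARGV) {
        # In list context head() returns five values, or an empty list on failure.
        my @headers = LWP::Simple::head($url);
        if (@headers) {
            print "$url: ", classify_content_type($headers[0]), "\n";
        }
        else {
            print "$url: HEAD request failed\n";
        }
    }
}
```

      Keep in mind that the Content-Type is only what the server claims; some servers answer HEAD requests differently from GET, or not at all, so a failed HEAD does not prove the URL is dead.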

Re: LWP::Simple to judge the url
by ww (Archbishop) on Jun 20, 2011 at 15:36 UTC
    The excellent replies above quite satisfactorily answer the question asked.

    But the question itself strikes me as a bit odd, in an age when mislabeled or unlabeled internet content should probably be regarded as suspect/undesirable/dangerous.

    OP's test for existence tells the name (and -- in some cases -- the nominal file.typ) of the target of the link. The answers above tell how to find out the actual type of the file, whether or not (OP's case) an extension is provided on the server.

    OTOH, were one to rely on a browser, clicking a mislabeled link might provide perhaps as little info as "binary" (try this on an MSWord doc mislabeled as doc.foo, with FF under linux), or perhaps misleading info on the actual type (content) of the file (try opening a .pdf mislabeled as an .xls, under w32).

    Is there an X/Y problem here or am I missing some reasonable basis for the question?

      Could the question be some sort of test or homework?

      A few weeks back I was sent a series of about 10 Perl questions by a potential employer, with no real time limit on them. (*) Two of the questions were about checking whether URLs worked and what file type was at the end of them, so rather similar to this question.

      * I was asked to bring answers to a job interview a week or so later.