Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks,

I hope someone can help me. I have quite a simple question, but I don't know how to solve it.

I am writing a web spider for my degree dissertation. Don't worry, I have used the RobotUA module, so it's nice and friendly to people's web sites. The problem I'm having is trying to determine the type of the files I'm retrieving.

For example, if I send a request for
http://www.php.net/index.php
I can easily tell this is a PHP file because it has a filename included.

But what if I send a request for
http://www.perlmonks.org/
how can I tell the file's MIME type from just this?

Is there some way of returning the MIME type of the file I am retrieving? Any help would be very much appreciated. Thanks, Tom

Replies are listed 'Best First'.
Re: MIME types
by merlyn (Sage) on Mar 30, 2002 at 18:34 UTC
    For example, if I send a request for http://www.php.net/index.php I can easily tell this is a PHP file because it has a filename included.
    You are confused. A "PHP File" is not a MIME type. This URL may return a plain text file, an image, an MPEG, a PDF, or even HTML.

    Repeat after me:

    • "There is no necessary correlation between a URL and the content type it serves."
    Now, to answer your next question, the LWP module can give you access to the Content-type header of the response, which is where you find the real "MIME type". It'll look like text/plain or image/jpeg. None of those include the letters "PHP", by the way. {grin}

    -- Randal L. Schwartz, Perl hacker

      thanks for the response,

      Yeah, I thought I had probably got the MIME types thing a bit confused. So I guess the MIME type just tells you the type of document, i.e. the content type, right? "There is no necessary correlation between a URL and the content type it serves." OK, I got that, but unfortunately I still have my problem...

      You see, I'm trying to find out which technologies are used on different sites, i.e. the number of sites using Perl/PHP/ASP. I can tell this by the file extension, e.g. index.php, index.pl or index.asp. So if I just send a request for http://www.google.com/, how can I get the extension of the file it returns? I just need the filename of the returned file, or something like that. If anyone knows a better way to do this, great :-)

      I hope that made my question a little clearer, thanks again!

      btw, is it true that Perl is the only programming language that looks the same before and after RSA encryption?! :-)

        Keep in mind that the extension isn't a guarantee that the site is running NT, ASP, straight HTML, JSP or your favorite other technology; it's very easy to tell Apache, for example, that all files ending with ".asp" should be parsed as PHP files.

        Not only that, but a module like HTML::Mason preparses HTML files before they are served. I believe this is the recommended way to use it.

        Just to be a little different for once, I went and renamed all my .pl CGIs to .html.

        I don't think there is any clear, definitive method of determining what technology a site actually uses.

        As for your second question: nope, there's also machine code... :-)

        I can tell this by the file extension
        No, you can't. Please stop hallucinating. The information you want is not available.

        -- Randal L. Schwartz, Perl hacker

Re: MIME types
by JayBonci (Curate) on Mar 30, 2002 at 21:11 UTC
    If you are going to use LWP to get files from that server, you can get information back from the HTTP header.

    Look into the HTTP::Response module; that is the kind of object your user agent hands back for each request. On that object, call my $hdrs = $response->headers(); to get the HTTP::Headers object holding the headers the server sent.

    From there you'll be looking to use
    $hdrs->content_type; # this returns a lowercase string
    Hope this will get you on your way. Good luck with the spider.

        --jb
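
For anyone following along, the header lookup described in this thread can be sketched roughly as below. This is only an illustration under stated assumptions: the response is built by hand so the snippet runs without touching the network, and the helper name media_type_of is made up for this example. In a real spider the HTTP::Response object would come back from the user agent instead.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper (not part of LWP): return the media type
# found in a response's Content-Type header.
sub media_type_of {
    my ($response) = @_;
    my $hdrs = $response->headers;    # the HTTP::Headers object
    # In scalar context content_type returns only the type,
    # lowercased, with parameters such as charset chopped off.
    return scalar $hdrs->content_type;
}

# Hand-built stand-in for what a user agent would return.
# The header value here is an assumption chosen for illustration.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'Text/HTML; charset=ISO-8859-1' ],
    '<html></html>',
);

print media_type_of($response), "\n";    # prints "text/html"
```

Note that nothing in the URL is consulted here. The type comes only from what the server chose to send in the header, which matches the point made earlier in the thread that the extension tells you nothing reliable.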