Re: Identifying PDF from URLs

First thing: in general, you cannot determine the MIME type of a resource just by looking at its URL. In fact, for a fixed URL, a server might give you a PNG image, an HTML document, a random stream of bytes, a PDF document or a 404 error, depending on the roll of a die.

You can request the resource and look at the HTTP Content-Type header to see what the server claims the MIME type of the resource is. If you trust the server(s), this may be enough for you. Otherwise, you will actually have to download the resource and inspect it. You could look at the magic bytes and determine the file type from that (PDF files start with %PDF-, so you only need the first 5 bytes of the resource), but even that may not be enough. It's only a proper PDF file if the entire file has the correct syntax. For that, you'd need to download the entire resource and parse it.
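As a rough illustration of the first two levels of checking, here is a minimal Python sketch using only the standard library. The function names are made up for this example. It assumes the server answers HEAD requests, which not every server does. It also sends a Range header to ask for just the first bytes; a server is free to ignore that and send the whole body, in which case only the first 5 bytes are read anyway.

```python
import urllib.request

PDF_MAGIC = b"%PDF-"  # the 5 bytes a PDF file starts with


def mime_from_header(value):
    """Reduce a Content-Type header value to a bare, lower-cased MIME type.

    Returns None if the header is missing or empty.
    """
    if not value:
        return None
    # Drop parameters such as "; charset=..." and normalise case.
    return value.split(";", 1)[0].strip().lower()


def claimed_mime_type(url, timeout=10):
    """Ask the server (via HEAD) what MIME type it claims for the URL."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return mime_from_header(resp.headers.get("Content-Type"))


def first_bytes(url, count=len(PDF_MAGIC), timeout=10):
    """Fetch only the first `count` bytes of the resource."""
    # The Range header is a request, not a guarantee; read(count) caps
    # how much we consume even if the server sends the full body.
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{count - 1}"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read(count)


def starts_like_pdf(data):
    """True if the data begins with the PDF magic bytes."""
    return data.startswith(PDF_MAGIC)
```

With a URL of your own in `url`, `claimed_mime_type(url) == "application/pdf"` tells you what the server claims, and `starts_like_pdf(first_bytes(url))` tells you whether the content at least begins like a PDF. Neither check validates the file's syntax; that still requires downloading and parsing the whole thing.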

So, to summarize: you cannot determine the document format from the URL alone - you'll have to query the server. Depending on your level of trust, you need either the HTTP header, the first bytes of the resource, or the entire resource to determine its MIME type.
