in reply to Identifying PDF from URLs
You can make a request for the resource and look at the Content-Type HTTP header to see what the server claims the MIME type of the resource is. If you trust the server(s), this may be enough for you. Otherwise, you will actually have to download the resource and inspect it. You could look at the magic bytes and determine the file type from that (PDF files start with %PDF-, so you only need the first 5 bytes of the resource), but even that may not be enough. It's only a proper PDF file if the entire file has the correct syntax. For that, you'd need to download the entire resource and parse it.
So, to summarize: you cannot determine the document format from the URL alone; you'll have to query the server. Depending on your level of trust, you need either the HTTP header, the first bytes of the resource, or the entire resource to determine its MIME type.
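To make the first two levels concrete, here is a minimal Python sketch using only the standard library. The function names (header_claims_pdf, starts_with_pdf_magic, probe) are made up for illustration. It assumes the server answers HEAD requests and that reading the first 5 bytes of the body is acceptable even if the server ignores the Range header. The third level, parsing the whole file, needs a real PDF parser and is not shown.

```python
import urllib.request

PDF_MAGIC = b"%PDF-"  # the 5 bytes a PDF file starts with


def header_claims_pdf(content_type):
    """Return True if a Content-Type header value names application/pdf."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=binary" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"


def starts_with_pdf_magic(first_bytes):
    """Return True if the given bytes begin with the PDF magic bytes."""
    return first_bytes.startswith(PDF_MAGIC)


def probe(url, timeout=10):
    """Hypothetical helper: return (server_claims_pdf, magic_bytes_match)."""
    # Level 1: ask only for the headers and see what the server claims.
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=timeout) as resp:
        claimed = header_claims_pdf(resp.headers.get("Content-Type"))

    # Level 2: ask for the first 5 bytes. A server may ignore Range and
    # send the whole body, so read at most 5 bytes either way.
    get = urllib.request.Request(url, headers={"Range": "bytes=0-4"})
    with urllib.request.urlopen(get, timeout=timeout) as resp:
        magic_ok = starts_with_pdf_magic(resp.read(len(PDF_MAGIC)))

    return claimed, magic_ok
```

Calling probe(url) on a URL of your choice returns the two answers separately, so you can decide how much weight to give the server's claim versus the actual bytes.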