Identifying PDF from URLs

listanand has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Identifying PDF from URLs by tinita (Parson) on May 24, 2010 at 23:39 UTC
You can try LWP::Simple, if the server sets the Content-Type header correctly: `use LWP::Simple qw/ head /; my ($content_type) = head($url);` Without fetching at least the HTTP headers there's no way to determine the file type. If the file type is not set correctly, you could fetch the body with LWP::Simple::get and then feed this to MIME::Types (after writing it to a file if necessary). Update: by the way, your example URL is simply HTML, but it has an iframe whose source links to a PDF.
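As a rough illustration of the header check described above: LWP::Simple's `head` hands back the raw Content-Type value, which may carry parameters such as a charset, so it helps to normalise it before comparing. The helper name `is_pdf_content_type` is made up for this sketch, and treating only `application/pdf` as "PDF" is an assumption; adjust it if your servers use other labels.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): true if a Content-Type header
# value names the PDF media type, ignoring case and any parameters.
sub is_pdf_content_type {
    my ($header) = @_;
    return 0 unless defined $header;

    # Keep only the media type, dropping parameters like "; charset=UTF-8".
    my ($type) = split /;/, $header, 2;
    return 0 unless defined $type;

    # Trim surrounding whitespace before comparing.
    $type =~ s/^\s+|\s+$//g;
    return lc($type) eq 'application/pdf' ? 1 : 0;
}
```

With the snippet from the post, this would be used as `my ($content_type) = head($url); print "server claims PDF\n" if is_pdf_content_type($content_type);`. It only tells you what the server claims, not what the body actually contains.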
Re: Identifying PDF from URLs by Corion (Patriarch) on May 24, 2010 at 21:35 UTC
So, how does it fail for you? What steps have you taken to identify the cause? Where in the documentation of MIME::Types does it say that you can use it on a URL? If you actually read the MIME::Types documentation for `mimeTypeOf`, you will find that it says: `$obj->mimeTypeOf(FILENAME)` "Returns the MIME::Type object which belongs to the FILENAME (or simply its filename extension) or undef if the file type is unknown. The extension is used, and considered case-insensitive." Nowhere does it say that it would apply to URLs, and it only looks at the name, or rather, only at the extension. So maybe your snippet is not even relevant?
Re: Identifying PDF from URLs by Your Mother (Archbishop) on May 24, 2010 at 23:53 UTC
tinita is steering you rightly. You can see from this snippet, though, that the page is not a PDF. Even the page it redirects to in a browser is not a PDF but an HTML page with a PDF viewer embedded. Getting the PDF from that scheme might not end up being trivial. :( Running `perl -MLWP::Simple=head -le 'print [ head(+shift) ]->[0]' "http://ccdl.libraries.claremont.edu/u?/stc,87"` prints `text/html`.
Re^2: Identifying PDF from URLs by tinita (Parson) on May 25, 2010 at 00:19 UTC
"Getting the PDF from that scheme might not end up being trivial." Indeed. In this case it seems HEAD requests are blocked. I tried to fetch the direct link to the PDF with the HEAD script and it returned text/html and "Content-Disposition: filename=404.txt". So here it's probably necessary to use a GET request with LWP::UserAgent and read the HTTP headers from the response :-/
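A minimal sketch of the GET-based approach mentioned above, for servers that do not answer HEAD usefully. The function name `probe_url`, the 15-second timeout and the 1024-byte cap are choices made for this example, not anything from the thread. It uses `max_size` so that LWP stops reading the body early, and it also peeks at the first bytes to see whether they look like a PDF.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: GET the URL and report what the server claims,
# plus whether the body begins with the PDF magic bytes.
sub probe_url {
    my ($url) = @_;

    my $ua = LWP::UserAgent->new( timeout => 15 );
    $ua->max_size(1024);    # headers plus the first bytes are enough here

    my $res = $ua->get($url);
    return { ok => 0, status => $res->status_line } unless $res->is_success;

    my $type = $res->content_type;    # media type without parameters
    my $body = $res->content;         # raw bytes as received

    return {
        ok           => 1,
        status       => $res->status_line,
        content_type => $type,
        disposition  => scalar $res->header('Content-Disposition'),
        magic_is_pdf => ( substr( $body, 0, 5 ) eq '%PDF-' ? 1 : 0 ),
    };
}
```

Calling `my $info = probe_url($url);` and printing `$info->{content_type}`, `$info->{disposition}` and `$info->{magic_is_pdf}` should show at a glance whether the header and the body agree.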
Re^3: Identifying PDF from URLs by listanand (Sexton) on May 25, 2010 at 00:23 UTC
Hi all, Thanks for your replies. I will try these suggestions later tonight and get back to you. Andy
Re: Identifying PDF from URLs by aquarium (Curate) on May 24, 2010 at 23:07 UTC
What you probably want to do is use one of the CGI modules to get and check the header for the URL; hopefully the header will be set to one of the PDF-related MIME types. the hardest line to type correctly is: stty erase ^H
Re: Identifying PDF from URLs by JavaFan (Canon) on May 31, 2010 at 11:18 UTC
First thing: in general, you cannot determine the MIME type of a resource by just looking at a URL. In fact, given a fixed URL, a server might give you a PNG image, an HTML document, a random stream of bytes, a PDF document or a 404 error, depending on the roll of a die. You can do a request for the resource, and look at the HTTP header, to see what the server claims the MIME type of the resource is. If you trust the server(s), this may be enough for you. Else, you will actually have to download the resource, and inspect it. You could look at the magic bytes and determine the file type from that (PDF files start with `%PDF-`, so you only need the first 5 bytes of the resource) - but even that may not be enough. It's only a proper PDF file if the entire file has the correct syntax. For that, you'd need to download the entire resource and parse it. So, to summarize: you cannot determine the document format from the URL alone - you'll have to query the server. Depending on your level of trust, you need either the HTTP header, the first bytes of the resource, or the entire resource to determine its MIME type.
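The magic-bytes check described above can be sketched in core Perl along these lines. Both helper names are invented for the example. It only tests the five-byte `%PDF-` prefix, so, as the post says, a match does not prove the rest of the file is valid PDF.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string starts with "%PDF-".
sub looks_like_pdf {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    return substr( $bytes, 0, 5 ) eq '%PDF-' ? 1 : 0;
}

# Hypothetical helper: apply the same check to the first 5 bytes of a
# file that has already been downloaded to disk.
sub file_looks_like_pdf {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read( $fh, my $head, 5 );
    close $fh;
    return 0 unless defined $got && $got == 5;
    return looks_like_pdf($head);
}
```

For a body fetched with LWP, `looks_like_pdf($response->content)` would be the equivalent call.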
Re: Identifying PDF from URLs by doug (Pilgrim) on May 25, 2010 at 22:48 UTC
listanand, Is performance a requirement, or can you use brute force? You could write this in bash with 'wget' to slurp the URL and 'file' to check whether it is a PDF or not. It is not the thing to do at the far end of a slow connection, but if you've throughput to waste ... Don't forget that if you have no control over the server, there is no requirement that it tells you anything about the link. I don't know if it is still the case, but MS's IIS used to call everything application/octet-stream and let the client figure it out. - doug

