PerlMonks  

Identifying PDF from URLs

by listanand (Sexton)
on May 24, 2010 at 21:26 UTC

listanand has asked for the wisdom of the Perl Monks concerning the following question:

Hi perl monks,

I have a situation where I need to identify whether a URL is a PDF file or not. Here's an example URL : http://ccdl.libraries.claremont.edu/u?/stc,87 Looking at it, it's not obvious that it links to a PDF.

Here's the relevant code snippet. It is not working though.

Thanks in advance.

use MIME::Types;

my $mt = MIME::Types->new;
my $url = "http://ccdl.libraries.claremont.edu/u?/stc,87";

if ($mt->mimeTypeOf($url) eq "application/pdf") {
    print "$url is a PDF file\n";
}

Replies are listed 'Best First'.
Re: Identifying PDF from URLs
by tinita (Parson) on May 24, 2010 at 23:39 UTC
    You can try LWP::Simple, if the server sets the Content-Type header correctly:
    use LWP::Simple qw/ head /;
    my ($content_type) = head($url);
    Without fetching at least the HTTP headers there's no way to determine the file type. If the file type is not set correctly you could fetch the body with LWP::Simple::get and then feed this to MIME::Types (after writing it to a file if necessary).

    Update: btw, your example URL is simply HTML, but it has an iframe whose source links to a PDF.
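A slightly fuller sketch of the HEAD approach, assuming LWP::UserAgent is installed. The sub names normalize_content_type and head_content_type are made up for illustration, not part of any module. The raw Content-Type header can carry parameters such as "; charset=...", so it is reduced to the bare media type before comparing.

```perl
use strict;
use warnings;

# Reduce a raw Content-Type header value to its bare, lowercased media type,
# e.g. "Application/PDF; qs=0.9" becomes "application/pdf".
sub normalize_content_type {
    my ($raw) = @_;
    return '' unless defined $raw;
    my ($type) = split /;/, $raw, 2;
    $type = '' unless defined $type;
    $type =~ s/^\s+|\s+$//g;
    return lc $type;
}

# Send a HEAD request and return the normalized media type,
# or nothing (undef in scalar context) if the request failed.
sub head_content_type {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded at call time; the helper above needs no modules
    my $ua  = LWP::UserAgent->new(timeout => 15);
    my $res = $ua->head($url);
    return unless $res->is_success;
    return normalize_content_type(scalar $res->header('Content-Type'));
}

# Runs only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $type = head_content_type($url);
    if    (!defined $type)             { print "HEAD request failed for $url\n" }
    elsif ($type eq 'application/pdf') { print "$url claims to be a PDF\n" }
    else                               { print "$url claims to be '$type'\n" }
}
```

This only reports what the server claims. A server that refuses HEAD requests or mislabels its content will still need a GET.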

Re: Identifying PDF from URLs
by Corion (Patriarch) on May 24, 2010 at 21:35 UTC

    So, how does it fail for you? What steps have you taken to identify the cause? Where in the documentation of MIME::Types does it say that you can use it on a URL?

    If you actually read the MIME::Types documentation for mimeTypeOf, you will find that it says:

    $obj->mimeTypeOf(FILENAME)
    Returns the MIME::Type object which belongs to the FILENAME (or simply its filename extension) or undef if the file type is unknown. The extension is used, and considered case-insensitive.

    Nowhere does it say that it would apply to URLs, and it only looks at the name, or rather, even only at the extension.

    So maybe your snippet is not even relevant?

Re: Identifying PDF from URLs
by Your Mother (Archbishop) on May 24, 2010 at 23:53 UTC

    tinita is steering you rightly. You can see from this snippet though that the page is not a PDF. Even the page it redirects to in a browser is not a PDF but an HTML page with a PDF viewer embedded. Getting the PDF from that scheme might not end up being trivial. :(

    perl -MLWP::Simple=head -le 'print [ head(+shift) ]->[0]' "http://ccdl.libraries.claremont.edu/u?/stc,87"
    text/html
      Getting the PDF from that scheme might not end up being trivial.
      Indeed. In this case it seems HEAD requests are blocked. I tried to fetch the direct link to the pdf with the HEAD script and it returned text/html and "Content-Disposition: filename=404.txt". So it's necessary here probably to use a GET request with LWP::UserAgent and from there read the http headers :-/
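For the case where HEAD is blocked, here is a rough sketch of the GET variant, assuming LWP::UserAgent is installed. The sub names get_headers and filename_from_disposition are invented for this example. It caps the download with max_size so that only the headers and a small slice of the body are fetched, then reports the status, the content type and any file name given in Content-Disposition.

```perl
use strict;
use warnings;

# Pull the file name out of a Content-Disposition header value, if present.
sub filename_from_disposition {
    my ($header) = @_;
    return unless defined $header;
    return unless $header =~ /\bfilename\s*=\s*"?([^";]+)/i;
    my $name = $1;
    $name =~ s/\s+$//;
    return $name;
}

# GET the URL but keep at most about 1 KB of the body; only the headers matter here.
sub get_headers {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded at call time; the parser above needs no modules
    my $ua  = LWP::UserAgent->new(timeout => 15, max_size => 1024);
    my $res = $ua->get($url);
    return {
        status      => $res->code,
        type        => scalar $res->content_type,
        disposition => scalar $res->header('Content-Disposition'),
    };
}

# Runs only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $h    = get_headers($url);
    my $name = filename_from_disposition($h->{disposition});
    printf "%s %s%s\n", $h->{status}, $h->{type} || '(no content type)',
        defined $name ? " (filename: $name)" : '';
}
```

A file name such as 404.txt in Content-Disposition, as seen above, is a hint that the server answered with an error page even if the status code says otherwise.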
        Hi all,

        Thanks for your replies.

        I will try these suggestions later tonight and get back to you.

        Andy

Re: Identifying PDF from URLs
by aquarium (Curate) on May 24, 2010 at 23:07 UTC
    What you probably want to do is use one of the CGI modules to get and check the header for the url, hopefully the header will be set with one of the pdf related mime types.
    the hardest line to type correctly is: stty erase ^H
Re: Identifying PDF from URLs
by JavaFan (Canon) on May 31, 2010 at 11:18 UTC
    First thing: in general, you cannot determine the MIME type of a resource by just looking at a URL. In fact, given a fixed URL, a server might give you a PNG image, an HTML document, a random stream of bytes, a PDF document or a 404 error, depending on the roll of a die.

    You can do a request for the resource, and look at the HTTP header, to see what the server claims the MIME type of the resource is. If you trust the server(s), this may be enough for you. Else, you will actually have to download the resource, and inspect it. You could look at the magic bytes and determine the file type from that (PDF files start with %PDF-, so you only need the first 5 bytes of the resource) - but even that may not be enough. It's only a proper PDF file if the entire file has the correct syntax. For that, you'd need to download the entire resource and parse it.

    So, to summarize: you cannot determine the document format from the URL alone - you'll have to query the server. Depending on your level of trust, you need either the HTTP header, the first bytes of the resource, or the entire resource to determine its MIME type.
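The middle level of trust (checking the magic bytes) could be sketched like this, assuming LWP::UserAgent is installed. The sub names looks_like_pdf and fetch_first_bytes are hypothetical. It asks for only the first 5 bytes with a Range header and also sets max_size, since a server is free to ignore Range.

```perl
use strict;
use warnings;

# True (1) if the data begins with the PDF signature "%PDF-", else 0.
sub looks_like_pdf {
    my ($bytes) = @_;
    return (defined $bytes && substr($bytes, 0, 5) eq '%PDF-') ? 1 : 0;
}

# Fetch roughly the first $count bytes of a resource.
# Returns nothing (undef in scalar context) if the request failed.
sub fetch_first_bytes {
    my ($url, $count) = @_;
    require LWP::UserAgent;    # loaded at call time; the check above needs no modules
    my $ua  = LWP::UserAgent->new(timeout => 15, max_size => $count);
    my $res = $ua->get($url, Range => 'bytes=0-' . ($count - 1));
    return unless $res->is_success;
    # The server may have sent more than requested, so trim to $count.
    return substr($res->content, 0, $count);
}

# Runs only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $bytes = fetch_first_bytes($url, 5);
    if    (!defined $bytes)        { print "Could not fetch $url\n" }
    elsif (looks_like_pdf($bytes)) { print "$url starts with the PDF signature\n" }
    else                           { print "$url does not start with %PDF-\n" }
}
```

As noted above, a matching signature still does not prove the rest of the file is valid PDF; that would take a full download and a parser.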

Re: Identifying PDF from URLs
by doug (Pilgrim) on May 25, 2010 at 22:48 UTC

    listanand,

    Is performance a requirement, or can you use brute force? You could write this in bash with 'wget' to slurp the URL and 'file' to check to see if it is a pdf or not. It is not the thing to do at the far end of a slow connection, but if you've throughput to waste ...

    Don't forget that if you have no control over the server, there is no requirement that it tells you anything about the link. I don't know if it is still the case, but MS's IIS used to call everything application/octet-stream and let the client figure it out.

    - doug
