PerlMonks  

Determining Content-Length when there is no Content-Length header

by hacker (Priest)
on Sep 30, 2007 at 00:24 UTC ( [id://641735]=perlquestion )

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I've run into an interesting problem while testing a new piece of code I'm writing.

During testing, I pointed my code at dozens of websites (static content, dynamic content, images, PDFs, etc.) and it all worked great. I was using HEAD to check the remote end's Content-Type and Content-Length headers, to see whether I should fetch the resource or not.

Basically if the size reported in Content-Length was too large, I'd ignore the fetch.

    use LWP::UserAgent;

    my $ua   = LWP::UserAgent->new;
    my $req  = HTTP::Request->new(HEAD => $url);
    my $resp = $ua->request($req);
    my $type        = $resp->header('Content-Type');
    my $content     = $resp->content;    # always empty for a HEAD request
    my $content_len = $resp->header('Content-Length');

This was working great, until I realized that a lot of servers don't send a Content-Length header. DOH! Even sites serving static flat-text or HTML content are not sending a Content-Length header.

In the above snippet I'm using HEAD so that I don't GET larger files only to throw them away after I've already fetched them.

So I started trying to figure out a way to determine the length of the remote content, without actually fetching the content itself, and this is where I'm stuck.

I could do this:

    my $req  = HTTP::Request->new(GET => $url);
    my $resp = $ua->request($req);
    my $content     = $resp->content;
    my $content_len = length($content);

But now I'm doing a GET, and if someone decides to point that at a 20-gigabyte file, or a DVD ISO or something like that, it'll drown my bandwidth and DDoS my tool for other users.

Is there some other way to do this, without doing a full fetch of the remote resource?

Update: This sort-of works, but for sites without a Content-Length header, I do a double-hit, HEAD first, then GET second. Is there a better way?

    my $req  = HTTP::Request->new(HEAD => $pl_url);
    my $resp = $ua->request($req);
    my $type        = $resp->header('Content-Type');
    my $status_line = $resp->status_line;
    my ($content, $content_len);

    if ($resp->header('Content-Length')) {
        $content_len = $resp->header('Content-Length');
    }
    else {
        $req         = HTTP::Request->new(GET => $pl_url);
        $resp        = $ua->request($req);
        $content     = $resp->content;
        $content_len = length($content);
    }

Replies are listed 'Best First'.
Re: Determining Content-Length when there is no Content-Length header
by calin (Deacon) on Sep 30, 2007 at 01:26 UTC

    You can try to use a partial GET (byte range) to seek around the file to find its end by trial (and avoid downloading 20GB), but I'm pretty sure 99% of the resources supporting a partial GET will also report a Content-Length in the header.

    Short answer: no, AFAIK

    Update: Here's what I found in the standard. I don't quite understand the verbiage, but take a look at this.

      "You can try to use a partial GET (byte range) to seek around the file to find its end by trial (and avoid downloading 20GB), but I'm pretty sure 99% of the resources supporting a partial GET will also report a Content-Length in the header."

      They're definitely not required to be present together.

      Furthermore, the server doesn't necessarily have to tell you that it doesn't support range requests at all (let alone in a useful manner).
      The RFC says:

      Note: clients cannot depend on servers to send a 416 (Requested range not satisfiable) response instead of a 200 (OK) response for an unsatisfiable Range request-header, since not all servers implement this request-header.
      The definitive answer is no.
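For servers that do honor ranges, here is a rough sketch of the partial-GET idea using LWP. A 206 response carries a Content-Range header of the form "bytes 0-0/12345", where the number after the slash is the total size (or "*" if the server doesn't know it). The names total_from_content_range and probe_total_size are made up for this example, and max_size is set only as a safety net for servers that ignore Range and start sending the whole resource with a 200.

```perl
use strict;
use warnings;

# Extract the total size from a Content-Range value such as "bytes 0-0/12345".
# Returns undef when the total is "*" (unknown) or the value is malformed.
sub total_from_content_range {
    my ($value) = @_;
    return unless defined $value;
    return $1 if $value =~ m{^\s*bytes\s+\d+-\d+/(\d+)\s*$}i;
    return;
}

# Hypothetical helper: request only the first byte and report the total size,
# or undef if the server ignored the Range header or gave no total.
sub probe_total_size {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded lazily so the parser above works alone
    my $ua   = LWP::UserAgent->new(timeout => 10, max_size => 1024);
    my $resp = $ua->get($url, Range => 'bytes=0-0');
    return unless $resp->code == 206;    # a 200 means Range was ignored
    return total_from_content_range(scalar $resp->header('Content-Range'));
}
```

An undef result means "can't tell", not "small", so the caller still needs a size cap on the real fetch.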

      Now, for hacker...

      Alternatively, why not use the ':content_cb' callback of LWP::UserAgent? With that callback, you can implement your own semantics for max-content-length: if you decide you no longer want to fetch the file once you reach, say, 2k, just abort the request by die()ing.

      -David
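A minimal sketch of the ':content_cb' approach, assuming a plain byte count is all that's wanted. make_size_limiter and fetch_with_limit are names invented for this example. When the callback dies, LWP aborts the transfer and puts the die message in the X-Died header of the response it returns.

```perl
use strict;
use warnings;

# Build a content callback that appends each chunk to $$buffer_ref and
# dies as soon as more than $max bytes have been collected.
sub make_size_limiter {
    my ($max, $buffer_ref) = @_;
    return sub {
        my ($chunk) = @_;    # LWP also passes the response and protocol objects
        $$buffer_ref .= $chunk;
        die "content exceeds $max bytes\n" if length($$buffer_ref) > $max;
    };
}

# Hypothetical wrapper: returns (body_so_far, aborted_flag).
sub fetch_with_limit {
    my ($url, $max) = @_;
    require LWP::UserAgent;    # loaded lazily so the limiter can be used alone
    my $body = '';
    my $ua   = LWP::UserAgent->new(timeout => 10);
    my $resp = $ua->get($url, ':content_cb' => make_size_limiter($max, \$body));
    # X-Died is set when the callback died, i.e. the limit was hit.
    my $aborted = defined $resp->header('X-Died') ? 1 : 0;
    return ($body, $aborted);
}
```

This needs only one request per URL, with or without a Content-Length header. If all you want is a hard cap with no custom logic, LWP::UserAgent's max_size attribute does much the same and marks truncated responses with a Client-Aborted header.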

Re: Determining Content-Length when there is no Content-Length header
by Anonymous Monk on Sep 30, 2007 at 03:20 UTC
    "This sort-of works, but for sites without a Content-Length header, I do a double-hit, HEAD first, then GET second. Is there a better way?"

    Always do a GET.
Re: Determining Content-Length when there is no Content-Length header
by aquarium (Curate) on Oct 01, 2007 at 03:40 UTC
    First of all, you probably don't need to find out the exact content length; you just need to know whether a URL serves data over a certain size threshold. Decide what threshold is acceptable and, instead of using the higher-level HTTP functions, use sockets to read the URL data up to that maximum size. While you're reading into your buffer, you can parse any Content-Length header that comes along. If a Content-Length header is present, you can decide right away whether to stop or to keep going and read the full file; if there's no Content-Length header, keep reading until you hit your threshold. Hope this makes sense. BTW, I think it's possible for a server to lie about Content-Length and get away with it.
    the hardest line to type correctly is: stty erase ^H
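Something like the following could implement the socket idea. It is only a sketch under heavy assumptions: plain HTTP on port 80, HTTP/1.0 so the body is not chunked, no redirects, no TLS. parse_partial_response and capped_fetch are names made up for the example.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Split a raw HTTP response buffer into (content_length, body_so_far).
# Returns an empty list while the header block is still incomplete.
# content_length is undef when the server sent no Content-Length header.
sub parse_partial_response {
    my ($buffer) = @_;
    my $end = index($buffer, "\r\n\r\n");
    return if $end < 0;
    my $head = substr($buffer, 0, $end);
    my $body = substr($buffer, $end + 4);
    my ($len) = $head =~ /^Content-Length:\s*(\d+)\s*$/mi;
    return ($len, $body);
}

# Hypothetical helper: read at most $limit body bytes over a plain socket.
# Returns ('too_big', undef) or ('ok', $body).
sub capped_fetch {
    my ($host, $path, $limit) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host, PeerPort => 80, Proto => 'tcp', Timeout => 10,
    ) or die "connect failed: $!\n";
    $sock->syswrite("GET $path HTTP/1.0\r\nHost: $host\r\n\r\n");

    my $buffer = '';
    while (sysread($sock, my $chunk, 4096)) {
        $buffer .= $chunk;
        my ($len, $body) = parse_partial_response($buffer) or next;
        return ('too_big', undef) if defined $len && $len > $limit;  # header says so
        return ('too_big', undef) if length($body) > $limit;         # counted it ourselves
    }
    close $sock;
    my (undef, $body) = parse_partial_response($buffer);
    return ('ok', $body);
}
```

Counting the bytes yourself also covers the case where the server lies about Content-Length, since the second check doesn't trust the header.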

      OK, this is a very old thread, but I looked at this thread when searching for some information on a related problem, and now that I've solved it I think it should be posted here since Googling "Perl CURLOPT_RANGE" doesn't currently return any useful hits.

      OK, the bottom line is that if you want to fetch a piece of a remote file using Perl, you can take the WWW::Curl package

      http://search.cpan.org/~szbalint/WWW-Curl-4.15/lib/WWW/Curl.pm

      and modify the first example to include the lines

          my $firstbyte = 50;
          my $lastbyte  = 100;
          $curl->setopt(CURLOPT_RANGE, "$firstbyte-$lastbyte");

      So the OP could use this technique to see whether, e.g. he's able to successfully fetch the 1,000,000th byte of a remote file. If he can fetch it, then he might decide not to try to download that file.

      I hope that this info is useful to someone.

        Nice idea, but not all web servers / web applications support byte ranges. I think the proper behaviour for a web server is to ignore the unknown / unsupported header and send the entire resource -- which is clearly not what the OP wanted. See also Re: Determining Content-Length when there is no Content-Length header

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
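Whichever client sends the one-byte range probe (WWW::Curl, LWP, or anything else), the HTTP status code of the answer is what tells you whether the probe meant anything. A small sketch of that decision, with interpret_range_probe being a name invented here:

```perl
use strict;
use warnings;

# Interpret the HTTP status of a one-byte range probe (Range: bytes=N-N).
sub interpret_range_probe {
    my ($status) = @_;
    return 'at_least_that_big' if $status == 206;    # byte N exists
    return 'smaller'           if $status == 416;    # N is past the end
    return 'range_ignored'     if $status == 200;    # whole resource is coming
    return 'unknown';                                # redirects, errors, etc.
}
```

Only 'at_least_that_big' is a safe reason to skip the download. On 'range_ignored' the server is about to send everything, so the transfer still needs a byte cap on the client side.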
