http://qs321.pair.com?node_id=666191

Yappo has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I am currently working on an improved spider that can fetch not only HTML content but also media that is not linked directly from the HTML source.
This information is stored in the headers and can be logged with LiveHTTPheaders and/or Tamper Data (Firefox plugins). What I would like to do is log the headers as they are sent and received, just as the LiveHTTPheaders plugin does.
I have tried LWP::UserAgent and WWW::Mechanize without results, even after sending exactly the same headers that were logged in Firefox.
I could get results with tshark, but that is not what I want.
Question: is it possible to trace ALL incoming and outgoing headers with a CPAN module, so that I can extract the location of the media that is transported in the headers?
To demonstrate what I mean, here is an example header from a LiveHTTPheaders log, recorded when a video plays after the URL has been loaded:
----------------------------------------------------------
http://somedomain.com/path/path/67c954839cbf962fe044893124536gtre3251.flv

GET /path/path/67c954839cbf962fe044893124536gtre3251.flv HTTP/1.1
Host: somedomain.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: ip_bt=38; wlrcmd=; __utma=68674883.507175212.1202182535.1202182535.1202182535.1; __utmb=68674883; __utmc=68674883; __utmz=68674883.1202182535.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); newsRead=2008-02-04+16%3A47%3A23

HTTP/1.x 200 OK
Date: Tue, 05 Feb 2008 03:35:51 GMT
Server: Apache
Last-Modified: Mon, 04 Feb 2008 05:24:23 GMT
Etag: "a0148cea-287267-56362fc0"
Accept-Ranges: bytes
Content-Length: 2650727
Expires: Mon, 11 Feb 2008 06:18:17 GMT
Age: 76654
Keep-Alive: timeout=5, max=128
Connection: Keep-Alive
Content-Type: text/plain
----------------------------------------------------------
I have never succeeded in getting the Host and GET-variable from the server.
Thanks.

Re: Getting all Headers
by naikonta (Curate) on Feb 05, 2008 at 06:59 UTC
    Subclass from LWP::UserAgent and implement the method progress. Then I think you can do pretty much whatever you want. The following snippet only prints out available response header names, but you can extend it to also print the header values, and for request headers as well.
    $ cat lwp-headers.pl
    #!/usr/bin/perl

    package MyLWP;
    use base 'LWP::UserAgent';

    sub progress {
        my($self, $status, $resp) = @_;
        if ($resp) {
            my @headers = $resp->header_field_names;
            print "response headers: @headers\n";
        }
    }

    package main;

    my $ua = MyLWP->new;
    my $url = shift || 'http://gmail.google.com';
    my $rp = $ua->get($url);
    print $rp->status_line, "\n";

    $ ./lwp-headers.pl http://www.google.com
    response headers: Cache-Control Connection Date Location Content-Length Content-Type Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Location Content-Length Content-Type Client-Date Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Content-Type Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Content-Type Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Content-Type Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Content-Type Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Content-Type Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Content-Type Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Content-Type Client-Date Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    200 OK

    $ ./lwp-headers.pl
    response headers: Cache-Control Connection Date Pragma Location Content-Type Client-Peer Client-Response-Num Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Pragma Location Content-Type Client-Date Client-Peer Client-Response-Num Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Pragma Location Content-Type Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Pragma Location Content-Type Client-Date Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Pragma Location Content-Type Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Pragma Location Content-Type Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Pragma Location Content-Type Client-Date Client-Peer Client-Response-Num Set-Cookie Title X-Cache X-Cache-Lookup
    response headers: Cache-Control Connection Date Pragma Server Content-Length Content-Type Client-Peer Client-Response-Num Client-SSL-Cert-Issuer Client-SSL-Cert-Subject Client-SSL-Cipher Client-SSL-Warning Set-Cookie
    .....
    response headers: Cache-Control Connection Date Pragma Server Content-Length Content-Type Client-Date Client-Peer Client-Response-Num Client-SSL-Cert-Issuer Client-SSL-Cert-Subject Client-SSL-Cipher Client-SSL-Warning Set-Cookie Title
    200 OK
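
The extension mentioned above (header values, and the request headers too) could look roughly like the sketch below. It assumes a libwww-perl release in which progress is called with the status 'end' and the finished HTTP::Response; format_exchange is a helper name made up for this example, not part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package MyLWP;
use base 'LWP::UserAgent';

# Hypothetical helper: render one response, and the request that
# produced it, as plain text.
sub format_exchange {
    my ($resp) = @_;
    my $out = '';
    if (my $req = $resp->request) {
        $out .= $req->method . ' ' . $req->uri . "\n";
        $out .= $req->headers->as_string;
        $out .= "\n";
    }
    $out .= $resp->status_line . "\n";
    $out .= $resp->headers->as_string;
    return $out;
}

# Assumption: progress receives the status 'end' plus the finished
# response. Earlier calls (numeric fractions, 'tick') are skipped so
# that each response is printed only once.
sub progress {
    my ($self, $status, $resp) = @_;
    return unless defined $status && $status eq 'end';
    return unless $resp && $resp->isa('HTTP::Response');
    print format_exchange($resp), "\n";
}

package main;

# Only touches the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $ua = MyLWP->new;
    my $rp = $ua->get($url);
    print $rp->status_line, "\n";
}
```

Since LWP handles each redirect as its own request, this should print one request/response pair per hop. The request headers shown are those stored in the HTTP::Request object, which may not include everything the protocol layer adds when sending.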

    Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

Re: Getting all Headers
by ikegami (Patriarch) on Feb 05, 2008 at 04:08 UTC

    I get the GET line. (Keep in mind that WWW::Mechanize is a subclass of LWP::UserAgent.)

    use LWP::UserAgent qw( );

    my $ua = LWP::UserAgent->new();
    my $response = $ua->get('http://www.google.com/');
    print($response->request()->as_string(), "\n");

    GET http://www.google.ca/
    User-Agent: libwww-perl/5.805

    Update: hum, is that really what it sends? Is that even valid?
    Update: yeah, it works if I send that raw. I'm not used to HTTP/0.9
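
On the question in the updates: as far as I know, as_string prints the HTTP::Request object and not the bytes that go over the wire, so the missing HTTP version and Host header do not necessarily mean they were not sent. A rough way to log every request and response a user agent handles is sketched below. It assumes a libwww-perl release that provides add_handler, which I believe is newer than the 5.80x releases mentioned in this thread; logging_ua and the $log array reference are names made up for this example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build a user agent that appends one line to
# @$log for every request it sends and every response it gets back.
sub logging_ua {
    my ($log) = @_;
    my $ua = LWP::UserAgent->new;

    # Runs right before each request is handed to the protocol module.
    $ua->add_handler(request_send => sub {
        my ($request) = @_;
        my $uri  = $request->uri;
        my $host = $uri->can('host') ? $uri->host : '';
        push @$log, join ' ', 'request:', $request->method, $uri, "host=$host";
        return;    # returning nothing lets the request go out as usual
    });

    # Runs once for each response, redirect hops included.
    $ua->add_handler(response_done => sub {
        my ($response) = @_;
        push @$log, 'response: ' . $response->status_line;
        return;
    });

    return $ua;
}

# Only touches the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my @log;
    logging_ua(\@log)->get($url);
    print "$_\n" for @log;
}
```

The host is taken from the request URI here because the Host header itself appears to be added later by the protocol layer, so it may not be present in the request object at this point.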

Re: Getting all Headers
by Yappo (Novice) on Feb 05, 2008 at 15:25 UTC
    @ikegami: Result of your code is:
    GET http://www.google.com/
    User-Agent: libwww-perl/5.805
    @naikonta: Result of your code is:
    200 OK
    Unfortunately no response headers...
      I use LWP::UserAgent version 2.036 from distribution libwww-perl-5.808. The progress method was added in libwww-perl-5.806. Upgrade your version and try again. I wonder, however, how my code runs at all with libwww-perl-5.805 (which included LWP::UA 2.033, which didn't provide the progress method yet).

      Update: How stupid of me. My code runs with LWP::UA 2.033 because the progress method isn't called at all.


      Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

        Thanks.
        I upgraded and got this to work. Unfortunately, this method does not give me the result I need using headers_as_string.
        I will give you an example:
        http://www.clipfish.de/player.php?videoid=MzkyOTN8OTA%3D
        has a video inside.
        There is no direct link to this video in the HTML source; the location is only sent within the headers.
        Using LWP::UserAgent with the progress method does not reveal this URL, which is as follows (snippet from the LiveHTTPheaders log):
        http://pg1.clipfish.de/media/96/4efec90e3f12d631b4f5b490db152596.flv

        GET /media/96/4efec90e3f12d631b4f5b490db152596.flv HTTP/1.1
        Host: pg1.clipfish.de
        ...
        Would it be possible to get the "Host" information and the URL of the video from the headers using LWP, or do I have to use another module or method?
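
One thing that may help here: LiveHTTPheaders logs every request the browser makes, while a single LWP get only issues the request for the URL it was given plus any redirects it follows. A minimal sketch for listing the URL and host of each of those requests is below; hops is a name made up for this example. If the video is requested by the embedded player rather than reached through a redirect, it would not show up in this list, and its URL would have to be found in whatever the player loads.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: one record per request made for a fetch,
# redirect hops first, final response last.
sub hops {
    my ($response) = @_;
    my @hops;
    for my $r ($response->redirects, $response) {
        my $uri = $r->request->uri;
        push @hops, {
            url      => "$uri",
            host     => $uri->can('host') ? $uri->host : undef,
            status   => $r->code,
            location => scalar $r->header('Location'),
        };
    }
    return @hops;
}

# Only touches the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $response = LWP::UserAgent->new->get($url);
    for my $hop (hops($response)) {
        my $host = defined $hop->{host} ? $hop->{host} : '-';
        my $next = defined $hop->{location} ? " -> $hop->{location}" : '';
        print "$hop->{status} $hop->{url} (host $host)$next\n";
    }
}
```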
Re: Getting all Headers
by Anonymous Monk on Feb 05, 2008 at 10:58 UTC
    I have never succeeded in getting the Host and GET-variable from the server.
    Show your code