http://qs321.pair.com?node_id=1113451

sam_bakki has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I have a Perl script that downloads data from an HTTPS site. I was using Crypt::SSLeay, and the script worked fine: it downloaded the full data (a CSV file) from the server properly.

I thought I would give LWP's default IO::Socket::SSL backend a try.

Actually I am using WWW::Mechanize in my script. The script failed in the $mech->response()->decoded_content() phase. I tried to debug further and found that it could not decompress the gzip-compressed data sent by the server.

Surprised, I decided to dig deeper and disabled the compression using  $mech->add_header('Accept-Encoding' => '');

Now I could see data coming from the server, but it was not the complete data; I saw only the first few bytes. Examining the HTTP::Response headers, I found:

'client-transfer-encoding' => [ 'chunked' ]


It looks like the server is sending the data to me chunked. LWP with IO::Socket::SSL does not seem to handle the "chunked" data transfer, so the gzip content decoding fails.

When I force it to use Crypt::SSLeay, like below,

use Crypt::SSLeay;
use Net::SSL;
use WWW::Mechanize;
....
$ENV{PERL_NET_HTTPS_SSL_SOCKET_CLASS} = "Net::SSL";
$mech = WWW::Mechanize->new( autocheck => 1, noproxy => 1,
                             ssl_opts  => { 'verify_hostname' => 0 } );
...


I see the full data coming from the server. I still see the "chunked" header, but it is handled properly by Net::SSL / Crypt::SSLeay.

Q: Has anyone faced this issue? Can Perl's LWP handle "chunked" data transfer over SSL? Thanks for your time.
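
A minimal diagnostic sketch (illustrative; not one of the test scripts added below) that may help show what is going on: after the get(), it prints the headers LWP itself sets when reading the body dies part-way, so a truncated transfer shows up explicitly rather than only as a silently short file.

# Diagnostic sketch: did LWP abort the body read part-way through?
use strict;
use warnings;
use WWW::Mechanize;

my $url  = 'https://developer.apple.com/standards/qtff-2001.pdf';
my $mech = WWW::Mechanize->new( autocheck => 0,
                                ssl_opts  => { verify_hostname => 0 } );
$mech->get($url);

my $resp = $mech->response();
print "Status          : ", $resp->status_line, "\n";
print "Content-Length  : ", $resp->header('Content-Length') // '(none)', "\n";
print "Body bytes read : ", length( $resp->content ), "\n";

# LWP records a failed/aborted body read in these response headers.
for my $h ( 'Client-Aborted', 'X-Died', 'Client-Warning' ) {
    my $v = $resp->header($h);
    print "$h : $v\n" if defined $v;
}

For a chunked response there is usually no Content-Length header, so the Client-Aborted / X-Died headers and the saved file size are the main clues.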

Update: Added two test scripts to demonstrate the problem.

One uses Net::SSL and downloads the data properly from the server.
The other uses IO::Socket::SSL, downloads only the first chunk (I think) from the server, and quits.

To show the differences between the downloads, I have printed the MD5 sums and file sizes.

My environment
OS: Windows 7, x86_64
Perl: Active Perl, perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x86-multi-thread-64int

Note: I saw the same behavior in Active Perl 5.10, 5.14, 5.16 and 5.18

Script 1 - Using Net::SSL and Crypt::SSLeay - Working

#WORKING HTTPS DOWNLOAD Using Net::SSL in Windows + Active Perl
use strict;
use warnings;
use Crypt::SSLeay;
use Net::SSL;
use WWW::Mechanize;
use HTTP::Cookies;
use HTTP::Message;
use Digest::MD5;
use File::Slurp;
use Data::Dumper;

#Globals
$| = 1;

#Force LWP to use Net::SSL instead of IO::Socket::SSL
$ENV{PERL_NET_HTTPS_SSL_SOCKET_CLASS} = "Net::SSL";
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME}    = 0;
delete $ENV{https_proxy} if exists $ENV{https_proxy};
delete $ENV{http_proxy}  if exists $ENV{http_proxy};

#Variables
my $browser     = "";
my $url         = 'https://developer.apple.com/standards/qtff-2001.pdf';
my $pageContent = '';
my $fileName    = '';
my $md5Obj      = Digest::MD5->new();

print "\n USING Net::SSL";

#Init Mechanize
$browser = WWW::Mechanize->new( autocheck => 1, noproxy => 1,
                                ssl_opts  => { 'verify_hostname' => 0 } );

# Add cookie jar
$browser->cookie_jar( HTTP::Cookies->new() );
$browser->agent_alias('Linux Mozilla');
$browser->add_header( 'Accept-Encoding' => scalar HTTP::Message::decodable() );
$browser->timeout(120);

#Get URL
$browser->get($url);
if ( $browser->success() ) {
    print "\n INFO: Got URL: $url";
    $fileName = $browser->response()->filename();
    print "\n INFO: Save in File: $fileName";
    $browser->save_content($fileName);

    #Calculate MD5 sum
    $pageContent = read_file( $fileName, binmode => ':raw' );
    print "\n INFO: $fileName Size: ", length($pageContent) / 1024, " KB";
    $md5Obj->add($pageContent);
    print "\n INFO: $fileName MD5 Sum: ", $md5Obj->hexdigest();
    undef $md5Obj;
}
else {
    print "\n ERROR: Can't get URL $url ", $browser->status();
}

print "\n\n INFO: ********************* DUMP ********************";
print "\n", Dumper( \$browser );
print "\n INFO: ********************* DUMP ********************";

exit 0;

Output1:


  USING Net::SSL
 INFO: Got URL: https://developer.apple.com/standards/qtff-2001.pdf
 INFO: Save in File: qtff-2001.pdf
 INFO: qtff-2001.pdf Size: 5465.48046875 KB
 INFO: qtff-2001.pdf MD5 Sum: d1aee95cc06d529e67b707257a5cf3eb

Script 2 - Using IO::Socket::SSL - Not Working. Only part of the PDF file is downloaded

#NOT WORKING HTTPS DOWNLOAD Using IO::Socket::SSL in Windows + Active Perl
use strict;
use warnings;
use IO::Socket::SSL;
use WWW::Mechanize;
use HTTP::Cookies;
use HTTP::Message;
use Digest::MD5;
use File::Slurp;
use Data::Dumper;

#Globals
$| = 1;
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;

#Variables
my $browser     = "";
my $url         = 'https://developer.apple.com/standards/qtff-2001.pdf';
my $pageContent = '';
my $fileName    = '';
my $md5Obj      = Digest::MD5->new();

print "\n USING IO::Socket::SSL";

#Init Mechanize
$browser = WWW::Mechanize->new( autocheck => 1, noproxy => 1,
                                ssl_opts  => { 'verify_hostname' => 0 } );

# Add cookie jar
$browser->cookie_jar( HTTP::Cookies->new() );
$browser->agent_alias('Linux Mozilla');
$browser->add_header( 'Accept-Encoding' => scalar HTTP::Message::decodable() );
$browser->timeout(120);

#Get URL
$browser->get($url);
if ( $browser->success() ) {
    print "\n INFO: Got URL: $url";
    $fileName = $browser->response()->filename();
    print "\n INFO: Save in File: $fileName";
    $browser->save_content($fileName);

    #Calculate MD5 sum
    $pageContent = read_file( $fileName, binmode => ':raw' );
    print "\n INFO: $fileName Size: ", length($pageContent) / 1024, " KB";
    $md5Obj->add($pageContent);
    print "\n INFO: $fileName MD5 Sum: ", $md5Obj->hexdigest();
    undef $md5Obj;
}
else {
    print "\n ERROR: Can't get URL $url ", $browser->status();
}

print "\n\n INFO: ********************* DUMP ********************";
print "\n", Dumper( \$browser );
print "\n INFO: ********************* DUMP ********************";

exit 0;

Output2:


  USING IO::Socket::SSL
 INFO: Got URL: https://developer.apple.com/standards/qtff-2001.pdf
 INFO: Save in File: qtff-2001.pdf
 INFO: qtff-2001.pdf Size: 6.66796875 KB
 INFO: qtff-2001.pdf MD5 Sum: 4049c364f7332790c3abe548d6a4297c

I did not paste the Dumper output because it is huge and does not copy properly into the browser because of the binary content.

Please help me understand why the scripts behave differently. I am thinking it is a chunking issue ...

Thanks & Regards,
Bakkiaraj M
My Perl Gtk2 technology demo project - http://code.google.com/p/saaral-soft-search-spider/ , contributions are welcome.

Replies are listed 'Best First'.
Re: Perl LWP Can handle client-transfer-encoding = chunked encoding?
by FloydATC (Deacon) on Jan 16, 2015 at 09:02 UTC

    Generally speaking, LWP does not care about the actual content or format; it simply passes on to the web server, in the form of properly formatted request headers, what kind of content encoding you are willing to accept.

    The web server should pay attention to this and either send data in an appropriate format/encoding (as indicated by the response headers) or inform you that it can't comply with your requirements. You may request documents in 9 bit Morse encoding compressed with a proprietary Javascript library but the server is free to say no. What the server should not do is send you data in an imaginary format when you clearly stated what you were willing to accept.

    LWP has some convenience features that can help you with decoding some common, well-known encoding types, but other than that you are free to take the binary content and decode it any way you like, which hopefully matches what the server indicated.

    In this particular case, you may want to look at modules that deal with MIME content unless I'm very much mistaken.
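
    A small sketch of that negotiation (illustrative only; the URL is just the test file used later in this thread): advertise the encodings this Perl install can decode, then let decoded_content() undo whatever encoding the server actually applied.

    # Sketch: negotiate Accept-Encoding and let LWP decode the response.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Message;

    my $ua  = LWP::UserAgent->new( ssl_opts => { verify_hostname => 0 } );
    my $url = 'https://developer.apple.com/standards/qtff-2001.pdf';

    # Offer only the encodings this installation can actually decode.
    my $resp = $ua->get( $url,
        'Accept-Encoding' => scalar HTTP::Message::decodable() );

    print "Content-Encoding from server: ",
          $resp->header('Content-Encoding') // '(none)', "\n";

    # decoded_content() reverses the Content-Encoding (gzip, deflate, ...);
    # the raw, still-encoded bytes stay available via content().
    my $body = $resp->decoded_content( charset => 'none' ) // '';
    print "Decoded length: ", length($body), " bytes\n";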

    -- FloydATC

    Time flies when you don't know what you're doing

      Hi FloydATC

      Thanks for the details. IMHO, I am suffering from the "chunked" mode of transfer. Basically, the server sends the data in multiple chunks, so LWP needs to make multiple requests to get the full data. As you have stated, LWP might not care (or does not need to care) about the data and its encoding, chunked or not.

      It looks like Net::SSL can handle the chunks properly, whereas IO::Socket::SSL is not handling them. I am not sure how to disable this chunking with the right HTTP headers so that the server always sends the data in one single chunk.

      Thanks & Regards,
      Bakkiaraj M
      My Perl Gtk2 technology demo project - http://code.google.com/p/saaral-soft-search-spider/ , contributions are welcome.

        No, that's usually not what "chunked" encoding means. It usually means that the response contains more than one file/document and you have to use the appropriate method to separate them.

        What you are referring to is a partial response, which the server is only allowed to send if you explicitly ask for it. This is commonly used when resuming a large download or seeking in streamed media. (See "range request".)
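
        A small sketch of such a range request with LWP (illustrative only): the client asks for part of the resource, and a server that honours ranges answers with 206 Partial Content and a Content-Range header.

        # Sketch: explicit partial download via an HTTP Range request.
        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua  = LWP::UserAgent->new( ssl_opts => { verify_hostname => 0 } );
        my $url = 'https://developer.apple.com/standards/qtff-2001.pdf';

        # Ask for the first 1024 bytes only.
        my $resp = $ua->get( $url, 'Range' => 'bytes=0-1023' );

        print "Status       : ", $resp->status_line, "\n";
        print "Content-Range: ", $resp->header('Content-Range') // '(none)', "\n";
        print "Bytes read   : ", length( $resp->content ), "\n";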

        -- FloydATC

        Time flies when you don't know what you're doing

Re: Perl LWP Can handle client-transfer-encoding = chunked encoding?
by noxxi (Pilgrim) on Jan 16, 2015 at 09:45 UTC
    Support for chunked encoding is independent of the SSL backend (i.e. Crypt::SSLeay or IO::Socket::SSL). And as far as I know, LWP supports chunked encoding both for requests and for responses, as required by HTTP/1.1.
    Could you please show a test program (with a URL you have problems with) so that others can reproduce your problem? And please note the versions of LWP::UserAgent and IO::Socket::SSL you are using.
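
    A quick sketch for reporting those versions (illustrative only):

    # Sketch: print the versions of the modules involved.
    use strict;
    use warnings;
    use LWP::UserAgent  ();
    use IO::Socket::SSL ();
    use WWW::Mechanize  ();
    use Net::HTTP       ();

    printf "Perl            %vd\n", $^V;
    printf "LWP::UserAgent  %s\n",  LWP::UserAgent->VERSION;
    printf "IO::Socket::SSL %s\n",  IO::Socket::SSL->VERSION;
    printf "WWW::Mechanize  %s\n",  WWW::Mechanize->VERSION;
    printf "Net::HTTP       %s\n",  Net::HTTP->VERSION;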

      Hi noxxi

      As you suggested, I have written two scripts:
      One uses Net::SSL and downloads the data properly from the server.
      The other uses IO::Socket::SSL, downloads only the first chunk (I think) from the server, and quits.

      To show the differences between the downloads, I have printed the MD5 sums and file sizes.

      My environment
      OS: Windows 7, x86_64
      Perl: Active Perl, perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x86-multi-thread-64int

      Note: I saw the same behavior in Active Perl 5.10, 5.14, 5.16 and 5.18

      Script 1 - Using Net::SSL and Crypt::SSLeay - Working

      #WORKING HTTPS DOWNLOAD Using Net::SSL in Windows + Active Perl
      use strict;
      use warnings;
      use Crypt::SSLeay;
      use Net::SSL;
      use WWW::Mechanize;
      use HTTP::Cookies;
      use HTTP::Message;
      use Digest::MD5;
      use File::Slurp;
      use Data::Dumper;

      #Globals
      $| = 1;

      #Force LWP to use Net::SSL instead of IO::Socket::SSL
      $ENV{PERL_NET_HTTPS_SSL_SOCKET_CLASS} = "Net::SSL";
      $ENV{PERL_LWP_SSL_VERIFY_HOSTNAME}    = 0;
      delete $ENV{https_proxy} if exists $ENV{https_proxy};
      delete $ENV{http_proxy}  if exists $ENV{http_proxy};

      #Variables
      my $browser     = "";
      my $url         = 'https://developer.apple.com/standards/qtff-2001.pdf';
      my $pageContent = '';
      my $fileName    = '';
      my $md5Obj      = Digest::MD5->new();

      print "\n USING Net::SSL";

      #Init Mechanize
      $browser = WWW::Mechanize->new( autocheck => 1, noproxy => 1,
                                      ssl_opts  => { 'verify_hostname' => 0 } );

      # Add cookie jar
      $browser->cookie_jar( HTTP::Cookies->new() );
      $browser->agent_alias('Linux Mozilla');
      $browser->add_header( 'Accept-Encoding' => scalar HTTP::Message::decodable() );
      $browser->timeout(120);

      #Get URL
      $browser->get($url);
      if ( $browser->success() ) {
          print "\n INFO: Got URL: $url";
          $fileName = $browser->response()->filename();
          print "\n INFO: Save in File: $fileName";
          $browser->save_content($fileName);

          #Calculate MD5 sum
          $pageContent = read_file( $fileName, binmode => ':raw' );
          print "\n INFO: $fileName Size: ", length($pageContent) / 1024, " KB";
          $md5Obj->add($pageContent);
          print "\n INFO: $fileName MD5 Sum: ", $md5Obj->hexdigest();
          undef $md5Obj;
      }
      else {
          print "\n ERROR: Can't get URL $url ", $browser->status();
      }

      print "\n\n INFO: ********************* DUMP ********************";
      print "\n", Dumper( \$browser );
      print "\n INFO: ********************* DUMP ********************";

      exit 0;

      Output1:

      
        USING Net::SSL
       INFO: Got URL: https://developer.apple.com/standards/qtff-2001.pdf
       INFO: Save in File: qtff-2001.pdf
       INFO: qtff-2001.pdf Size: 5465.48046875 KB
       INFO: qtff-2001.pdf MD5 Sum: d1aee95cc06d529e67b707257a5cf3eb
      

      Script 2 - Using IO::Socket::SSL - Not Working. Only part of the PDF file is downloaded

      #NOT WORKING HTTPS DOWNLOAD Using IO::Socket::SSL in Windows + Active Perl
      use strict;
      use warnings;
      use IO::Socket::SSL;
      use WWW::Mechanize;
      use HTTP::Cookies;
      use HTTP::Message;
      use Digest::MD5;
      use File::Slurp;
      use Data::Dumper;

      #Globals
      $| = 1;
      $ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;

      #Variables
      my $browser     = "";
      my $url         = 'https://developer.apple.com/standards/qtff-2001.pdf';
      my $pageContent = '';
      my $fileName    = '';
      my $md5Obj      = Digest::MD5->new();

      print "\n USING IO::Socket::SSL";

      #Init Mechanize
      $browser = WWW::Mechanize->new( autocheck => 1, noproxy => 1,
                                      ssl_opts  => { 'verify_hostname' => 0 } );

      # Add cookie jar
      $browser->cookie_jar( HTTP::Cookies->new() );
      $browser->agent_alias('Linux Mozilla');
      $browser->add_header( 'Accept-Encoding' => scalar HTTP::Message::decodable() );
      $browser->timeout(120);

      #Get URL
      $browser->get($url);
      if ( $browser->success() ) {
          print "\n INFO: Got URL: $url";
          $fileName = $browser->response()->filename();
          print "\n INFO: Save in File: $fileName";
          $browser->save_content($fileName);

          #Calculate MD5 sum
          $pageContent = read_file( $fileName, binmode => ':raw' );
          print "\n INFO: $fileName Size: ", length($pageContent) / 1024, " KB";
          $md5Obj->add($pageContent);
          print "\n INFO: $fileName MD5 Sum: ", $md5Obj->hexdigest();
          undef $md5Obj;
      }
      else {
          print "\n ERROR: Can't get URL $url ", $browser->status();
      }

      print "\n\n INFO: ********************* DUMP ********************";
      print "\n", Dumper( \$browser );
      print "\n INFO: ********************* DUMP ********************";

      exit 0;

      Output2:

      
        USING IO::Socket::SSL
       INFO: Got URL: https://developer.apple.com/standards/qtff-2001.pdf
       INFO: Save in File: qtff-2001.pdf
       INFO: qtff-2001.pdf Size: 6.66796875 KB
       INFO: qtff-2001.pdf MD5 Sum: 4049c364f7332790c3abe548d6a4297c
      
      

      I did not paste the Dumper output because it is huge and does not copy properly into the browser because of the binary content.

      Please help me understand why the scripts behave differently. I am thinking it is a chunking issue ...

      Thanks & Regards,
      Bakkiaraj M
      My Perl Gtk2 technology demo project - http://code.google.com/p/saaral-soft-search-spider/ , contributions are welcome.