Re: Problem while using WWW::Mechanize module for getting html

by cavac (Curate)
on Mar 04, 2020 at 14:43 UTC

in reply to Problem while using WWW::Mechanize module for getting html

Gzip compression of a response depends on the client sending an "Accept-Encoding" header in the request that contains the string "gzip". (Other compression schemes, like bzip2, are also possible.)

If the server supports gzip and the client has requested it, the server *may* decide to send the BODY of the response compressed as a gzip stream. Whether it does depends on things like whether the file is compressible and whether the server wants to spend CPU resources to reduce network load at that point in time. If it compresses, it adds a "Content-Encoding" header to the response with the value set to "gzip".
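To make that negotiation concrete, here is a minimal sketch of the server side in Perl, using only core modules. The sub names (client_accepts_gzip, build_response) and the 64-byte threshold are made up for illustration; real servers apply their own rules, and this sketch only looks for an explicit "gzip" token and an explicit q=0 refusal.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical helper: strip leading/trailing whitespace.
sub trim { my ($s) = @_; $s =~ s/^\s+//; $s =~ s/\s+$//; return $s }

# Hypothetical helper: true if the Accept-Encoding value lists gzip.
# Simplified on purpose: only an explicit q=0 is treated as a refusal,
# other q-values and the "*" wildcard are ignored.
sub client_accepts_gzip {
    my ($accept_encoding) = @_;
    return 0 unless defined $accept_encoding;
    for my $item (split /,/, $accept_encoding) {
        my ($coding, @params) = map { trim($_) } split /;/, $item;
        next unless defined $coding && lc($coding) eq 'gzip';
        # "gzip;q=0" means the client explicitly refuses gzip
        return 0 if grep { /^q\s*=\s*0(?:\.0{0,3})?$/i } @params;
        return 1;
    }
    return 0;
}

# Hypothetical helper: returns (\%headers, $body) for a body of raw bytes.
# The server MAY compress; this sketch skips bodies under an arbitrary
# 64 bytes, where gzip overhead would not pay off.
sub build_response {
    my ($accept_encoding, $body) = @_;
    my %headers = ('Content-Type' => 'text/html');
    if (client_accepts_gzip($accept_encoding) && length($body) >= 64) {
        my $compressed;
        gzip(\$body => \$compressed) or die "gzip failed: $GzipError";
        $headers{'Content-Encoding'} = 'gzip';   # tells the client to gunzip
        $body = $compressed;
    }
    return (\%headers, $body);
}
```

Calling build_response('gzip, deflate', $html) with a reasonably large $html returns a compressed body plus the Content-Encoding header; calling it with 'gzip;q=0' or with a tiny body returns the bytes untouched and no Content-Encoding header.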

From what I remember, ye olde WWW::Mechanize doesn't send any Accept-Encoding header, which is what gets it into trouble sometimes. Let me quote from RFC 7231, page 41, section "5.3.4 Accept-Encoding", item 1:

If no Accept-Encoding field is in the request, any content-coding is considered acceptable by the user agent.

Here is the link:

This is what can get WWW::Mechanize in trouble, because the server MAY decide to use gzip, bzip2 or whatever in the reply. If you use WWW::Mechanize::GZip, which *does* send the correct header, the server is only allowed to send the body either uncompressed or gzip-compressed, and WWW::Mechanize::GZip understands both, as far as I remember. It's just the more reliable option.

BTW, the compression we are talking about here is an encoding applied for the transfer; it isn't the same as a "file format". So you won't download a .gz file and unzip it. Instead, the content just gets gzipped on the server side for sending over the network, and the client library automatically decompresses it before handing it (uncompressed) to your code. In practice, your script should not even notice (or care) that this compression magic is going on in the background to save network bandwidth and speed up data transfer.
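For the curious, this is roughly what a client library does behind your back. It is a sketch only, not the actual code of WWW::Mechanize::GZip. It uses the core module HTTP::Tiny, which (as far as I know) does not decompress for you, so the decoding step is visible. The sub names decoded_body and fetch_html are made up, and only a single gzip coding is handled.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: takes a response shaped like the hash HTTP::Tiny
# returns ({ headers => { lowercased names }, content => $bytes }) and
# returns the body the way the application should see it.
sub decoded_body {
    my ($response) = @_;
    my $encoding = lc($response->{headers}{'content-encoding'} // '');

    # No Content-Encoding (or "identity"): the body was sent as-is
    return $response->{content} if $encoding eq '' || $encoding eq 'identity';

    if ($encoding eq 'gzip' || $encoding eq 'x-gzip') {
        my $plain;
        gunzip(\$response->{content} => \$plain)
            or die "gunzip failed: $GunzipError";
        return $plain;
    }

    # We only advertised gzip, so anything else is unexpected
    die "unsupported Content-Encoding: $encoding";
}

# Hypothetical helper: ask for gzip explicitly, then undo it transparently.
sub fetch_html {
    my ($url) = @_;
    require HTTP::Tiny;    # core module since Perl 5.14
    my $response = HTTP::Tiny->new->get(
        $url, { headers => { 'Accept-Encoding' => 'gzip' } },
    );
    die "request failed: $response->{status} $response->{reason}"
        unless $response->{success};
    return decoded_body($response);
}
```

The caller of fetch_html() gets plain HTML whether or not the server chose to compress, which is the whole point: the compression lives only on the wire.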

