Dealing with binary data and WWW::Mechanize and encoding stuff

by friedo (Prior)
on Dec 07, 2008 at 08:38 UTC

friedo has asked for the wisdom of the Perl Monks concerning the following question:


I've run into some confusion with an encoding issue. I'm fetching some files with Mechanize (PDFs in particular) and passing them in-memory to another function, in this case Compress::Zlib::memGzip.

    use Compress::Zlib;
    ...
    $mech->get( $pdf_url );
    my $compressed = Compress::Zlib::memGzip( $mech->content );

I'm getting the dreaded "wide character in memGzip" warning when I do this, which, if my understanding is correct, tells me a few things:
  • Mechanize (or somebody) is storing the PDF data as a character string
  • Since PDF is a byte format, I really don't want it in a character string
  • memGzip doesn't want a character string either
  • I have to somehow make it not a character string
And that's where I'm lost. I know how to convert character data to various encodings using Encode, and how to set binmode on a filehandle, but I can't seem to work out how to get that PDF data in the format it should be in.
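
To make the byte string versus character string distinction concrete, here is a minimal sketch that needs no network. The `$bytes` payload is a made-up stand-in for a PDF body, and the decode step only imitates what a helpful library might do behind your back; it is not a claim about exactly what Mechanize does internally.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Compress::Zlib;

# Hypothetical stand-in for a PDF body: raw octets, including a 3-octet sequence.
my $bytes = "%PDF-1.4\n\xE2\x82\xAC\n";

# Decoding turns those 3 octets into the single character U+20AC,
# which no longer fits in one byte: a "wide character".
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d octets, chars: %d characters\n",
    length($bytes), length($chars);

# With a true second argument, utf8::downgrade returns false instead of
# dying when the string cannot be represented as bytes.
my $copy = $chars;
my $is_byte_safe = utf8::downgrade($copy, 1);
print "character string fits in bytes? ", ($is_byte_safe ? "yes" : "no"), "\n";

# memGzip is expected to reject the character string; eval guards against
# it dying rather than merely warning.
my $bad = eval { Compress::Zlib::memGzip($chars) };
print "memGzip on characters: ", (defined $bad ? "succeeded" : "failed"), "\n";

# The untouched octets compress without complaint.
my $good = Compress::Zlib::memGzip($bytes);
print "memGzip on bytes: ", (defined $good ? "succeeded" : "failed"), "\n";
```

The point of the sketch is that the fix is not to re-encode the character string, but to get hold of the octets before anything decodes them.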

Mech does have a save_content method which promises to save the content in binary mode if it's not a text/* MIME type (and I've checked that the MIME type is correct.) However, I'd hate to have to dump the content to a temp file just to read it in again.

Replies are listed 'Best First'.
Re: Dealing with binary data and WWW::Mechanize and encoding stuff
by davidrw (Prior) on Dec 07, 2008 at 15:19 UTC
    Taking advantage of LWP::UserAgent's "Handlers" (WWW::Mechanize is a proper subclass of LWP::UserAgent) might give you better access to the content, and IO::Compress::Gzip lets you compress that content a piece at a time.

    use IO::Compress::Gzip;

    my $compressed;
    my $z;

    $mech->add_handler( response_header => sub {
        my($response, $ua, $h) = @_;
        $response->{default_add_content} = 0;
        $z = IO::Compress::Gzip->new(\$compressed) or die;
    } );

    $mech->add_handler( response_data => sub {
        my($response, $ua, $h, $data) = @_;
        print $z $data or die $!;
        return 1;
    } );

    $mech->add_handler( response_done => sub {
        my($response, $ua, $h) = @_;
        close $z or die $!;
    } );

    $mech->get($pdf_url);
    warn length $mech->content;  # 0 cause of the 'default_add_content' setting
    warn length $compressed;

    Note that $z->print() and $z->close() work too.
    Note that LWP::UserAgent has a remove_handler method, too, in case this $mech object has to go do other stuff.

    Also, does IO::Compress::Gzip::gzip handle it any better directly from $mech->content?

    (Side note) You can potentially save an in-memory copy of the content by passing $mech->content_ref to Compress::Zlib::memGzip instead.
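
As a rough sketch of the two suggestions above (the one-shot `gzip` function and passing a reference to `memGzip`), the following runs on a plain byte string. `$bytes` is hypothetical filler standing in for the downloaded PDF; whether either call copes with what `$mech->content` returns still depends on that value being bytes in the first place.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use Compress::Zlib;

# Hypothetical stand-in for the downloaded PDF octets.
my $bytes = "%PDF-1.4\n" . ("\x00\xFF binary filler " x 100);

# One-shot interface: input and output are both scalar references.
my $gz;
gzip(\$bytes => \$gz) or die "gzip failed: $GzipError";

# memGzip also accepts a reference, so the buffer is not copied at the call.
my $gz2 = Compress::Zlib::memGzip(\$bytes);
defined $gz2 or die "memGzip failed";

# Decompress again to check that nothing was lost along the way.
my $back;
gunzip(\$gz => \$back) or die "gunzip failed: $GunzipError";
print "round trip ok\n" if $back eq $bytes;
printf "%d octets in, %d octets compressed\n", length($bytes), length($gz);
```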

      Cool. I had forgotten about LWP::UserAgent's callback features. (In fact, there's all sorts of goodies in LWP::UA that one doesn't remember if one looks only at the Mech docs.)
Re: Dealing with binary data and WWW::Mechanize and encoding stuff
by Anonymous Monk on Dec 07, 2008 at 08:44 UTC

      That was the clue I needed. The underlying content method from HTTP::Message will get me the original unmolested byte string from the server. So all I have to do is use $mech->response->content instead of $mech->content, and it works perfectly.
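
For anyone landing here later, the difference between the two accessors can be seen offline with a hand-built HTTP::Response. This is only a sketch: the response object and its payload are made up, and text/plain with a charset is used because that is a case where charset decoding visibly kicks in. It assumes `$mech->content` hands back something like the decoded form, while `$mech->response->content` is the plain HTTP::Message accessor.

```perl
use strict;
use warnings;
use HTTP::Response;
use Compress::Zlib;

# Hypothetical response built by hand, so no network access is needed.
my $octets = "price: \xE2\x82\xAC5\n";   # UTF-8 encoded euro sign, as raw octets
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    $octets,
);

my $raw     = $res->content;          # octets exactly as stored in the message
my $decoded = $res->decoded_content;  # character string after charset decoding

printf "raw: %d octets, decoded: %d characters\n",
    length($raw), length($decoded);

# The raw octets are what a byte-oriented function like memGzip wants.
my $gz = Compress::Zlib::memGzip($raw);
defined $gz or die "memGzip failed";
printf "compressed to %d octets\n", length($gz);
```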