http://qs321.pair.com?node_id=11121073

Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

I have an issue with reading a gzipped UTF-8 encoded file.

Here is an example:

Preparation: put an umlaut into a file and gzip it, keeping the original (-k):

echo ü > umlaut
gzip -k umlaut

Now check the difference :(

perl -e '
  use IO::Uncompress::Gunzip;
  binmode(STDOUT, ":utf8");
  open my $in, "<:utf8", "umlaut";
  $_ = <$in>;
  print "Uncompressed: $_ ", ord($_), "\n";
  my $gin = IO::Uncompress::Gunzip->new("umlaut.gz");
  binmode($gin, ":utf8");
  $_ = <$gin>;
  print "  Compressed: $_ ", ord($_), "\n";
'

Output

Uncompressed: ü
 252
  Compressed: ü
 195

In theory there shouldn't be a difference between the outputs :(

Update: I learned that "binmode" won't do anything to the IO::Uncompress::Gunzip filehandle.

Handling the decode myself, not relying on an IO-layer, gives the expected result:

perl -e '
  use IO::Uncompress::Gunzip;
  use Encode;
  binmode(STDOUT, ":utf8");
  my $gin = IO::Uncompress::Gunzip->new("umlaut.gz");
  $_ = <$gin>;
  $_ = Encode::decode("UTF-8", $_);
  print "  Compressed: $_ ", ord($_), "\n";
'
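The manual decode can also be wrapped in a small helper. What follows is only a minimal sketch using core modules (IO::Compress::Gzip, IO::Uncompress::Gunzip, Encode, File::Temp). The helper name read_gz_utf8_lines and the temporary test file are made up for illustration and are not part of my real code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: read a gzipped file and return its lines
# decoded from UTF-8 into Perl character strings.
sub read_gz_utf8_lines {
    my ($path) = @_;
    my $gin = IO::Uncompress::Gunzip->new($path)
        or die "Can't gunzip $path: $GunzipError\n";
    my @lines;
    while ( defined( my $raw = <$gin> ) ) {
        # The handle yields raw bytes, so decode each line explicitly.
        push @lines, decode( "UTF-8", $raw );
    }
    $gin->close;
    return @lines;
}

# Build a test file with the same content as "echo ü > umlaut; gzip umlaut".
my $dir   = tempdir( CLEANUP => 1 );
my $path  = "$dir/umlaut.gz";
my $bytes = encode( "UTF-8", "\x{FC}\n" );
gzip( \$bytes => $path ) or die "gzip failed: $GzipError\n";

my @lines = read_gz_utf8_lines($path);
binmode( STDOUT, ":encoding(UTF-8)" );
print "Compressed: $lines[0] ", ord( $lines[0] ), "\n";
```

With this, ord() on the first character should report 252, the same as for the uncompressed file.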

Update: As suggested by Corion, I'm now using PerlIO::gzip. My original code, not the test example shown here, now is:

my $encoding = ":utf8";
if ( $filename =~ /\.gz$/ ) {
    $encoding = ":gzip$encoding";
}
open $in, "<$encoding", $filename
    or die "Can't read $filename: $!\n";
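The layer selection could be factored out roughly like this. It is a sketch only: layers_for and open_utf8 are hypothetical names, PerlIO::gzip is assumed to be installed whenever a .gz file is actually opened, and I swapped :utf8 for the stricter :encoding(UTF-8), which validates the input and which I assume stacks on :gzip the same way.

```perl
use strict;
use warnings;

# Hypothetical helper: build the PerlIO layer string for reading $filename.
# Layers are pushed left to right, so on read :gzip sits closer to the
# file and decompresses before the encoding layer decodes.
sub layers_for {
    my ($filename) = @_;
    my $layers = ":encoding(UTF-8)";
    $layers = ":gzip$layers" if $filename =~ /\.gz\z/;
    return $layers;
}

# Hypothetical helper: open a plain or gzipped UTF-8 file for reading.
sub open_utf8 {
    my ($filename) = @_;
    my $layers = layers_for($filename);
    # Load PerlIO::gzip only when needed (assumed to be installed).
    require PerlIO::gzip if $layers =~ /^:gzip/;
    open my $in, "<$layers", $filename
        or die "Can't read $filename: $!\n";
    return $in;
}

# Show which layers would be used for each kind of file name.
print layers_for("umlaut"),    "\n";
print layers_for("umlaut.gz"), "\n";
```

Calling open_utf8("umlaut.gz") would then return a handle whose lines are already decoded, so no Encode::decode is needed afterwards.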
