PerlMonks  

Issue with reading IO::Uncompress::Gunzip and UTF-8

by Skeeve (Parson)
on Aug 25, 2020 at 10:24 UTC (#11121073=perlquestion)

Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

I have an issue with reading a gzipped UTF-8 encoded file.

Here is an example:

Preparation: put an umlaut into a file and gzip it, keeping the uncompressed original:

echo ü > umlaut
gzip -k umlaut

Now check the difference :(

perl -e '
  use IO::Uncompress::Gunzip;
  binmode(STDOUT, ":utf8");
  open my $in, "<:utf8", "umlaut";
  $_=<$in>;
  print "Uncompressed: $_ ",ord($_),"\n";
  my $gin= IO::Uncompress::Gunzip->new("umlaut.gz");
  binmode($gin, ":utf8");
  $_=<$gin>;
  print " Compressed: $_ ",ord($_),"\n";
'

Output

Uncompressed: ü
 252
  Compressed: Ã¼
 195

In theory there shouldn't be a difference between the outputs :(
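
The 195 is not arbitrary: it is the first of the two bytes that encode ü in UTF-8, which suggests the gzipped data is being read as raw bytes. A minimal sketch using only the core Encode module (the variable names are illustrative and not part of the original test):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "ü" as a single character (U+00FC, code point 252).
my $char = "\x{FC}";

# The same character as stored on disk in UTF-8: two bytes, 0xC3 0xBC.
my $bytes = encode("UTF-8", $char);

printf "character: length %d, ord %d\n", length($char),  ord($char);
printf "raw bytes: length %d, ord %d\n", length($bytes), ord($bytes);

# Decoding the bytes gives back the single character.
my $round = decode("UTF-8", $bytes);
printf "decoded:   length %d, ord %d\n", length($round), ord($round);
```

The raw-byte line reports length 2 and ord 195, matching the "Compressed" output above, while the decoded line reports length 1 and ord 252.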

Update: I learned that "binmode" won't do anything to the IO::Uncompress::Gunzip filehandle.

Handling the decode myself, not relying on an IO-layer, gives the expected result:

perl -e '
  use IO::Uncompress::Gunzip;
  use Encode;
  binmode(STDOUT, ":utf8");
  my $gin= IO::Uncompress::Gunzip->new("umlaut.gz");
  $_=<$gin>;
  $_ = Encode::decode("UTF-8", $_);
  print " Compressed: $_ ",ord($_),"\n";
'

Update: As suggested by Corion, I'm now using PerlIO::gzip. My original code, not the test example shown here, now is:

my $encoding = ":utf8";
if ( $filename =~ /\.gz$/ ) {
    $encoding = ":gzip$encoding";
}
open $in, "<$encoding", $filename or die "Can't read $filename: $!\n";
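
One possible refinement, sketched under assumptions: the :utf8 layer marks the input as UTF-8 without validating it, whereas :encoding(UTF-8) checks it, and that is also the layer used in Corion's reply. The helper name layers_for is hypothetical, and opening a file with the :gzip layer still requires PerlIO::gzip to be installed and loaded.

```perl
use strict;
use warnings;

# Hypothetical helper: build the open() mode for a plain or gzipped UTF-8 file.
sub layers_for {
    my ($filename) = @_;
    my $layers = ":encoding(UTF-8)";
    # The leftmost layer sits closest to the file, so the data is
    # gunzipped first and only then decoded.
    $layers = ":gzip$layers" if $filename =~ /\.gz\z/;
    return "<$layers";
}

print layers_for("umlaut"),    "\n";
print layers_for("umlaut.gz"), "\n";
```

The returned string would take the place of "<$encoding" in the open call above.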

s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Replies are listed 'Best First'.
Re: Issue with reading IO::Uncompress::Gunzip and UTF-8
by Corion (Patriarch) on Aug 25, 2020 at 11:30 UTC

    I've had success doing the same thing but with PerlIO::gzip and stacking the filters appropriately:

    open my $fh, '<:gzip:encoding(UTF-8)'; ...

    Maybe that helps you more in your pipeline.

      Thanks Corion, that really helped a lot. It was also suggested to me by ilmari on IRC.

      Will update my question now.


Re: Issue with reading IO::Uncompress::Gunzip and UTF-8
by hippo (Bishop) on Aug 25, 2020 at 11:01 UTC
    binmode($gin, ":utf8");

    That won't achieve anything. The docs say:

    This is a noop provided for completeness.

    Decoding is therefore left to you.


    🦛

Re: Issue with reading IO::Uncompress::Gunzip and UTF-8
by jeffenstein (Friar) on Aug 25, 2020 at 11:03 UTC

    Gzip doesn't seem to store the encoding information, so you'll have to decode() the result.

    perl -e '
      use IO::Uncompress::Gunzip;
      use Encode;
      binmode(STDOUT, ":utf8");
      open my $in, "<:utf8", "umlaut";
      $_=<$in>;
      print "Uncompressed: $_ ",ord($_),"\n";
      my $gin= IO::Uncompress::Gunzip->new("umlaut.gz");
      binmode($gin, ":utf8");
      $_=<$gin>;
      print " Compressed: $_ ",ord($_),"\n";
      $_ = decode("utf-8", $_);
      print " Decoded: $_ ",ord($_),"\n";
    '

    Output:

    Uncompressed: ü
     252
     Compressed: Ã¼
     195
     Decoded: ü
     252
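
For files with more than one line, the same manual decoding can be applied inside a read loop. The following is a rough sketch rather than a definitive implementation: it uses only core modules and builds its gzipped sample in memory, so the sample text and variable names are assumptions standing in for umlaut.gz.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Build a small gzipped UTF-8 sample in memory (stands in for umlaut.gz).
my $text = "\x{FC}ber\nstra\x{DF}e\n";
my $raw  = encode("UTF-8", $text);
my $gz;
gzip(\$raw => \$gz) or die "gzip failed: $GzipError\n";

# Read it back line by line; getline() returns undecoded bytes.
my $gin = IO::Uncompress::Gunzip->new(\$gz)
    or die "gunzip failed: $GunzipError\n";

my @lines;
while (defined(my $line = $gin->getline)) {
    # Decode each line ourselves, since binmode() on $gin does nothing.
    push @lines, decode("UTF-8", $line);
}
$gin->close;

binmode(STDOUT, ":encoding(UTF-8)");
print for @lines;
printf "first code point: %d\n", ord($lines[0]);
```

Splitting on newlines before decoding is safe for UTF-8 because the byte 0x0A never occurs inside a multi-byte sequence.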

Node Type: perlquestion [id://11121073]
Approved by kcott
Front-paged by haukex