Skeeve has asked for the wisdom of the Perl Monks concerning the following question:
I have an issue with reading a gzipped UTF-8 encoded file.
Here is an example:
Preparation: put an umlaut into a file and gzip it, keeping the uncompressed file as well:
echo ü > umlaut
gzip -k umlaut
Now check the difference :(
perl -e '
  use IO::Uncompress::Gunzip;
  binmode(STDOUT, ":utf8");
  open my $in, "<:utf8", "umlaut";
  $_ = <$in>;
  print "Uncompressed: $_ ", ord($_), "\n";
  my $gin = IO::Uncompress::Gunzip->new("umlaut.gz");
  binmode($gin, ":utf8");
  $_ = <$gin>;
  print " Compressed: $_ ", ord($_), "\n";
'
Output:
Uncompressed: ü
 252
 Compressed: ü
 195
In theory there shouldn't be a difference between the outputs :(
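The 195 is what you would expect if the line came back as undecoded bytes: UTF-8 stores ü (U+00FC) as the two bytes 0xC3 0xBC, and ord() on an undecoded string reports the first byte. A minimal sketch of that arithmetic, using only the core Encode module and a hard-coded byte string instead of the files above:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes UTF-8 uses for U+00FC (ü).
my $bytes = "\xC3\xBC";

# Undecoded: Perl sees two one-byte characters; ord() reports the first.
my $raw_ord = ord($bytes);

# Decoded: a single character, U+00FC.
my $chars       = decode("UTF-8", $bytes);
my $decoded_ord = ord($chars);

printf "raw:     ord=%d length=%d\n", $raw_ord,     length($bytes);
printf "decoded: ord=%d length=%d\n", $decoded_ord, length($chars);
```

This prints ord=195 with length 2 for the raw bytes and ord=252 with length 1 after decoding, which matches the two numbers in the output above.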
Update: I learned that "binmode" won't do anything to the IO::Uncompress::Gunzip filehandle.
Handling the decode myself, not relying on an IO-layer, gives the expected result:
perl -e '
  use IO::Uncompress::Gunzip;
  use Encode;
  binmode(STDOUT, ":utf8");
  my $gin = IO::Uncompress::Gunzip->new("umlaut.gz");
  $_ = <$gin>;
  $_ = Encode::decode("UTF-8", $_);
  print " Compressed: $_ ", ord($_), "\n";
'
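For anyone who wants to try the manual decode without creating files, here is a sketch that does the whole round trip in memory with core modules only. The helper name read_gz_utf8_lines is made up for this example; it is not part of IO::Uncompress::Gunzip. Decoding line by line should be safe because the newline byte never occurs inside a multi-byte UTF-8 sequence.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: read every line from gzipped UTF-8 data and decode
# each line by hand, because the Gunzip handle hands back raw bytes.
sub read_gz_utf8_lines {
    my ($source) = @_;    # a file name or a reference to a scalar buffer
    my $gin = IO::Uncompress::Gunzip->new($source)
        or die "gunzip failed: $GunzipError\n";
    my @lines;
    while ( defined( my $line = $gin->getline ) ) {
        push @lines, decode( "UTF-8", $line );
    }
    $gin->close;
    return @lines;
}

# Build the test data in memory instead of reading umlaut.gz.
my $plain = encode( "UTF-8", "\x{FC}\n" );
my $gz;
gzip( \$plain => \$gz ) or die "gzip failed: $GzipError\n";

my @lines = read_gz_utf8_lines( \$gz );

binmode( STDOUT, ":utf8" );
print "Compressed: $lines[0]";
print ord( $lines[0] ), "\n";
```

Passing "umlaut.gz" instead of the scalar reference should read the file from the example above in the same way.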
Update: As suggested by Corion, I'm now using PerlIO::gzip. My original code (not the test example shown here) now is:
my $encoding = ":utf8";
if ( $filename =~ /\.gz$/ ) {
    $encoding = ":gzip$encoding";
}
open $in, "<$encoding", $filename
    or die "Can't read $filename: $!\n";
s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Replies are listed 'Best First'.
Re: issue with readin IO::Uncompress:Gunzip and utf-8
by Corion (Patriarch) on Aug 25, 2020 at 11:30 UTC
by Skeeve (Parson) on Aug 25, 2020 at 12:25 UTC
Re: issue with readin IO::Uncompress:Gunzip and utf-8
by jeffenstein (Hermit) on Aug 25, 2020 at 11:03 UTC
Re: issue with reading IO::Uncompress:Gunzip and utf-8
by hippo (Bishop) on Aug 25, 2020 at 11:01 UTC