
Re: Lost in compressed encodings

by Corion (Pope)
on Apr 06, 2020 at 08:41 UTC

in reply to Lost in compressed encodings

The order of decompressing and decoding matters. You want to first uncompress and then decode. If you want to cheat, you can use PerlIO::gzip:

    use PerlIO::gzip;    # provides the :gzip layer

    my $open_mode = '<:raw';
    if ($filename =~ /\.gz$/) {
        $open_mode .= ':gzip';    # uncompress first ...
    }
    $open_mode .= ':utf8';        # ... then decode
    open my $in, $open_mode, $filename
        or die "Can't read $filename: $!\n";
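To make the ordering concrete, here is a minimal sketch that uses only core modules and an in-memory sample instead of a real file. The sample string and the variable names are made up for illustration. Writing encodes first and compresses second, so reading has to undo those steps in reverse.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Illustrative sample: 11 characters, two of them non-ASCII.
my $text  = "caf\x{e9}\tna\x{ef}ve\n";
my $bytes = encode('UTF-8', $text);      # writing, step 1: encode

# Writing, step 2: compress the encoded bytes.
gzip(\$bytes => \my $gz) or die "gzip failed: $GzipError";

# Reading reverses the steps: uncompress first, then decode.
gunzip(\$gz => \my $raw) or die "gunzip failed: $GunzipError";
my $decoded = decode('UTF-8', $raw);

print length($raw), " bytes, ", length($decoded), " characters\n";
# prints "13 bytes, 11 characters"
```

Decoding the compressed stream itself would corrupt it, because gzip output is arbitrary binary data rather than UTF-8 text.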

If you want to stay with IO::Uncompress::Gunzip, I think the following should work, but I don't know if ->binmode() also applies other encodings properly:

    my $in;
    if ($filename =~ /\.gz$/) {
        $in = IO::Uncompress::Gunzip->new($filename, { AutoClose => 1 })
            or die "Can't gunzip $filename: $IO::Uncompress::Gunzip::GunzipError\n";
    } else {
        open $in, '<:raw', $filename
            or die "Can't read $filename: $!\n";
    }
    binmode $in, ':utf8';

Replies are listed 'Best First'.
Re^2: Lost in compressed encodings
by Skeeve (Parson) on Apr 06, 2020 at 08:53 UTC

    Thanks Corion. I already had the feeling that the sequence somehow is the issue.

    Unfortunately providing binmode after IO::Uncompress did not help.

    My changed code:

        open my $in, '<:raw', $filename
            or die "Can't read $filename: $!\n";
        if ($filename =~ /\.gz$/) {
            $in = IO::Uncompress::Gunzip->new($in, { AutoClose => 1 });
        }
        binmode $in, ':utf8';

    It still works with uncompressed and not with compressed data.

    Seems I will have to manually decode each line…


      I think that IO::Uncompress::Gunzip only understands ->binmode() and not ->binmode(':utf8'). The documentation (now that I read it ...) even says:

      This is a noop provided for completeness.
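A small sketch to check this, using an in-memory gzip stream built with the core IO::Compress::Gzip module. The sample data is made up for illustration. If binmode really is a no-op on the IO::Uncompress::Gunzip handle, the line read afterwards is still a byte string, so its length is the byte count rather than the character count.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# "Zürich" encoded as UTF-8: 6 characters, 7 bytes, plus a newline.
my $utf8_bytes = "Z\xc3\xbcrich\n";
gzip(\$utf8_bytes => \my $gz) or die "gzip failed: $GzipError";

my $in = IO::Uncompress::Gunzip->new(\$gz)
    or die "gunzip failed: $GunzipError";
binmode $in, ':utf8';    # accepted without complaint, but documented as a no-op

my $line = <$in>;
chomp $line;

# 7 means the data is still undecoded bytes; 6 would mean it was decoded.
print length($line), "\n";
```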

      If you are able to install PerlIO::gzip, that one should work with stacking other decoding mechanisms on top of it.

      If you have a gzip binary available, you can use that to decompress:

          my $in;
          if ($filename =~ /\.gz$/) {
              # List form of a piped open, so the shell never sees $filename
              open $in, '-|', 'gzip', '-cd', $filename
                  or die "Can't read from gzip $filename: $!/$?";
          } else {
              open $in, '<:raw', $filename
                  or die "Can't read $filename: $!\n";
          }
          binmode $in, ':utf8';

        Thanks for all your suggestions, Corion.

        As my use case is a module that reads a (kind of) CSV file, there are just two places where I read a line from the file.

        So I decided to do the "manual" decoding:

            use Encode qw(decode);

            open my $in, '<:raw', $filename
                or die "Can't read $filename: $!\n";
            if ($filename =~ /\.gz$/) {
                $in = IO::Uncompress::Gunzip->new($in, { AutoClose => 1 });
            }

            # later on
            $_ = <$in>;
            chomp;
            @headers = split /\t/, lc decode('UTF-8' => $_);
            # That's not really required as the header will always be
            # ASCII characters… But for completeness' sake…

            # further down I have a loop
            while (<$in>) {
                chomp;
                # …
                @line{@headers} = split /\t/, decode('UTF-8' => $_);
            }
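Since the line is read in two places, the decoding step could be pulled into one small helper. The following is only a sketch: next_decoded_line is a hypothetical name, and the tab-separated sample is built in memory so the example runs without a file.

```perl
use strict;
use warnings;
use Encode qw(decode);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: read one raw line, strip the newline, decode it.
# Returns undef at end of input.
sub next_decoded_line {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    return decode('UTF-8', $line);
}

# Illustrative sample: a header line and one row, as UTF-8 bytes.
my $tsv = "Name\tCity\nZo\xc3\xab\tK\xc3\xb6ln\n";
gzip(\$tsv => \my $gz) or die "gzip failed: $GzipError";

my $in = IO::Uncompress::Gunzip->new(\$gz)
    or die "gunzip failed: $GunzipError";

my @headers = split /\t/, lc(next_decoded_line($in));
my @rows;
while (defined(my $line = next_decoded_line($in))) {
    my %row;
    @row{@headers} = split /\t/, $line;
    push @rows, \%row;
}

printf "%d row(s), name has %d characters\n",
    scalar(@rows), length($rows[0]{name});
```

The helper works the same way on a plain '<:raw' handle, so compressed and uncompressed input share one code path.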

