http://qs321.pair.com?node_id=1120759

Sec has asked for the wisdom of the Perl Monks concerning the following question:

I'm at a loss. Whenever I try to handle Unicode/UTF-8 in Perl, I hit a wall trying to do it in a sane way.

Please tell me that I'm missing something here.

My goals are:

Text read from stdin, written to stdout and arguments on the commandline should respect the current user locale.

Source code is in a fixed encoding (usually UTF-8).

Files and pipes should be in the encoding I specify.

My example script:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use open qw(:std :locale);

open (my $in,"-|:encoding(utf8)","echo \xc3\xb6") || die ;
my $line=<$in>;
chomp($line);

print "I read a line, that is ",length($line)," chars long.\n";
print "That line is: ",$line,"\n";

$line =~ s/ö/o/;
print "That line in ascii is: $line\n";
Let's run it:
karoshi:~>LC_CTYPE=de_DE.UTF-8 ./u8demo.pl
I read a line, that is 1 chars long.
That line is: ö
That line in ascii is: o
karoshi:~>LC_CTYPE=C ./u8demo.pl
ascii "\xC3" does not map to Unicode at ./u8demo.pl line 12.
ascii "\xB6" does not map to Unicode at ./u8demo.pl line 12.
I read a line, that is 8 chars long.
That line is: \xC3\xB6
That line in ascii is: \xC3\xB6
The second case fails horribly. I have no idea why. If I comment the "use open" line, it (of course) fails printing the umlauts on any utf-8 terminal
karoshi:~>./u8demo.pl I read a line, that is 1 chars long. That line is: &#65533; That line in ascii is: o
Is there a way to get perl to "do the right thing"?

Replies are listed 'Best First'.
Re: How to sanely handle unicode in perl?
by choroba (Cardinal) on Mar 20, 2015 at 17:24 UTC
    Switching to binmode and manually picking the encoding from %ENV seems to work for me:
    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $in, '-|:encoding(utf8)', "echo \xc3\xb6" or die $!;

    my $enc = $ENV{LC_ALL};
    $enc =~ s/.*\.//; # TODO: en_US with no encoding not handled.
    binmode STDOUT, ":encoding($enc)";

    my $line = <$in>;
    chomp $line;

    print "I read a line, that is ", length $line, " chars long.\n";
    print "That line is: $line\n";

    $line =~ s/\x{f6}/o/;
    print "That line in ascii is: $line\n";
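The TODO in the script above (a locale such as en_US with no encoding suffix) can be sidestepped by asking the C library for the codeset instead of parsing LC_ALL. This is only a sketch, not choroba's code: `locale_encoding_name` and the UTF-8 fallback are hypothetical choices made for illustration, and it assumes a platform where POSIX and I18N::Langinfo are available. If they are not, it falls back to the environment variables, including the LC_CTYPE used in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper (not from the thread): return the name of an encoding
# that Encode knows for the current locale, or $fallback if none is found.
sub locale_encoding_name {
    my ($fallback) = @_;
    my $codeset;

    # First ask the C library which codeset the locale uses.
    eval {
        require POSIX;
        require I18N::Langinfo;
        POSIX::setlocale(POSIX::LC_CTYPE(), '');
        $codeset = I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
        1;
    };

    # Otherwise parse the environment; the first variable that is set wins.
    if (!defined $codeset || $codeset eq '') {
        for my $var (qw(LC_ALL LC_CTYPE LANG)) {
            next unless defined $ENV{$var} && length $ENV{$var};
            ($codeset) = $ENV{$var} =~ /\.([^.@]+)/;
            last;
        }
    }

    # Only return names that Encode can actually resolve.
    my $enc = defined $codeset ? find_encoding($codeset) : undef;
    return $enc ? $enc->name : $fallback;
}

my $enc_name = locale_encoding_name('UTF-8');
binmode STDOUT, ":encoding($enc_name)" or die "Cannot set STDOUT encoding: $!";
print "Output encoding: $enc_name\n";
```

Falling back to UTF-8 when nothing can be determined is a design choice of this sketch. A script that must honour the locale strictly could die at that point instead.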

    Update

    It seems all that's needed in your original code is to remove the local encoding from the input:
    open my $in, '-|:raw:encoding(utf8)', "echo \xc3\xb6" or die $!;
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Thanks. With ":raw" prepended it works just like it should.
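To compare the two open modes without changing the terminal locale, here is a small sketch. It is not the original u8demo.pl. It assumes that `use open IN => ':encoding(ascii)'` is a fair stand-in for what `:locale` selects under LC_CTYPE=C, and it replaces the shell `echo` with a child perl (`$^X`) so that no shell quoting is involved. `%chars` is a name made up for the demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ASSUMPTION: stand-in for the input default that ':locale' would pick
# under LC_CTYPE=C. Lexically scoped; applies to handles opened for reading.
use open IN => ':encoding(ascii)';

binmode STDOUT, ':encoding(UTF-8)';

# The child prints the two UTF-8 bytes of "ö" followed by a newline.
my @cmd = ($^X, '-e', 'print "\xc3\xb6\n"');

my %chars;    # open mode => number of characters read
for my $mode ('-|:encoding(UTF-8)', '-|:raw:encoding(UTF-8)') {
    open(my $in, $mode, @cmd) or die "Cannot start child: $!";
    my $line = <$in> // '';
    close $in;
    chomp $line;
    $chars{$mode} = length $line;
    print "$mode: $chars{$mode} chars\n";
}
```

The first mode corresponds to the failing open in the question, the second to the fix. If both report the same length on your perl, the default layer is probably not being applied to pipe opens there.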

      Fantastic!

Re: How to sanely handle unicode in perl?
by choroba (Cardinal) on Mar 20, 2015 at 16:26 UTC
    If "the right thing" is to accept UTF-8 whatever the locale is, don't count on locale. Modify line 6 to
    use open 'utf8', ':std';
      That does not work. It has the same effect as commenting out the "use open" line. Try it in an UTF-8 terminal with LC_CTYPE set properly to utf-8. The print to STDOUT corrupts the character.
      karoshi:~>LC_CTYPE=de_DE.UTF-8 ./u8demo.pl I read a line, that is 1 chars long. That line is: &#65533; That line in ascii is: o

        You took the locale bit out, right? Both examples from choroba and me/tchrist work in my UTF-8 terminal.

        moo@cow~>echo $LC_CTYPE
        utf-8
        moo@cow~>perl ~/pm-eg
        I read a line, that is 1 chars long.
        That line is: ö
        That line in ascii is: o
        

        Update: this works the same for me as well: setenv LC_CTYPE de_DE.UTF-8

Re: How to sanely handle unicode in perl?
by Your Mother (Archbishop) on Mar 20, 2015 at 16:39 UTC

    There is a very deep discussion by tchrist on UTF-8 and Unicode issues in Perl. Among the exhaustive details, it includes many good recommendations and suggested defaults, including a suggestion similar to choroba's:

    use open qw( :encoding(UTF-8) :std );
      This does not solve my problem either. I want Perl to respect the locale of the user calling the script.

      If I use your "open" statement and run the script in an ISO-8859-1 terminal, I get the following:

      karoshi:~>LC_CTYPE=de_DE.ISO-8859-1 ./u8demo.pl
      I read a line, that is 1 chars long.
      That line is: ö
      That line in ascii is: o
      which is clearly incorrect.

        See point 14 in Assume Brokeness of the link I gave — “Code that assumes Unicode gives a fig about POSIX locales is broken.”

Re: How to sanely handle unicode in perl?
by Anonymous Monk on Mar 21, 2015 at 06:17 UTC
    The second case fails horribly. I have no idea why.
    FYI, "ö" is not "o" in ASCII. "ö" doesn't exist in ASCII. \xC3 and \xB6 don't exist (have no meaning) in ASCII either. When you specify "LC_CTYPE=C", Perl doesn't know what the bytes \xC3 and \xB6 are supposed to be (that's why it complains that it cannot map them to Unicode).
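The "8 chars long" in the failing run can be reproduced with Encode directly, outside of any I/O layer. A minimal sketch, assuming the `:encoding(...)` layer falls back to Perl-style `\xHH` escapes for bytes it cannot map (which is what the output in the question suggests). The variable names are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xc3\xb6";    # the two bytes that echo sends for "ö"

# Decoded as UTF-8, the two bytes form a single character, U+00F6.
my $as_utf8 = decode('UTF-8', $bytes);

# ASCII has no meaning for bytes above 0x7F. With the PERLQQ fallback,
# each unmappable byte is replaced by a four-character \xHH escape.
my $as_ascii = decode('ascii', $bytes, Encode::FB_PERLQQ());

printf "UTF-8: %d char(s), U+%04X\n", length($as_utf8), ord($as_utf8);
printf "ASCII: %d char(s), %s\n", length($as_ascii), $as_ascii;
```

Two bytes times four characters per escape gives the 8 characters reported under LC_CTYPE=C, and since the string no longer contains "ö", the substitution has nothing to match.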

    If that's any consolation, it's IMPOSSIBLE to sanely mix single-byte encodings (256 symbols max) and Unicode (more than 1,000,000 code points). Perl itself is a great example of this (but that has already been discussed to death on this forum...)

      oh, btw... the least painful way to handle this is to ask the user about preferred encoding. Use utf-8 by default, but let your program accept a command line option to change encoding, something like ./u8demo.pl -encoding=latin-1 ...
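A command line switch like the one suggested above could be sketched with Getopt::Long. This is an illustration only: `output_encoding_from_args` is a hypothetical helper, and the UTF-8 default follows the suggestion in the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(find_encoding);
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: read an optional -encoding=NAME switch from the
# given argument list and return the canonical Encode name for it.
sub output_encoding_from_args {
    my ($args) = @_;
    my $requested = 'UTF-8';    # default when no switch is given
    GetOptionsFromArray($args, 'encoding=s' => \$requested)
        or die "Usage: $0 [-encoding=NAME]\n";
    my $enc = find_encoding($requested)
        or die "Unknown encoding '$requested'\n";
    return $enc->name;
}

my $out_enc = output_encoding_from_args(\@ARGV);
binmode STDOUT, ":encoding($out_enc)" or die "Cannot set STDOUT encoding: $!";
print "\x{f6}\n";    # "ö", written in the requested encoding
```

Run as `./u8demo.pl -encoding=latin-1`, this would write the single byte 0xF6. Without the switch it writes the two UTF-8 bytes.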
      If you check the source I posted, the open specifies ":encoding(utf8)". And with that \xC3\xB6 does exist and is valid. So I don't really understand what you are talking about.
        I'm talking about locale (from use open qw(:std :locale)). encoding doesn't override locale (maybe it should? but it doesn't. They basically stack). Note that using :raw simply removes the locale layer (like removing use open ... entirely, because by default Perl ignores locales... for the most part).
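The stacking described above can be made visible with `PerlIO::get_layers`. This sketch does not use `use open` or a pipe: it pushes the layers by hand with `binmode` on a scratch file, with `:encoding(ascii)` assumed to play the role of the locale layer under LC_CTYPE=C. `encoding_layers` is a made-up helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the encoding layers currently on a handle, bottom first.
sub encoding_layers {
    my ($fh) = @_;
    return grep { /^encoding/ } PerlIO::get_layers($fh);
}

# Write the two UTF-8 bytes of "ö" to a scratch file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xc3\xb6\n";
close $tmp or die "Cannot close $path: $!";

open(my $fh, '<', $path) or die "Cannot open $path: $!";
binmode $fh, ':encoding(ascii)';    # ASSUMPTION: stands in for the locale layer
binmode $fh, ':encoding(UTF-8)';    # pushed on top; does not replace it
my @stacked = encoding_layers($fh);

binmode $fh, ':raw:encoding(UTF-8)';    # :raw pops both, then one is pushed
my @clean = encoding_layers($fh);

my $line = <$fh>;
chomp $line;
my $chars = length $line;

print "stacked: @stacked\n";
print "clean:   @clean\n";
print "chars read after :raw: $chars\n";
```

With two encoding layers on the handle, the bytes pass through the ASCII decoder before the UTF-8 one ever sees them, which would explain why prepending `:raw` fixed the script in the question.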

Re: How to sanely handle unicode in perl?
by Khen1950fx (Canon) on Mar 21, 2015 at 11:45 UTC
    Here's a simpler version of choroba's idea:
    #!/usr/bin/perl -l

    use strict;
    use warnings;

    open my $in, "-|:encoding(UTF-8)", "echo \xc3\xb6" or die $!;

    my $line = <$in>;
    chomp($line);

    open STDOUT, ">-" or die $!;
    binmode STDOUT, ":encoding(UTF-8)";

    print STDOUT "I read a line, that is ", length($line), " chars long.\n";
    print STDOUT "That line in ascii is: $line";

    close($in);
    close(STDOUT);
    exit 0;
    Updated: Fixed mistake at line 11. Thanks, choroba!
      Line 11 makes no sense. When you add the failure handling, you'll know why:
      open STDOUT, ":encoding(UTF-8)" or die $!;
        Thanks for noticing. Of course, it makes no sense. My editor went a wee bit wonky on me. My bad. That's what I get for not looking at it again afterwards :).