How to sanely handle unicode in perl?

Sec has asked for the wisdom of the Perl Monks concerning the following question:

I'm at a loss. Whenever I try to handle Unicode/UTF-8 in Perl, I hit a wall trying to do it in a sane way.

Please tell me that I'm missing something here.

My goals are:

Text read from stdin, written to stdout, and arguments on the command line should respect the current user locale.

Source code is in a fixed format (usually utf8).

Files/pipes should be in the format I specify.

My example script:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use open qw(:std :locale);

open (my $in,"-|:encoding(utf8)","echo \xc3\xb6") || die ;

my $line=<$in>;

chomp($line);

print "I read a line, that is ",length($line)," chars long.\n";

print "That line is: ",$line,"\n";
$line =~ s/ö/o/;
print "That line in ascii is: $line\n";

Let's run it:

karoshi:~>LC_CTYPE=de_DE.UTF-8 ./u8demo.pl
I read a line, that is 1 chars long.
That line is: ö
That line in ascii is: o
karoshi:~>LC_CTYPE=C ./u8demo.pl          
ascii "\xC3" does not map to Unicode at ./u8demo.pl line 12.
ascii "\xB6" does not map to Unicode at ./u8demo.pl line 12.
I read a line, that is 8 chars long.
That line is: \xC3\xB6
That line in ascii is: \xC3\xB6

The second case fails horribly. I have no idea why. If I comment out the "use open" line, it (of course) fails to print the umlauts on any UTF-8 terminal:

karoshi:~>./u8demo.pl
I read a line, that is 1 chars long.
That line is: �
That line in ascii is: o

Is there a way to get perl to "do the right thing"?


Replies are listed 'Best First'.
Re: How to sanely handle unicode in perl? by choroba (Cardinal) on Mar 20, 2015 at 17:24 UTC
Switching to binmode and manually picking the encoding from %ENV seems to work for me:

#!/usr/bin/perl
use strict;
use warnings;

open my $in, '-|:encoding(utf8)', "echo \xc3\xb6" or die $!;
my $enc = $ENV{LC_ALL};
$enc =~ s/.*\.//; # TODO: en_US with no encoding not handled.
binmode STDOUT, ":encoding($enc)";
my $line = <$in>;
chomp $line;
print "I read a line, that is ", length $line, " chars long.\n";
print "That line is: $line\n";
$line =~ s/\x{f6}/o/;
print "That line in ascii is: $line\n";

Update: It seems all that's needed in your original code is to remove the locale encoding from the input:

open my $in, '-|:raw:encoding(utf8)', "echo \xc3\xb6" or die $!;

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^2: How to sanely handle unicode in perl? by Sec (Monk) on Mar 23, 2015 at 10:19 UTC
Thanks. With ":raw" prepended it works just like it should. Fantastic!
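The `$ENV{LC_ALL}` approach above only looks at one variable, while the runs in the question set `LC_CTYPE`. A minimal sketch of a more forgiving lookup follows, assuming the usual precedence of LC_ALL over LC_CTYPE over LANG. The helper name `encoding_from_env` and the UTF-8 fallback are illustrative assumptions, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the codeset part of the first locale
# variable that is set, checked in the order LC_ALL, LC_CTYPE, LANG.
sub encoding_from_env {
    my ($env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $locale = $env->{$var};
        next unless defined $locale && length $locale;
        # "de_DE.UTF-8@euro" yields "UTF-8"; "en_US" or "C" has no codeset.
        return $locale =~ /\.([^@]+)/ ? $1 : undef;
    }
    return;
}

# Assumption: fall back to UTF-8 when the locale names no codeset.
my $enc = encoding_from_env(\%ENV) // 'UTF-8';
binmode STDOUT, ":encoding($enc)"
    or warn "Could not set output encoding '$enc'\n";
```

Parsing the variable by hand is only a sketch. Asking the C library through the core module I18N::Langinfo (`langinfo(CODESET)` after `setlocale(LC_CTYPE, "")`) may be more robust on systems where it is available.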
Re: How to sanely handle unicode in perl? by choroba (Cardinal) on Mar 20, 2015 at 16:26 UTC
If "the right thing" is to accept UTF-8 whatever the locale is, don't count on locale. Modify line 6 (the "use open" line) to `use open 'utf8', ':std';`

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^2: How to sanely handle unicode in perl? by Sec (Monk) on Mar 20, 2015 at 16:38 UTC
That does not work. It has the same effect as commenting out the "use open" line. Try it in a UTF-8 terminal with LC_CTYPE set properly to utf-8. The print to STDOUT corrupts the character.

karoshi:~>LC_CTYPE=de_DE.UTF-8 ./u8demo.pl
I read a line, that is 1 chars long.
That line is: �
That line in ascii is: o
Re^3: How to sanely handle unicode in perl? by Your Mother (Archbishop) on Mar 20, 2015 at 16:43 UTC
You took the locale bit out, right? Both examples from choroba and me/tchrist work in my UTF-8 terminal.

moo@cow~>echo $LC_CTYPE
utf-8
moo@cow~>perl ~/pm-eg
I read a line, that is 1 chars long.
That line is: ö
That line in ascii is: o

Update: this works the same for me as well: `setenv LC_CTYPE de_DE.UTF-8`
Re^4: How to sanely handle unicode in perl? by Sec (Monk) on Mar 20, 2015 at 17:12 UTC
Re: How to sanely handle unicode in perl? by Your Mother (Archbishop) on Mar 20, 2015 at 16:39 UTC
There is a very deep discussion in tchrist's post on UTF-8 and Unicode issues in Perl. Among the exhaustive details it includes many good recommendations and suggested defaults, including a suggestion similar to choroba's: `use open qw( :encoding(UTF-8) :std );`
Re^2: How to sanely handle unicode in perl? by Sec (Monk) on Mar 20, 2015 at 16:46 UTC
This does not solve my problem either. I want perl to respect the locale of the user calling the script. If I use your "open" statement and run the script in an ISO-8859-1 terminal, I get the following:

karoshi:~>LC_CTYPE=de_DE.ISO-8859-1 ./u8demo.pl
I read a line, that is 1 chars long.
That line is: Ã¶
That line in ascii is: o

which is clearly incorrect.
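The two-character output above is what UTF-8 bytes look like when a Latin-1 terminal interprets them. A small sketch with the core Encode module, using variable names chosen only for illustration, reproduces the effect without involving any terminal:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "\x{f6}";    # "ö" as a single Unicode character

# The same character serialised for two different terminals.
my $utf8_bytes   = encode('UTF-8',      $char);    # two bytes
my $latin1_bytes = encode('ISO-8859-1', $char);    # one byte

# What a Latin-1 terminal makes of the UTF-8 bytes: two characters.
my $seen_on_latin1 = decode('ISO-8859-1', $utf8_bytes);

printf "UTF-8: %d bytes, Latin-1: %d byte, misread as: %s\n",
    length($utf8_bytes), length($latin1_bytes),
    join ' ', map { sprintf 'U+%04X', ord } split //, $seen_on_latin1;
```

The output encoding therefore has to match what the terminal expects, which is why a hard-coded UTF-8 output layer cannot satisfy a Latin-1 user.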
Re^3: How to sanely handle unicode in perl? by Your Mother (Archbishop) on Mar 20, 2015 at 16:50 UTC
See point 14 under "Assume Brokenness" in the link I gave: “Code that assumes Unicode gives a fig about POSIX locales is broken.”
Re^4: How to sanely handle unicode in perl? by Sec (Monk) on Mar 20, 2015 at 16:56 UTC
Re^5: How to sanely handle unicode in perl? by Your Mother (Archbishop) on Mar 20, 2015 at 19:10 UTC
Some notes below your chosen depth have not been shown here
Re^5: How to sanely handle unicode in perl? by soonix (Canon) on Mar 21, 2015 at 22:34 UTC
Re: How to sanely handle unicode in perl? by Anonymous Monk on Mar 21, 2015 at 06:17 UTC
"The second case fails horribly. I have no idea why."

FYI, "ö" is not "o" in ASCII. "ö" doesn't exist in ASCII. `\xC3` and `\xB6` don't exist (have no meaning) in ASCII either. When you specify "LC_CTYPE=C", Perl doesn't know what the bytes \xC3 and \xB6 are supposed to be (that's why it complains that it cannot map them to Unicode).

If that's any consolation, it's IMPOSSIBLE to sanely mix single-byte encodings (256 symbols max) and Unicode (> 1,000,000 code points). Perl itself is a great example of this (but that has already been discussed to death on this forum...)
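The difference between the two interpretations of those bytes can be sketched with the core Encode module. The example assumes that the `:encoding` layer falls back to Perl-style escapes for undecodable bytes, which `Encode::FB_PERLQQ` imitates here. That assumption is consistent with the 8-character line in the failing run, but the layer's exact fallback setting is not confirmed by the thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xc3\xb6";    # the two bytes that `echo` emits in the example

# Decoded as UTF-8, the two bytes form a single character.
my $as_utf8 = decode('UTF-8', $bytes);

# Decoded as ASCII, neither byte is valid. FB_PERLQQ makes the decoder
# substitute a literal backslash-x escape for each undecodable byte.
my $as_ascii = decode('ascii', $bytes, Encode::FB_PERLQQ);

print "UTF-8: ", length($as_utf8), " char; ",
      "ASCII: ", length($as_ascii), " chars\n";
```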
Re^2: How to sanely handle unicode in perl? by Anonymous Monk on Mar 21, 2015 at 10:13 UTC
oh, btw... the least painful way to handle this is to ask the user about preferred encoding. Use utf-8 by default, but let your program accept a command line option to change encoding, something like `./u8demo.pl -encoding=latin-1 ...`
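One way such an option could be wired up is sketched below with the core modules Getopt::Long and Encode. The helper name `pick_encoding` and the usage message are hypothetical; only the `-encoding=...` option and the UTF-8 default come from the suggestion above.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);
use Encode qw(find_encoding);

# Hypothetical helper: read an -encoding option from an argument list,
# default to UTF-8, and return the name Encode knows the encoding by.
sub pick_encoding {
    my ($args) = @_;
    my $name = 'UTF-8';
    GetOptionsFromArray($args, 'encoding=s' => \$name)
        or die "usage: $0 [-encoding=NAME] ...\n";
    my $enc = find_encoding($name)
        or die "unknown encoding '$name'\n";
    return $enc->name;
}

# Apply the chosen encoding to both standard streams.
my $encoding = pick_encoding(\@ARGV);
binmode $_, ":encoding($encoding)" for \*STDIN, \*STDOUT;
```

Validating the name with `find_encoding` up front means a typo is reported once at startup rather than as a failed `binmode` later.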
Re^2: How to sanely handle unicode in perl? by Sec (Monk) on Mar 23, 2015 at 10:16 UTC
If you check the source I posted, the open specifies ":encoding(utf8)". And with that \xC3\xB6 does exist and is valid. So I don't really understand what you are talking about.
Re^3: How to sanely handle unicode in perl? by Anonymous Monk on Mar 24, 2015 at 00:06 UTC
I'm talking about `locale` (from `use open qw(:std :locale)`). `encoding` doesn't override `locale` (maybe it should? but it doesn't. They basically stack). Note that using `:raw` simply removes the `locale` layer (like removing `use open ...` entirely, because by default Perl ignores locales... for the most part).
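The stacking can be made visible with `PerlIO::get_layers`. The sketch below simulates it with two explicit `binmode` calls on a scratch file instead of relying on `use open` and a pipe, so it illustrates the mechanism rather than reproducing the original script. The helper `encoding_layers` is a hypothetical name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the two UTF-8 bytes for "ö" to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xc3\xb6\n";
close $out or die $!;

# Hypothetical helper: list the :encoding layers currently on a handle.
sub encoding_layers { grep { /^encoding\(/ } PerlIO::get_layers($_[0]) }

open my $in, '<', $path or die $!;
binmode $in, ':encoding(ascii)';    # stands in for the locale layer
binmode $in, ':encoding(UTF-8)';    # pushed on top, does not replace it
my $stacked = () = encoding_layers($in);

binmode $in, ':raw:encoding(UTF-8)';    # :raw pops them, then one is pushed
my $after_raw = () = encoding_layers($in);

chomp(my $line = <$in>);
print "encoding layers stacked: $stacked, after :raw: $after_raw, ",
      "line length: ", length($line), "\n";
```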
Re^4: How to sanely handle unicode in perl? by Anonymous Monk on Mar 24, 2015 at 00:24 UTC
Re: How to sanely handle unicode in perl? by Khen1950fx (Canon) on Mar 21, 2015 at 11:45 UTC
Here's a simpler version of choroba's idea:

#!/usr/bin/perl -l
use strict;
use warnings;

open my $in, "-|:encoding(UTF-8)", "echo \xc3\xb6" or die $!;
my $line = <$in>;
chomp($line);

open STDOUT, ">-" or die $!;
binmode STDOUT, ":encoding(UTF-8)";
print STDOUT "I read a line, that is ", length($line), " chars long.\n";
print STDOUT "That line in ascii is: $line";
close($in);
close(STDOUT);
exit 0;

Updated: Fixed mistake at line 11. Thanks, choroba!
Re^2: How to sanely handle unicode in perl? by choroba (Cardinal) on Mar 21, 2015 at 18:12 UTC
Line 11 makes no sense. When you add the failure handling, you'll know why: `open STDOUT, ":encoding(UTF-8)" or die $!;`

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^3: How to sanely handle unicode in perl? by Khen1950fx (Canon) on Mar 21, 2015 at 21:23 UTC
Thanks for noticing. Of course, it makes no sense. My editor went a wee bit wonky on me. My bad. That's what I get for not looking at it again afterwards :).
