why no default unicode?

perl-diddler has asked for the wisdom of the Perl Monks concerning the following question:

I'm often getting confused by unicode issues -- about when a file stream is in unicode vs. not, but the latest is in a terminal interactive prog where I tried to print a unicode character and got a 'wide-char' error.

Of course I can easily work-around the problem by adding:

binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

to the beginning of my program, but I'm not sure why it isn't *defaulting* to UTF-8.

I'm running from windows to linux using SecureCRT, which, in its session options, has its 'character encoding' set to UTF-8.

When I log in, if I type locale, I get:

LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

That **looks** like it's saying UTF-8 for a character encoding (this is a Suse11.2 system I'm logging into, BTW, from a Win7 (i.e. unicode supporting) system).

So why is perl *defaulting* to STDOUT being non-unicode?

Why do I need the binmode?

Sorry if this is unicode-first-grade, but this stuff looks like it should be so 'simple' -- yet *blech*. I've had other issues when operating on 'internet data' where I've experienced UTF nightmares, since you don't know the character encoding of the website's response until you look at the header. I mostly worked around that, until perl worked itself into a serious coredump about 3,500,000 statements / 70,000 data lines into the program. Some suggested I get to know "perl -d" ... *cough* ... I do, but I wasn't about to try tracking that down, so I just shelved the program to wait for a more reliable perl. (I did, FWIW, file a bug against Perl that has yet to be addressed, as far as I know.)

Any idea why perl isn't just 'doing the right thing' as it is so famous for doing? Thanks...


Replies are listed 'Best First'.
Re: why no default unicode? by moritz (Cardinal) on Mar 19, 2011 at 23:08 UTC
There are two good reasons.

The first is backwards compatibility. Perl tries very hard not to break old programs, and there are a lot of old programs that would be broken by such a change.

The second reason is that, as it is now, a program as simple as `while(<>) { print; }` just works, i.e. it prints out the same data it reads. If STDOUT defaulted to UTF-8, Perl would also need to default to UTF-8 for reading operations. And with that as the default, reading a non-UTF-8 file would suddenly either cause a fatal error or leave the data impossible to interpret correctly.

Perl 6 - second systems done right
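
The second reason can be illustrated with a small sketch. It is not part of the original reply: the sample bytes and the helper names `copy_stream` and `copy_with_layers` are made up for illustration, and in-memory filehandles stand in for STDIN and STDOUT. It copies Latin-1 data once with no layers and once with UTF-8 layers on both ends, then checks whether the copy still matches the input.

```perl
use strict;
use warnings;

# Copy one handle to another, like `while (<>) { print; }`.
sub copy_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} $line;
    }
    return;
}

# Hypothetical input: Latin-1 bytes. The lone byte 0xE9 is not valid UTF-8.
my $latin1_bytes = "caf\xE9 au lait\n";

# Run the copy with the given input/output layers and return what was written.
sub copy_with_layers {
    my ($in_layer, $out_layer) = @_;
    my $source = $latin1_bytes;
    my $sink   = '';
    open my $in,  "<$in_layer",  \$source or die "open in: $!";
    open my $out, ">$out_layer", \$sink   or die "open out: $!";
    local $SIG{__WARN__} = sub { };    # silence the decode warning we expect
    copy_stream($in, $out);
    close $out or die "close: $!";
    return $sink;
}

my $raw_copy  = copy_with_layers('', '');
my $utf8_copy = copy_with_layers(':encoding(UTF-8)', ':encoding(UTF-8)');

print "raw copy identical:   ", ($raw_copy  eq $latin1_bytes ? "yes" : "no"), "\n";
print "UTF-8 copy identical: ", ($utf8_copy eq $latin1_bytes ? "yes" : "no"), "\n";
```

With no layers the bytes pass through untouched; with UTF-8 layers the byte that is not valid UTF-8 has to be replaced or escaped on the way in, so the output can no longer match the input.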
Re^2: why no default unicode? by perl-diddler (Chaplain) on Mar 19, 2011 at 23:54 UTC
Ok, I get this...but I copied my 'UTF-8' screen output to a file (inspected it with hexdump -C) and it has the UTF-8 chars in it. When I ran it through your prog, it auto-defaulted to UTF-8!!!

So then I tried cut/paste directly into perl. Again, the same cut/paste I put into the above file. Ran the prog again. Same thing -- no complaint.

So now I have it outputting my 'utf-8' characters with no complaint, but when I try to do it via perl's unicode facility, it doesn't work.

Pure guess -- it's interpreting it as a byte stream, so byte in / byte out...perl thinks it's all 'bytes', but the term interprets the input and output as UTF-8 (the term is set up to pass UTF-8 chars through on input as well).

So basically, if I want to safely use unicode in perl, I need to pre-convert my unicode chars into utf-8 byte-strings, and output them as simple byte strings? ...(yup, that works)...

I guess I somehow thought that perl would now detect the terminal settings from the locale/environment setting and set the unicode-ness of STD(IOER) automatically. Is that something that would be a bad thing for perl to do?

I'm sure I'm missing some obvious point(s) somewhere...
Re^3: why no default unicode? by moritz (Cardinal) on Mar 20, 2011 at 07:24 UTC
> When I ran it through your prog, it auto-defaulted to UTF-8!!!

It did not, whatever you mean by that.

> Pure guess -- it's interpreting it as a byte stream, so byte in / byte out...perl thinks it's all 'bytes', but the term interprets the input and output as UTF-8 (the term is setup to pass UTF-8 chars through on input as well).

Exactly.

> I guess I somehow thought that perl would now detect the terminal settings from the local/environment setting and set the unicode-ness of STD(IOER) automatically. Is that something that would be a bad thing for perl to do?

As I wrote before, it would make it impossible to process binary data (or any non-UTF-8 data) out of the box. People want to do that, independently of whether they are in a UTF-8 console or not.
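
The "byte in / byte out" behaviour confirmed above can be sketched as follows. This is an illustration only, not code from the thread: the sample character (U+263A) and the variable names are assumptions, and an in-memory filehandle stands in for a terminal handle that has no encoding layer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three UTF-8 bytes for U+263A, as a UTF-8 terminal would send them.
my $bytes = "\xE2\x98\xBA";

# Without a decoding layer Perl sees three one-byte characters ...
my $byte_length = length $bytes;

# ... and only after decoding does it see a single character.
my $text        = decode('UTF-8', $bytes);
my $char_length = length $text;

# Encoding by hand gives back the original bytes, which are safe to print
# to a handle that has no encoding layer.
my $round_trip = encode('UTF-8', $text);

# Printing the decoded character to a layerless handle triggers the warning.
my @warnings;
my $sink = '';
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $out, '>', \$sink or die "open: $!";
    print {$out} $text;
    close $out or die "close: $!";
}

printf "bytes: %d, characters: %d, round trip ok: %s, warnings: %d\n",
    $byte_length, $char_length, ($round_trip eq $bytes ? 'yes' : 'no'),
    scalar @warnings;
```

The warning appears only in the last step, where a decoded character string meets a handle that expects bytes; the first three steps are what the `while(<>) { print; }` program effectively relies on.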
Re: why no default unicode? by BrowserUk (Patriarch) on Mar 19, 2011 at 23:16 UTC
If you add an environment variable: `set PERL5OPT=-CSD`, Perl will default STDIN/STDOUT/STDERR and all opens to UTF-8. Which may or may not be what you want. See perlrun -C & Perlrun Environment variables for details.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: why no default unicode? by perl-diddler (Chaplain) on Mar 20, 2011 at 00:00 UTC
Well, I'd prefer it to use UTF-8 for the term if the term is set up to use UTF-8 (i.e. as expressed by locale). For files, it should probably default to binary unless told otherwise.

Since all my terms are UTF-8, how about a way to tell it to use UTF-8 on 'tty' devices, but default to binary on files? Too much intelligence built into the startup code, probably, eh?...

Either that, OR..."auto-switch": if it detects a wide char on output, then convert to UTF-8 bytes... That would be the most helpful -- since it knows I'm trying to output a wide char, it should (IMO) try to do the best it can and assume a UTF-8 output device...

What would be the 'downsides' of that approach? (I.e. instead of the current approach of putting out a warning)...
Re^3: why no default unicode? by BrowserUk (Patriarch) on Mar 20, 2011 at 00:10 UTC
Read the linked documentation. If you don't want open to default to utf-8, then set `PERL5OPT='-CS'`.

As for auto-detecting: there is no way for perl (or any other language) to determine the difference between an input file containing utf-8 and an input file containing arbitrary binary. Indeed, there is no way to distinguish between utf-8 and utf-2 or utf-32 or arbitrary binary. In this respect the entire unicode standard is terminally broken.
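
A minimal sketch of the ambiguity described above, assuming a hypothetical two-byte input that arrives with no information about its encoding. It uses only `Encode::decode` with strict checking; the variable names are illustrative. Some byte sequences are not valid UTF-8 and can be ruled out, but a sequence that decodes cleanly as UTF-8 may still have been written as something else.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input bytes of unknown origin.
my $bytes = "\xC3\xA9";

# FB_CROAK dies on malformed input; LEAVE_SRC keeps $bytes unmodified.
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

# Both decodes succeed, so the bytes alone cannot settle the question.
my $as_utf8   = decode('UTF-8',      $bytes, $check);
my $as_latin1 = decode('ISO-8859-1', $bytes, $check);

for my $pair ([ 'UTF-8', $as_utf8 ], [ 'ISO-8859-1', $as_latin1 ]) {
    my ($name, $text) = @$pair;
    my @code_points = map { sprintf 'U+%04X', ord } split //, $text;
    printf "%-10s -> %d character(s): %s\n",
        $name, length $text, join ' ', @code_points;
}
```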
Re^3: why no default unicode? by moritz (Cardinal) on Mar 20, 2011 at 07:27 UTC
> Well, I'd prefer it to use UTF-8 for the Term if there term is setup to use UTF-8 (i.e. as expressed by local).

Then try `use open IO => ':locale';`
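
The "UTF-8 on tty devices, binary on files" idea raised earlier in this sub-thread could also be done by hand. The following is only a sketch under stated assumptions: `locale_is_utf8` is a hypothetical helper, it inspects just `LC_ALL`, `LC_CTYPE` and `LANG` (first non-empty one wins), and it guesses the charset from the variable's text rather than asking the C library.

```perl
use strict;
use warnings;

# Hypothetical helper: does the first non-empty locale variable name UTF-8?
sub locale_is_utf8 {
    my ($env) = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$name};
        next unless defined $value && length $value;
        return $value =~ /\.utf-?8\b/i ? 1 : 0;
    }
    return 0;
}

# Only touch handles attached to a terminal; files and pipes stay raw bytes.
if (locale_is_utf8(\%ENV)) {
    for my $handle (\*STDOUT, \*STDERR) {
        binmode $handle, ':encoding(UTF-8)' if -t $handle;
    }
}

print "locale says UTF-8: ", (locale_is_utf8(\%ENV) ? "yes" : "no"), "\n";
```

Redirecting the output to a file or a pipe leaves the handle without a layer, so binary data still passes through unchanged in that case.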
Re: why no default unicode? by repellent (Priest) on Mar 20, 2011 at 07:21 UTC
Here's the big secret about encodings: You need a priori knowledge of which encoding to use for each specific data stream.

That means you cannot effectively auto-detect which encoding to use simply by observing the data stream alone. An out-of-band message may be used to signal which encoding to use, as is typical with web browsers using the HTTP protocol.

You pointed out your locale settings and suggested that Perl make use of them. Having Perl set encodings for STDIN/STDOUT/STDERR based on locale would break any data stream that is not encoded as such. Not to mention, applying an encoding to many data streams based on global (locale) settings violates what I mentioned earlier.

By default, Perl assumes every stream has no encoding (1 character per byte) - it's safe (i.e. binary data won't break) and is a reasonable default (i.e. an otherwise implicit encoding based on locale is hard to see).
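
The out-of-band signalling mentioned above can be sketched like this. The header values and body bytes are made-up examples, and `charset_from_content_type` and `decode_body` are hypothetical helpers rather than functions from any HTTP module; only `Encode::find_encoding` and its `decode` method are real library calls.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: extract the charset parameter from a Content-Type value.
sub charset_from_content_type {
    my ($header) = @_;
    if (defined $header && $header =~ /;\s*charset\s*=\s*"?([^\s";]+)/i) {
        return $1;
    }
    return undef;
}

# Decode the body only when the header names an encoding Encode knows about;
# otherwise hand back the raw bytes untouched.
sub decode_body {
    my ($header, $body_bytes) = @_;
    my $charset  = charset_from_content_type($header);
    my $encoding = defined $charset ? find_encoding($charset) : undef;
    return $body_bytes unless $encoding;
    return $encoding->decode($body_bytes);
}

# The same text arriving in two different encodings, each labelled by its header.
my $from_latin1 = decode_body('text/html; charset=ISO-8859-1', "caf\xE9");
my $from_utf8   = decode_body('text/html; charset=UTF-8',      "caf\xC3\xA9");

print "same text after decoding: ", ($from_latin1 eq $from_utf8 ? "yes" : "no"), "\n";
```

The bytes differ on the wire, but because each stream carries its own label, both end up as the same character string.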
Re: why no default unicode? by Eliya (Vicar) on Mar 19, 2011 at 23:55 UTC
As an alternative to BrowserUk's suggestion, you can set the environment variable `PERL_UNICODE=""` (i.e. to the empty string), which is equivalent to the command line option `-C`, which is the same as `-CSDL`. The `L` in there enables the other features (`SD`) depending on your locale.
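
For readers comparing `-CS`, `-CSD` and `-CSDL`, here is a small sketch that adds up the flag values listed for `-C` in perlrun and prints them next to the `${^UNICODE}` variable of the running process. `unicode_mask` is a hypothetical helper written for this illustration; the table omits the debugging letter `a`.

```perl
use strict;
use warnings;

# Letter values as documented for the -C switch in perlrun.
my %flag_value = (
    I => 1,     # STDIN is UTF-8
    O => 2,     # STDOUT is UTF-8
    E => 4,     # STDERR is UTF-8
    S => 7,     # I + O + E
    i => 8,     # UTF-8 is the default layer for input streams
    o => 16,    # UTF-8 is the default layer for output streams
    D => 24,    # i + o
    A => 32,    # @ARGV elements are UTF-8
    L => 64,    # make the above conditional on the locale
);

# Hypothetical helper: combine a string of -C letters into the numeric mask.
sub unicode_mask {
    my ($letters) = @_;
    my $mask = 0;
    for my $letter (split //, $letters) {
        die "unknown -C letter '$letter'\n" unless exists $flag_value{$letter};
        $mask |= $flag_value{$letter};
    }
    return $mask;
}

printf "-CS = %d, -CSD = %d, -CSDL = %d, this process: %d\n",
    unicode_mask('S'), unicode_mask('SD'), unicode_mask('SDL'), ${^UNICODE};
```

Running the script with and without `PERL_UNICODE` or `PERL5OPT` set shows which of the settings discussed in this thread actually reached the interpreter.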

