PerlMonks  

why no default unicode?

by perl-diddler (Chaplain)
on Mar 19, 2011 at 23:02 UTC ( [id://894199]=perlquestion )

perl-diddler has asked for the wisdom of the Perl Monks concerning the following question:

I'm often getting confused by unicode issues -- about when a file stream is in unicode vs. not, but the latest is in a terminal interactive prog where I tried to print a unicode character and got a 'wide-char' error.

Of course I can easily work-around the problem by adding:

binmode STDOUT, ':encoding(UTF-8)'; binmode STDERR, ':encoding(UTF-8)';
to the beginning of my program, but I'm not sure why it isn't *defaulting* to UTF-8.
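For readers who want to reproduce the symptom, here is a minimal sketch. It uses in-memory filehandles in place of STDOUT so it can run anywhere (that substitution, and the variable names, are assumptions for illustration); the binmode call is the same one shown above.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # U+263A, a code point above 255 (a "wide character")

# Collect warnings instead of printing them, so the difference is easy to see.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# 1. A handle with no encoding layer: Perl warns about the wide character.
my $plain_out = '';
open my $plain, '>', \$plain_out or die "open failed: $!";
print {$plain} $smiley;
close $plain;
my $warned_without_layer = grep { /Wide character/ } @warnings;

# 2. The same kind of handle after binmode adds an encoding layer: no warning.
@warnings = ();
my $layered_out = '';
open my $layered, '>', \$layered_out or die "open failed: $!";
binmode $layered, ':encoding(UTF-8)';
print {$layered} $smiley;
close $layered;
my $warned_with_layer = grep { /Wide character/ } @warnings;

printf "warnings without layer: %d\n", $warned_without_layer;
printf "warnings with layer:    %d\n", $warned_with_layer;
printf "bytes written with layer: %d\n", length $layered_out;
```

The encoding layer is what tells Perl how to turn characters into bytes; without it, Perl has only characters above 255 and no instruction for serialising them.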

I'm running from windows to linux using SecureCRT, which, in its session options, has its 'character encoding' set to UTF-8.

When I log in, if I type locale, I get:

LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
That **looks** like it's saying UTF-8 for a character encoding (this is a Suse11.2 system I'm logging into, BTW, from a Win7 (i.e. unicode supporting) system).

So why is perl *defaulting* to STDOUT being non-unicode?

Why do I need the binmode?

Sorry if this is unicode-first-grade, but this stuff looks like it should be so 'simple' -- yet *blech*. I've had other issues when operating on 'internet data' where I've experienced UTF nightmares, since you don't know the character encoding of the website's response until you look at the header. I mostly worked around that, until perl worked itself into a serious coredump about 3,500,000 statements / 70,000 data lines into the program. Some suggested I get to know "perl -d" ... *cough* ... I do, but I wasn't about to try tracking that down, so I just shelved the program to wait for a more reliable perl. (I did, FWIW, file a bug against Perl, which has yet to be addressed as far as I know.)

Any idea why perl isn't just 'doing the right thing' as it is so famous for doing? Thanks...

Re: why no default unicode?
by moritz (Cardinal) on Mar 19, 2011 at 23:08 UTC

    There are two good reasons. The first is backwards compatibility. Perl tries very hard not to break old programs, and there are a lot of old programs that would be broken by such a change.

    The second reason is that as it is now, a program as simple as

    while(<>) { print; }

    just works, i.e. it prints out the same data as it reads. If STDOUT defaulted to UTF-8, STDIN and other read operations would also need to default to UTF-8.

    And when that's the default, suddenly reading a non-UTF-8 file will either cause a fatal error or produce data that can't be interpreted correctly.
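To make the failure mode concrete, here is a small sketch using the core Encode module. The sample bytes are an assumption chosen for illustration: "café!" as ISO-8859-1, which is not valid UTF-8. Strict decoding dies, and lenient decoding silently replaces the bad byte.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin1_bytes = "caf\xE9!";    # "café!" encoded as ISO-8859-1

# Strict decoding: FB_CROAK makes decode() die on malformed input.
# decode() may modify its input when a CHECK value is given, so pass a copy.
my $copy     = $latin1_bytes;
my $strict   = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
my $strict_failed = defined $strict ? 0 : 1;

# Lenient decoding (the default): malformed bytes become U+FFFD.
my $lenient = decode('UTF-8', $latin1_bytes);
my $has_replacement = $lenient =~ /\x{FFFD}/ ? 1 : 0;

print "strict decode failed: ",          ($strict_failed   ? "yes" : "no"), "\n";
print "lenient decode lost data: ",      ($has_replacement ? "yes" : "no"), "\n";
```

With no layers at all, the same bytes would pass through `while(<>) { print; }` untouched, which is the behaviour the default preserves.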

      Ok, I get this...but I copied my 'UTF-8' screen output to a file (inspected it with hexdump -C) and it has the UTF-8 chars in it. When I ran it through your prog, it auto-defaulted to UTF-8!!!

      So then I tried cut/paste directly into perl. Again, the same cut/paste I put into the above file.

      Ran the prog again. Same thing -- no complaint. So now I have it outputting my 'utf-8' characters with no complaint, but when I try to do it via perl's unicode facility, it doesn't work.

      Pure guess -- it's interpreting it as a byte stream, so byte in / byte out...perl thinks it's all 'bytes', but the term interprets the input and output as UTF-8 (the term is setup to pass UTF-8 chars through on input as well).

      So basically, if I want to safely use unicode in perl, I need to pre-convert my unicode chars into utf-8 byte-strings, and output them as simple byte strings? ...(yup, that works)...
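The "pre-convert to bytes" approach described here could look roughly like the following sketch, which uses Encode::encode from the core distribution. The sample string is an assumption for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string containing U+2603 (SNOWMAN), a code point above 255.
my $text = "snowman: \x{2603}";

# Convert characters to UTF-8 bytes explicitly. Every element of $bytes
# is now below 256, so printing it needs no encoding layer.
my $bytes = encode('UTF-8', $text);

printf "characters: %d, bytes: %d\n", length $text, length $bytes;
print $bytes, "\n";    # no "Wide character" warning
```

This works, but it means encoding by hand at every output point; a layer set once with binmode does the same conversion automatically.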

      I guess I somehow thought that perl would now detect the terminal settings from the locale/environment setting and set the unicode-ness of STDIN/STDOUT/STDERR automatically. Is that something that would be a bad thing for perl to do? I'm sure I'm missing some obvious point(s) somewhere...

        When I ran it through your prog, it auto-defaulted to UTF-8!!!

        It did not, whatever you mean by that.

        Pure guess -- it's interpreting it as a byte stream, so byte in / byte out...perl thinks it's all 'bytes', but the term interprets the input and output as UTF-8 (the term is setup to pass UTF-8 chars through on input as well).

        Exactly.

        I guess I somehow thought that perl would now detect the terminal settings from the locale/environment setting and set the unicode-ness of STDIN/STDOUT/STDERR automatically. Is that something that would be a bad thing for perl to do?

        As I wrote before, it would make it impossible to process binary data (or any non-UTF-8 data) out of the box. People want to do that, independently of whether they are in a UTF-8 console or not.

Re: why no default unicode?
by BrowserUk (Patriarch) on Mar 19, 2011 at 23:16 UTC

    If you add an environment variable, set PERL5OPT=-CSD, Perl will use UTF-8 by default for STDIN/STDOUT/STDERR and all opens. Which may or may not be what you want.

    See perlrun -C & Perlrun Environment variables for details.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Well, I'd prefer it to use UTF-8 for the term if the term is set up to use UTF-8 (i.e. as expressed by the locale).

      For files, it should probably default to binary unless told otherwise.

      Since all my terms are UTF-8, how about a way to tell it to use UTF-8 on 'tty' devices, but default to binary on files? Too much intelligence built into the startup code, probably eh?...

      Either that, OR..."auto-switch": if detect widechar on output, then convert to UTF-8 bytes... That would be the most helpful -- since it knows I'm trying to output a wide-char, so it should (IMO) *try* to do the best it can and assume a UTF-8 output device...

      What would be the 'downsides' of that approach? (I.e. instead of the current approach of putting out a warning)...
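Nothing stops a script from applying the "UTF-8 on ttys only" policy itself. The sketch below is one way to do it; `locale_says_utf8` is a hypothetical helper, and its check of LC_ALL, LC_CTYPE and LANG (in that order) is modelled on the precedence that perlrun documents for -CL, not on Perl's own implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: does the first non-empty locale variable mention UTF-8?
sub locale_says_utf8 {
    my ($env) = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$name};
        next unless defined $value && length $value;
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

# Add the encoding layer only when the handle is a terminal and the
# locale indicates UTF-8; files and pipes stay byte-oriented.
for my $fh (\*STDOUT, \*STDERR) {
    binmode $fh, ':encoding(UTF-8)'
        if -t $fh && locale_says_utf8(\%ENV);
}

print "locale indicates UTF-8: ", (locale_says_utf8(\%ENV) ? "yes" : "no"), "\n";
```

The cost is that the script's output now differs depending on whether it is redirected, which is part of why this is left as an opt-in choice.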

        Read the linked documentation. If you don't want open to default to utf-8, then set PERL5OPT='-CS'.

        As for auto-detecting: there is no way for perl (or any other language) to determine the difference between an input file containing utf-8, and an input file containing arbitrary binary.

        Indeed, there is no way to distinguish between utf-8 and utf-16 or utf-32 or arbitrary binary. In this respect the entire unicode standard is terminally broken.
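A short illustration of the ambiguity, using the core Encode module. The two sample bytes are an assumption chosen because they are valid under more than one encoding: the data alone does not say which reading is intended.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";    # two bytes with no label attached

# Read as UTF-8, the bytes are one character: U+00E9 (e with acute).
my $as_utf8 = decode('UTF-8', $bytes);

# Read as ISO-8859-1, the same bytes are two characters: U+00C3, U+00A9.
my $as_latin1 = decode('ISO-8859-1', $bytes);

printf "as UTF-8:      %d character(s), first is U+%04X\n",
    length $as_utf8, ord $as_utf8;
printf "as ISO-8859-1: %d character(s), first is U+%04X\n",
    length $as_latin1, ord $as_latin1;
```

Both decodes succeed without error, so a detector has nothing but heuristics to go on.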


Re: why no default unicode?
by repellent (Priest) on Mar 20, 2011 at 07:21 UTC
    Here's the big secret about encodings: You need a priori knowledge of which encoding to use for each specific data stream.

    That means, you cannot effectively auto-detect which encoding to use simply by observing the data stream alone. An out-of-band message may be used to signal which encoding to use, as is typical with web browsers using the HTTP protocol.
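As a sketch of the out-of-band approach: the header value and body bytes below are made up for illustration, and `charset_from_content_type` is a hypothetical helper, not part of any HTTP library. The point is only that the encoding name arrives separately from the data and is then handed to Encode.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: pull the charset parameter out of a Content-Type value.
sub charset_from_content_type {
    my ($header) = @_;
    return $header =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i ? $1 : undef;
}

# Assumed sample response pieces (not taken from a real server).
my $content_type = 'text/html; charset=ISO-8859-1';
my $body_bytes   = "caf\xE9";

my $charset = charset_from_content_type($content_type);

# Decode only if the header named an encoding that Encode knows about;
# otherwise leave the body as raw bytes.
my $text = defined $charset && find_encoding($charset)
    ? decode($charset, $body_bytes)
    : undef;

print "declared charset: ", (defined $charset ? $charset : "(none)"), "\n";
printf "decoded length: %d\n", length $text if defined $text;
```

When no charset is declared, the safe fallback is to keep treating the body as bytes instead of guessing.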

    You pointed out your locale settings and suggested that Perl makes use of them. Having Perl set encodings for STDIN/STDOUT/STDERR based on locale would break any data stream that is not encoded as such. Not to mention, applying encoding to many data streams based on global (locale) settings violates what I mentioned earlier.

    By default, Perl assumes every stream has no encoding (1 character per byte) - it's safe (i.e. binary data won't break) and is a reasonable default (i.e. an otherwise implicit encoding based on locale is hard to see).
Re: why no default unicode?
by Eliya (Vicar) on Mar 19, 2011 at 23:55 UTC

    As an alternative to BrowserUk's suggestion, you can set the environment variable PERL_UNICODE="" (i.e. to the empty string), which is equivalent to the command line option -C, which is the same as -CSDL.  The L in there enables the other features (SD) depending on your locale.
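To check what a given PERL5OPT, PERL_UNICODE or -C setting actually did, a script can inspect its own handles. This is a minimal sketch; `has_utf8_layer` is a hypothetical helper built on the built-in PerlIO::get_layers, and `${^UNICODE}` is the read-only variable that reflects the -C setting.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the handle's layer list includes the utf8 flag.
sub has_utf8_layer {
    my ($fh) = @_;
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers($fh);
}

# Numeric value of the -C / PERL_UNICODE setting (0 when nothing is set).
printf "\${^UNICODE} = %d\n", ${^UNICODE};

# Report whether each standard handle ended up with a UTF-8 layer.
for my $pair ([STDIN => \*STDIN], [STDOUT => \*STDOUT], [STDERR => \*STDERR]) {
    my ($name, $fh) = @$pair;
    printf "%-6s utf8 layer: %s\n", $name, has_utf8_layer($fh) ? "yes" : "no";
}
```

Running it once plainly and once with `-CS` on the command line should show the difference on the three standard handles.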
