http://qs321.pair.com?node_id=906434


in reply to Re^4: Simplest Possible Way To Disable Unicode
in thread Simplest Possible Way To Disable Unicode

It doesn't say "Wide character".

Specific error message aside, Perl should never treat a number as a 'wide character' without explicit notification from the programmer that that is his intent.

c:\test>perl -we"print chr( 257 )" | wc -c
Wide character in print at -e line 1.
2

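
The one-liner above can be contrasted with an explicit encoding step. This is a minimal sketch, assuming the core Encode module is available; the variable names are illustrative only. It shows that chr(257) is a single element of a string, and that the two bytes counted by wc -c only appear once that element is encoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One string element whose ordinal is 257; no bytes are implied yet.
my $char = chr(257);

# Explicitly encode to a byte string; this is where two bytes come from.
my $bytes = encode('UTF-8', $char);

printf "characters: %d\n", length $char;
printf "bytes:      %d\n", length $bytes;

# Printing the already-encoded bytes does not trigger "Wide character".
binmode STDOUT;
print $bytes, "\n";
```

Encoding explicitly (or putting an :encoding(UTF-8) layer on the handle) states the intent up front, which is the "explicit notification" the post is asking for.
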
I've already pointed out the documentation is wrong.

No! You didn't. Nowhere prior to this post anywhere in this thread.

There is no such thing as Unicode number 0x20000, yet

So, the documentation is wrong! And the implementation is (silently) wrong!

That pretty much covers everything. Unicode support in perl is broken.

In Perl, a character is a number in 0 to UVMAX.

And that bullshit is exactly why it is so broken.

Because &^*&% like you will keep on conflating 'numbers' with 'characters'.

  1. UVMAX is CPU dependent.

    Typically 4294967296 or 18446744073709551616, but with other values possible.

  2. The term 'character' has no meaning outside of some mapping.

    Unless a number can be mapped to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language, it is just a number.

    And even when it can be so mapped, until it is mapped, it is still just a number.

    And any suggestion otherwise is just so much bullshit.

  3. And 4294967296, much less 18446744073709551616 cannot be mapped to 'a character' in any known or proposed mapping.

    Which makes this:

    In Perl [or any language], a character is a number in 0 to UVMAX.
    stand out as the total twaddle it is.

Unicode support in Perl is broken. And until people like you stop pretending that it isn't, it will stay that way.

Indeed, the longer people keep pretending that you can transparently handle the abortion that is Unicode, whether retro-fitting an existing language or implementing a new one, the longer it will be before we can evolve some sane semantics for handling MBCSs.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^6: Simplest Possible Way To Disable Unicode
by ikegami (Patriarch) on May 24, 2011 at 07:49 UTC

    Perl should never treat a number as a 'wide character' without explicit notification from the programmer that that is his intent.

    Judging by your example, I think you mean you don't want wide characters to automatically get encoded to UTF-8. (Correct me if I'm wrong.)

    What do you propose instead? I can think of a couple.

    • Dying like syswrite? I'm not sure that's better, but I could easily be convinced.

    • Silently convert the numbers to UTF-8? I definitely want at least a warning if non-bytes are passed to print when warnings are on. I don't care what output it produces. Currently, it also warns when warnings are off. That's not appropriate, but I think that's supposed to change.

    • Silently truncate the high bits? Same reply as previous.

    The term 'character' has no meaning outside of some mapping.

    Characters have no meaning outside a mapping, but the term does. It's simply the basic unit of a string.

    And even when it can be so mapped, until it is mapped, it is still just a number.

    I fully agree. That's why I said pack doesn't deal with Unicode. It just deals with numbers. So do chr, ord, substr, index, etc.

    Operators that do use mappings are lc, \d in regex patterns, etc.
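
The distinction drawn above can be sketched as follows. This is an illustrative example, not taken from the thread: the ordinals 0x100 and 0x663 are arbitrary choices, and the variable names are hypothetical. chr, ord, length, substr and index only move numbers around, while lc and \d have to consult a character mapping (Unicode, for ordinals above 255).

```perl
use strict;
use warnings;

# A string is a sequence of numbers; these operators never consult a charset.
my $s     = chr(0x100) . chr(0x663);
my $len   = length $s;               # number of elements
my $first = ord substr($s, 0, 1);    # ordinal of the first element
my $pos   = index $s, chr(0x663);    # position of the second element

# These operators give the numbers a meaning via a mapping.
my $lower    = ord lc chr(0x100);             # lowercase mapping of U+0100
my $is_digit = chr(0x663) =~ /^\d$/ ? 1 : 0;  # U+0663 is a decimal digit

printf "length=%d first=%d index=%d\n", $len, $first, $pos;
printf "lc(0x100)=0x%X is_digit=%d\n", $lower, $is_digit;
```
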

    And 4294967296, much less 18446744073709551616 cannot be mapped to 'a character' in any known or proposed mapping.

    No, but 4294967295 is a valid character.

    >perl -E"say ord chr 4294967295"
    4294967295

    Perl uses utf8 (not to be confused with UTF-8), an encoding whose charset consists of 2**72 characters. Only up to UVMAX is supported, though.
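
The round trip above can be reproduced in a way that does not depend on the build's integer width. This is a hedged sketch: it uses the common ~0 idiom to obtain the largest unsigned integer of the running perl, and picks 0x110000 (the first value past the Unicode range) as an arbitrary test ordinal. The 'utf8' warnings category is disabled only as a precaution against non-Unicode code point warnings.

```perl
use strict;
use warnings;
no warnings 'utf8';   # precaution: silence non-Unicode code point warnings

# All bits set: the largest unsigned integer (UV) this perl can hold.
my $uv_max = ~0;

# An ordinal just past the last Unicode code point (0x10FFFF).
my $beyond = 0x10FFFF + 1;

# chr and ord round-trip the number without consulting Unicode.
my $char       = chr $beyond;
my $round_trip = ord $char;

printf "UV max on this build: %s\n", $uv_max;
printf "chr/ord round trip:   0x%X -> length %d -> 0x%X\n",
    $beyond, length $char, $round_trip;
```

On typical builds the first line reports 4294967295 or 18446744073709551615, depending on whether perl was compiled with 32-bit or 64-bit integers.
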

    Unicode support in Perl is broken.

    I'm not going to discuss this because this thread has nothing to do with Unicode.

    The OP tried to send non-bytes to a file handle, and you tried to store something larger than a byte in a byte. A warning and dying aren't unwarranted.

      Perl uses utf8 (not to be confused with UTF-8),

      It is so sad to see apparently intelligent men make such stupid statements.

      Of course 'utf8' will be confused with 'UTF-8'. Search for the former using any search engine or on any reference site, and all you will find are references to the latter.

      an encoding whose charset consists of 2**72 characters.

      Wrong! At best this mythical 'utf8' stores 2**72 ordinal values that could be mapped to a charset.

      But as no such charset exists, nor any that contains even 0.000000000000003% of that stupidly huge number, the entire thing is totally fallacious.

      Numbers are not characters. They are just numbers.

      Those numbers can be code points that can be mapped to characters. But you cannot map a number that is greater than the number of characters that exist.

      And a 'character' has a very clearly defined meaning. Even in the standard you keep (mis)quoting: Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters.

      The fact that you think you know better says it all for me so I'm done. If you're after another 37 levels, you're on your own.


Re^6: Simplest Possible Way To Disable Unicode
by tchrist (Pilgrim) on May 24, 2011 at 18:59 UTC
    Unicode support in perl is broken.
    That isn’t even vaguely true, let alone concretely true. Just because one person does not understand something, or because another person doesn’t like something, does in no fashion mean that that something is somehow “broken”. To claim otherwise is tantamount to spreading leyendas negras and perilously close to spreading FUD. We need neither of those.

    Having fought my way through the many, many ways that Unicode does not work properly in various other languages like Java, C#, Python, Ruby, PHP, and Javascript, not to mention the original misguided implementation of Unicode support from Perl 5.6 that’s been thankfully redesigned since then, I am completely confident that Perl’s Unicode support is not only not broken, but also that the Unicode support in Perl is superior to that in all those languages I’ve just mentioned.

    Now, it is actually true that Unicode support has improved in the 5.14 release of Perl. However, Unicode support in Perl has been perfectly serviceable for many years now. To pretend that it is “broken” may be misunderstanding, it may be disagreement, and it may be bitter bluster, but it is simply and fundamentally not true.

    It is also misleading and harmful to hear repeated. It helps nothing and only hurts people, people who may be naïvely deceived by this facile deceit. Here is what you should do instead:

    • If you think it should work differently, then submit a patch.
    • If you think there is a bug, then file a bug report.
    • If you are unwilling to take either of those two constructive steps, then please do the world the courtesy of not repeating a simple-minded slogan that is so patently false, misleading, and hurtful.

    Those are the only reasonable choices. If none of those “appeals” to you, then please gain some proper perspective by seriously trying out those other languages’ implementations of Unicode support. Who knows, you might even like them better than you do Perl’s.

    If it irks you to paddle upstream all the time, then turn around and go the other way. Save yourself some grief — and the rest of us, too.

      Unicode support in perl *is* broken.

      If for no other reason than I cannot ignore it. Things that worked before it was added, no longer do. (And there are plenty of other reasons.)

      If that doesn't make sense to you, re-read the thread. If it still doesn't make sense, then you've not read closely enough.

      I might even agree with you that Perl's Unicode support is somewhat less broken than in many other languages, but you can't make a silk purse out of a sow's ear.

      The great thing about The Unicode Standard is that there are so many to choose from. Which makes the attempt to transparently support all of them, using an internal encoding that is none of them, heroic, but simply naive.

