http://qs321.pair.com?node_id=906613


in reply to Simplest Possible Way To Disable Unicode

Hello there!

Please don't let this topic die! This is a BIG problem. The situation, from my perspective, is:

Eras have passed and now we have:

and no one is happy!

Please, someone write the "Travel in Babel's lands with Perl in a pocket" tutorial.
If I may add something: I think the meaning of the English term "Encode" is a little misleading for non-native English speakers.

I discovered Babel some time ago, and I asked for wisdom about length in Size and anatomy of an HTTP response.
As I did there, I invite everyone interested to read (after the canonical texts: perluniintro, perlunitut, perlunifaq and perlunicode) also The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and http://perlgeek.de/en/article/encodings-and-unicode

Lor*

there are no rules, there are no thumbs..

Re^2: Simplest Possible Way To Disable Unicode
by BrowserUk (Patriarch) on May 25, 2011 at 10:17 UTC
    a lot of guru coders not so happy because some of their pack, syswrite or whatever else spells have lost the shining of primeval eras..

    That simply isn't what is going on here.

    The docs for pack say:

    • C   An unsigned char (octet) value.
    • W   An unsigned char value (can be greater than 255).
    • U   A Unicode character number.  Encodes to a character in character mode and UTF-8 (or UTF-EBCDIC in EBCDIC platforms) in byte mode.

    Now let's see what happens when we assign oversized values to other unsigned types:

    print unpack 'S', pack 'S', 65537;;
    1
    print unpack 'L', pack 'L', 2**32+1;;
    1
    print unpack 'Q', pack 'Q', 2**64+1;;
    18446744073709551615

    It silently wraps (or truncates) as is expected and normal.
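
    As a rough illustration of the wrapping described above, the sketch below round-trips oversized values through pack and unpack and compares the result with the value modulo 2**width. The helper name roundtrip is made up for this example, and the 'L' case assumes a perl built with 64-bit integers. The 'Q' case is left out because 2**64+1 is a floating-point value and, as the output above shows, it comes back as the largest 64-bit value rather than 1.

```perl
use strict;
use warnings;

# Hypothetical helper: pack a value with the given template and unpack it again.
sub roundtrip {
    my ( $template, $value ) = @_;
    return unpack $template, pack $template, $value;
}

# Each case is: template, field width in bits, oversized value.
for my $case ( [ S => 16, 65537 ], [ L => 32, 2**32 + 1 ] ) {
    my ( $template, $bits, $value ) = @$case;

    # Only the low-order bits survive, i.e. the value modulo 2**width.
    printf "%s: %.0f -> %d (value mod 2**%d = %d)\n",
        $template, $value, roundtrip( $template, $value ),
        $bits, $value % 2**$bits;
}
```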

    Contrast that with what now (since the advent of unicode support) happens with unsigned char values:

    print unpack 'C', pack 'C', 2**8+1;;
    Character in 'C' format wrapped in pack at (eval 17) line 1, <STDIN> line 9.
    1

    A dumb warning that can only be disabled by disabling *all* pack warnings. Don't forget the 'W' and 'U' types above.
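
    For anyone who wants to see the options side by side, here is a small sketch. It counts warnings for a plain pack 'C', for the same call with the 'pack' warning category lexically disabled, and for a call where the value is masked with & 0xFF first so that pack never sees an oversized value. The helper count_warnings is invented for this example, and the masking is only a workaround, not something the pack docs prescribe.

```perl
use strict;
use warnings;

# Hypothetical helper: run a code block, return its result and the number
# of warnings it raised.
sub count_warnings {
    my ($code) = @_;
    my $count = 0;
    local $SIG{__WARN__} = sub { $count++ };
    my $result = $code->();
    return ( $result, $count );
}

my $n = 257;    # one more than fits in an octet

# Plain pack: truncates to 1 and warns.
my ( $loud, $loud_warnings ) =
    count_warnings( sub { unpack 'C', pack 'C', $n } );

# Disabling the whole 'pack' category silences it, along with every
# other pack warning in this scope.
my ( $quiet, $quiet_warnings ) = count_warnings( sub {
    no warnings 'pack';
    unpack 'C', pack 'C', $n;
} );

# Masking first: pack only ever sees a value that fits.
my ( $masked, $masked_warnings ) =
    count_warnings( sub { unpack 'C', pack 'C', $n & 0xFF } );

print "plain:  $loud, warnings: $loud_warnings\n";
print "quiet:  $quiet, warnings: $quiet_warnings\n";
print "masked: $masked, warnings: $masked_warnings\n";
```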

    It is perfectly reasonable to expect silent truncation of oversized values with unsigned char types ('C'). Just as was the case with 'C' before the addition of unicode support; and just as is still the case with all other unsigned types. This is not an error, nor "sloppy coding"; it is the norm for these types.

    Now contrast this spurious warning with what happens when you use chr with oversized values:

    $s = chr( 257 );;
    print do{ use bytes; length $s, unpack 'C*', $s };;
    2 196 129

    Perl silently accepts this error, and erroneously constructs a multi-byte character.

    And you only discover this error when you try to print it:

    print $s, length $s;;
    Wide character in print at (eval 19) line 1, <STDIN> line 11.
    ─ü 1

    Which may not happen until dozens or hundreds of lines further on into the code; perhaps in another of your source files; perhaps in a module you didn't write or even know that you were (indirectly) using.

    That is the very worst kind of error situation: action at a distance.
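
    One way to narrow the distance between cause and symptom, sketched here under my own assumptions, is to test a string for characters above 0xFF at the point where it is built, and to encode explicitly before output. The helper has_wide_char is made up for this example; encode comes from the core Encode module. The two octets it produces for chr(257) are the same 196 and 129 seen in the use bytes output above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the string holds any character above 0xFF,
# the kind that triggers "Wide character in print" on a plain handle.
sub has_wide_char {
    my ($string) = @_;
    return $string =~ /[^\x00-\xFF]/ ? 1 : 0;
}

my $s = chr 257;

# One character as far as Perl is concerned, but not one octet.
printf "length=%d wide=%d\n", length $s, has_wide_char($s);

# Encoding explicitly turns the character into octets that print accepts.
my $octets = encode( 'UTF-8', $s );
printf "octets: %s\n", join ' ', unpack 'C*', $octets;
```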

    So, the problem is not (only) that "spells have lost the shining of primeval eras", but rather that the current, here-today-and-tomorrow state of play is that Perl issues spurious warnings for code that has always been (and, on the evidence of other similar current operations, still should be) considered normal. Meanwhile, it silently ignores a possible programmer error, then makes asinine assumptions and implements the wrong thing, in a way that makes such errors horribly difficult to track down.

    You cannot have it both ways. Fobbing this off with "documentation error" or "ancient sloppy coding practices" doesn't cut it.

    Either *all* oversized assignments to unsigned types should silently truncate; or *all* should warn.

    Either chr should be only for 8-bit bytes, and attempts to set oversized values should warn in situ; or chr should accept multi-byte ordinals, and print should know how to handle them.

    Except the latter is impossible because Unicode is such a crock.

    One solution would be to add a wchr function that accepted multi-byte ordinals. That would make it very clear that the programmer is expecting to program with MBCSs and allow chr to catch coding errors at source.
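
    To make the idea concrete, here is a user-level sketch of the split. Neither function exists in Perl: byte_chr is a hypothetical stand-in for a chr that refuses anything outside 0..255 and fails at the call site, and wchr is a hypothetical stand-in for the proposed function that asks for a wide character on purpose. A real implementation would live in the core; this only emulates the behaviour on top of the existing chr.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical: a chr that only accepts octet values and dies where the
# mistake is made, not where the string is eventually printed.
sub byte_chr {
    my ($ordinal) = @_;
    croak "byte_chr: ordinal $ordinal does not fit in 8 bits"
        if $ordinal < 0 || $ordinal > 0xFF;
    return chr $ordinal;
}

# Hypothetical: the proposed wchr, where a wide character is requested
# explicitly, so any ordinal is accepted.
sub wchr {
    my ($ordinal) = @_;
    return chr $ordinal;
}

print "byte_chr(65): ", byte_chr(65), "\n";
print "ord of wchr(257): ", ord( wchr(257) ), "\n";

# The oversized value is caught at source.
my $ok = eval { byte_chr(257); 1 };
print "byte_chr(257) failed: $@" unless $ok;
```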

    Another, in my opinion preferable, solution would be for pre-Unicode semantics to be followed everywhere unless a use Unicode; statement was seen.

    I.e., instead of having to try (and fail) to disable these changes with use bytes; when you don't want them, you ask for Unicode semantics when you do want them. Seems logical?

    Unfortunately, it is too late for that.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      ..ohhh

      I chose to speak ironically (spell, shine, ..) precisely because I did not have a clear idea of what was going on..
      Thanks for the explanation.

      Lor*
      there are no rules, there are no thumbs..