http://qs321.pair.com?node_id=1032975


in reply to [SOLVED] Unicode strings internals

If I'm reading your code correctly, the issue is that in your first case you have a properly formatted Perl string that contains UTF characters, but in the second you have a UTF-8 byte string, not a character string. The difference is discussed a bit in perluniintro and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

By explicitly invoking encode("UTF-8", ...), your mixed string contains bytes with the high bit set, but no wide characters. Outputting a byte string as binary is natural, but outputting a Perl string that contains wide characters does not map to bytes without specifying an encoding.
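
For what it's worth, here is a minimal sketch of that distinction (the file names and the sample character U+0444 are my own, chosen to match the \321\204 bytes in your dump): decode gives you a character string, encode gives you a byte string, and it is the I/O layer that decides how characters become bytes.

use strict;
use warnings;
use Encode qw(encode decode);

my $octets = "\xD1\x84";                 # two raw bytes: the UTF-8 encoding of U+0444
my $chars  = decode('UTF-8', $octets);   # one character: U+0444 (CYRILLIC SMALL LETTER EF)

printf "octets: length=%d\n", length $octets;   # 2
printf "chars:  length=%d\n", length $chars;    # 1

# A byte string can be written to a raw handle as-is ...
open my $raw, '>:raw', 'bytes.out' or die $!;
print {$raw} $octets;
close $raw;

# ... but a character string needs an encoding layer to become bytes.
open my $out, '>:encoding(UTF-8)', 'chars.out' or die $!;
print {$out} $chars;
close $out;

# Both files end up containing the same two bytes, 0xD1 0x84.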

Does this clarify? If you describe the task you are trying to accomplish, we can probably help with the appropriate set of I/O specifications.


#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re^2: Unicode strings internals
by vsespb (Chaplain) on May 10, 2013 at 17:03 UTC
    in your first case you have a properly formatted Perl string that contains UTF characters
    Yes
    but in the second you have a UTF-8 byte string, not a character string.
    No. The second case does look like a UTF-8 character string, because it prints "UTF IS ON" and "LENGTH DIFFERS"
      Note that if you modify line 8 to
      my $ascii_but_utf = '123';
      the output changes to
      SV = PV(0x22ae1d0) at 0x2300b20
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK)
        PV = 0x22d21f0 "123\321\204\321\204\321\204\321\204"\0
        CUR = 11
        LEN = 16
      ALL OK
      This is because the UTF8 flag being on is just a historical artifact of your initialization.
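
      A quick illustration of that point (my own toy example, not your code): the flag reflects the string's history, not its content.

      use strict;
      use warnings;

      # utf8::upgrade and utf8::is_utf8 are core functions; no "use utf8" is needed.
      my $plain    = "123";
      my $upgraded = "123";
      utf8::upgrade($upgraded);   # same characters, now stored in the internal UTF-8 format

      printf "plain:    %s\n", utf8::is_utf8($plain)    ? "flag on" : "flag off";   # flag off
      printf "upgraded: %s\n", utf8::is_utf8($upgraded) ? "flag on" : "flag off";   # flag on
      printf "compare:  %s\n", $plain eq $upgraded ? "equal" : "differ";            # equal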

      If we take a look at the two output files generated by these two cases, you'll note that both contain 11 bytes, despite the fact that the byte dump of the UTF-upgraded case suggests it should have output 19 bytes. This is because the internal representation of high-bit, 1-byte characters under Perl's implementation of UTF is multi-byte even though they cleanly map to 1-byte characters on output. You wouldn't expect these 1-byte characters to output a wide-character warning any more than you'd expect an ASCII character to.
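
      For example (a sketch with a character of my own choosing): "\xE9" has ordinal 233, so once the string is upgraded its internal representation is two bytes, yet it still comes out as a single byte and raises no warning.

      use strict;
      use warnings;
      use bytes ();   # load bytes.pm for bytes::length without enabling the pragma

      my $s = "\xE9";      # one character, ordinal 233
      utf8::upgrade($s);   # internally stored as two bytes now

      printf "length=%d bytes::length=%d\n", length($s), bytes::length($s);   # 1 and 2

      open my $fh, '>:raw', 'one_byte.out' or die $!;
      print {$fh} $s;      # exactly one byte (0xE9) is written, and no "Wide character" warning
      close $fh;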

      Second case does look like UTF-8 character string,

      You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string. When dealing with non-ASCII characters in Perl, rare is the case when you should actually be thinking about Perl's internal representation.
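
      Here's a sketch of why the flag (plus a length mismatch) can't settle it on its own; the example is mine: a string of UTF-8 octets that happens to get upgraded still represents bytes.

      use strict;
      use warnings;
      use Encode qw(encode);
      use bytes ();

      my $octets = encode('UTF-8', "\x{444}");   # a byte string: the two octets 0xD1 0x84
      utf8::upgrade($octets);                    # the flag turns on, but these are still bytes

      printf "is_utf8=%d length=%d bytes::length=%d\n",
          utf8::is_utf8($octets) ? 1 : 0, length($octets), bytes::length($octets);
      # prints: is_utf8=1 length=2 bytes::length=4

      # By the flag and the length mismatch this "looks like" a character string,
      # but writing it through an :encoding(UTF-8) layer would double-encode it.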


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        This is because the internal representation of high-bit, 1-byte characters under Perl's implementation of UTF is multi-byte even though they cleanly map to 1-byte characters on output.

        OK, you probably mean that some characters that are multi-byte in UTF-8 (i.e. non-ASCII-7-bit characters) map to the single-byte Latin-1 encoding (which is the "default" in some cases). These are the characters whose Unicode code points are greater than 127 but not greater than 255.

        Indeed. The "Wide character" warning is not thrown when the data can be mapped to Latin-1 (a single-byte encoding). In that case the data is also written as Latin-1, not UTF-8 (different bytes, which is often not what is expected).
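
        For instance (a sketch; the file names are mine): with no encoding layer, a string whose characters all fit in 0..255 is written silently as Latin-1 bytes, while a string containing a character above 255 triggers the warning and comes out UTF-8 encoded.

        use strict;
        use warnings;

        # Every character is <= 255: written as single Latin-1 bytes, no warning.
        open my $fh1, '>:raw', 'fits_latin1.bin' or die $!;
        print {$fh1} "\x{E9}";   # one byte on disk: 0xE9
        close $fh1;

        # Contains a character > 255: warns "Wide character in print"
        # and the data is written UTF-8 encoded instead (0xD1 0x84).
        open my $fh2, '>:raw', 'needs_utf8.bin' or die $!;
        print {$fh2} "\x{444}";
        close $fh2;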

        That answers my question. I was wrong in my assumption that the "Wide character" warning is thrown for any non-ASCII-7-bit character contained in a character string.

        What is a "wide character"?
        This is a term used both for characters with an ordinal value greater than 127, characters with an ordinal value greater than 255, or any character occupying more than one byte, depending on the context. The Perl warning "Wide character in ..." is caused by a character with an ordinal value greater than 255. With no specified encoding layer, Perl tries to fit things in ISO-8859-1 for backward compatibility reasons. When it can't, it emits this warning (if warnings are enabled), and outputs UTF-8 encoded data instead.
        This:
        You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string.
        I do not agree. My understanding is that strings with the UTF8 flag on and with length() != bytes::length() are character strings containing non-ASCII-7-bit characters. (One might want to check valid() also.)
        When dealing with non-ASCII characters in Perl, rare is the case when you should actually be thinking about Perl's internal representation.
        Yes, I agree. This is in the FAQ too.
        Please, unless you're hacking the internals, or debugging weirdness, don't think about the UTF8 flag at all. That means that you very probably shouldn't use is_utf8, _utf8_on or _utf8_off at all.
        But it looks to me that is_utf8, length, and bytes::length can still be used before a syswrite to a raw filehandle, to detect that the data is broken (i.e. to catch a programmer mistake at an early stage). That is, in that case syswrite will either terminate or write the data as Latin-1 (which is not what is expected in my case).
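
        Something like this is what I mean (a sketch; the sub name is mine, and it is only a heuristic, given the caveats above):

        use strict;
        use warnings;
        use bytes ();

        # Heuristic check: UTF8 flag on and character length differing from the
        # length of the internal byte representation.
        sub looks_like_unencoded_characters {
            my ($data) = @_;
            return utf8::is_utf8($data) && length($data) != bytes::length($data);
        }

        my $payload = "\x{444}";    # a programmer mistake: never encoded before writing
        die "refusing to syswrite what looks like unencoded character data\n"
            if looks_like_unencoded_characters($payload);

        open my $fh, '>:raw', 'out.bin' or die $!;
        syswrite $fh, $payload;
        close $fh;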