
This is because the internal representation of high-bit, 1-byte characters under Perl's implementation of UTF is multi-byte even though they cleanly map to 1-byte characters on output.

OK, you probably mean that some characters that are multi-byte in UTF-8 (i.e. characters outside 7-bit ASCII) map to a single byte in Latin-1 (which is the "default" encoding in some cases). These are the characters with code points greater than 127 and up to 255 in the Unicode standard.

Indeed. The "Wide character" warning is not thrown when the data can be mapped to Latin-1 (a single-byte encoding). Note also that in this case the data is written as Latin-1, not UTF-8 (different bytes, which is often not what is expected).

That answers my question. I was wrong to assume that the "Wide character" warning is thrown for any character outside 7-bit ASCII in a character string.

What is a "wide character"?
This is a term used both for characters with an ordinal value greater than 127, characters with an ordinal value greater than 255, or any character occupying more than one byte, depending on the context. The Perl warning "Wide character in ..." is caused by a character with an ordinal value greater than 255. With no specified encoding layer, Perl tries to fit things in ISO-8859-1 for backward compatibility reasons. When it can't, it emits this warning (if warnings are enabled), and outputs UTF-8 encoded data instead.
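The behaviour described above can be checked with a small sketch. It assumes a filehandle with no encoding layer (a temp file from the core File::Temp module); the helper name raw_print is made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: print $string to a fresh temp file that has no
# encoding layer, then return the raw bytes written and any warnings.
sub raw_print {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} $string;
    close $fh or die "close failed: $!";

    open my $in, '<:raw', $path or die "open failed: $!";
    local $/;                       # slurp mode
    my $bytes = <$in>;
    close $in;
    return ($bytes, \@warnings);
}

my $latin1 = "\x{E9}";              # U+00E9, fits in ISO-8859-1
utf8::upgrade($latin1);             # force the multi-byte internal form
my $wide   = "\x{263A}";            # U+263A, does not fit in one byte

my ($b1, $w1) = raw_print($latin1);
my ($b2, $w2) = raw_print($wide);

# Show the bytes on disk as hex, and how many warnings print raised.
printf "U+00E9: %s, warnings: %d\n", unpack('H*', $b1), scalar @$w1;
printf "U+263A: %s, warnings: %d\n", unpack('H*', $b2), scalar @$w2;
```

The first string should come out as the single Latin-1 byte e9 with no warning, even though it is stored as two bytes internally. The second should trigger "Wide character in print" and come out as the UTF-8 bytes e298ba.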

This:

You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string.

I do not agree. My understanding is that strings with the UTF8 flag on and with length() != bytes::length() are character strings containing characters outside 7-bit ASCII. (One might also want to check valid().)
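To make the disagreement concrete, here is a rough sketch that prints the flag and both lengths for three strings. The variable names are only illustrative, and the third case (UTF-8 octets that were upgraded without being decoded) is my assumption about what the quoted remark refers to.

```perl
use strict;
use warnings;
use bytes ();                # loads bytes::length without byte semantics
use Encode qw(decode);

my $octets   = "\xC3\xA9";                 # UTF-8 encoding of U+00E9, as bytes
my $decoded  = decode('UTF-8', "\xC3\xA9"); # one character, U+00E9
my $upgraded = "\xC3\xA9";                 # the same two octets...
utf8::upgrade($upgraded);                  # ...stored in the flagged form

# Collect (flag, length in characters, length in internal bytes) per string.
my %info;
for my $case ([octets => $octets], [decoded => $decoded], [upgraded => $upgraded]) {
    my ($name, $s) = @$case;
    $info{$name} = [ utf8::is_utf8($s) ? 1 : 0, length($s), bytes::length($s) ];
    printf "%-8s flag=%d length=%d bytes=%d\n", $name, @{ $info{$name} };
}
```

Both the decoded and the upgraded string have the flag on and length() != bytes::length(), so those two checks alone do not distinguish them.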

When dealing with non-ASCII characters in Perl, rare is the case when you should actually be thinking about Perl's internal representation.

Yes, I agree. This is in the FAQ too.

Please, unless you're hacking the internals, or debugging weirdness, don't think about the UTF8 flag at all. That means that you very probably shouldn't use is_utf8 , _utf8_on or _utf8_off at all.

But it looks to me like is_utf8, length and bytes::length can still be used before a syswrite to a raw filehandle, to detect that the data is broken (i.e. to catch a programmer mistake at an early stage). In this case syswrite will either terminate or write the data as Latin-1 (which is not what I expect in my case).
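A minimal sketch of the kind of guard I mean, using only utf8::is_utf8. The helper name assert_octets is hypothetical. It is a heuristic, not a proof: it would also reject a flagged ASCII-only string, and it cannot catch an unflagged character string whose code points are all below 256.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Encode qw(encode);

# Hypothetical guard: call it right before syswrite on a raw filehandle.
# It croaks if the data still carries the UTF8 flag, which usually means
# somebody forgot to encode a character string into octets.
sub assert_octets {
    my ($data) = @_;
    croak "character string passed where octets were expected"
        if utf8::is_utf8($data);
    return $data;
}

my $text   = "caf\x{E9} \x{263A}";      # character string (flag on)
my $octets = encode('UTF-8', $text);    # encoded octets (flag off)

my $accepted = eval { assert_octets($octets); 1 } ? 1 : 0;
my $rejected = eval { assert_octets($text);   1 } ? 0 : 1;

print "encoded data accepted: $accepted\n";
print "unencoded data rejected: $rejected\n";
```

In real code the call would look like syswrite($fh, assert_octets($buf)), so the mistake is reported before anything reaches the file.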

In reply to Re^4: Unicode strings internals by vsespb
in thread [SOLVED] Unicode strings internals by vsespb
