Re^5: Unicode strings internals

in reply to Re^4: Unicode strings internals
in thread [SOLVED] Unicode strings internals

Ok, you probably mean that some UTF-8 multi-byte characters (i.e. characters outside 7-bit ASCII) are mapped to the Latin-1 single-byte encoding

Yes, character mappings are Latin-1 (though I think it's actually CP-1252) by default. Sorry for the lack of clarity.

You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string.

An essential point, from my perspective, in understanding Unicode is distinguishing between characters and particular encodings of those characters: the difference between Unicode and UTF-8 (or UCS-2 or UTF-EBCDIC or...). The character string doesn't change when the string is upgraded to Perl's internal UTF-8 representation, even though the underlying bytes do. But this is largely a semantic argument.
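To make the character/encoding distinction concrete, here is a minimal sketch using the built-in utf8:: functions. The variable names are mine, and the code point U+00E9 is just a convenient example of a character that fits in one byte natively but takes two bytes in UTF-8.

```perl
use strict;
use warnings;

# One character, U+00E9, written as an escape so the example does not
# depend on the encoding of the source file.
my $native   = "\x{E9}";
my $upgraded = $native;
utf8::upgrade($upgraded);    # switch the internal storage to UTF-8

# Same characters: the two strings compare equal and have equal length.
my $same_chars  = $native eq $upgraded ? 1 : 0;
my $same_length = length($native) == length($upgraded) ? 1 : 0;

# Only the internal representation (and its flag) differs.
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# Encoding is a separate, explicit step that produces a byte string.
my $bytes = $native;
utf8::encode($bytes);        # now the two bytes 0xC3 0xA9
my $byte_count = length($bytes);

print "same chars: $same_chars, same length: $same_length\n";
print "flags: $flag_native/$flag_upgraded, encoded bytes: $byte_count\n";
```

The upgraded copy is still the same one-character string as far as Perl-level code is concerned; only utf8::encode produces something with a different length.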

before syswrite to raw filehandle

The use of a raw filehandle for output of non-binary data is what throws me in all this. I don't know your particular application, but it seems much more natural to filter for non-ASCII characters by testing for those characters directly. If you are concerned about corrupted data, you can do even better by also rejecting most control characters:

# Allow only tab (0x09), LF (0x0A), CR (0x0D), and printable ASCII (0x20-0x7E).
if ($string =~ /[^\x{9}\x{A}\x{D}\x{20}-\x{7E}]/) {
    # someone messed up
}
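If it helps with debugging, the same kind of test can report which characters were rejected. This is only a sketch: find_suspect_chars is a hypothetical helper name, and the character class is written with hexadecimal escapes for tab (0x09), LF (0x0A), CR (0x0D), and the printable range 0x20-0x7E.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, code point] pairs for every character
# that is not tab, LF, CR, or printable ASCII.
sub find_suspect_chars {
    my ($string) = @_;
    my @found;
    while ($string =~ /([^\x{9}\x{A}\x{D}\x{20}-\x{7E}])/g) {
        # pos() points just past the one-character match
        push @found, [ pos($string) - 1, ord($1) ];
    }
    return @found;
}

my @clean = find_suspect_chars("plain text\twith tab\r\n");
my @dirty = find_suspect_chars("caf\x{E9} \x{263A}\x{0}");

printf "offset %d: U+%04X\n", @$_ for @dirty;
```

Because the match works on characters rather than bytes, it behaves the same whether or not the string has been upgraded internally.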

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
