PerlMonks
Re^3: Character encoding of microns
by gone2015 (Deacon) on Feb 07, 2009 at 00:37 UTC ( [id://742047]=note )
As you say, the strings are apparently identical, except that one is a "byte" string while the other is "utf8". Note that in both cases the strings contain the UTF-8 form of micron; this is significant, as we will see...

What you are seeing when you print to STDOUT takes a little explaining...

By default STDOUT will have no encoding associated with it, so Perl will assume that it is LATIN1 (or ISO-8859-1).

When you print the "byte" string, Perl sends the bytes, untouched, to STDOUT -- because Perl treats "byte" strings as if they were LATIN1. The two bytes that make up the UTF-8 for micron are passed all the way to the screen. The screen understands UTF-8, so presto! you see the micron character.

When you print the "utf8" string, however, Perl knows that it should convert the string to LATIN1. So the two-byte UTF-8 sequence 0xC2:0xB5 is converted to the LATIN1 equivalent 0xB5 (!). That is passed all the way to the screen. BUT, since the screen actually understands UTF-8, the lone 0xB5 byte is nonsense to it, so it shows some error character -- in your case, apparently '?'; on my screen, something I will describe as a splodge.

You can tell STDOUT that it's a UTF-8 file-handle using binmode. In the output below, the lines listing layers come from PerlIO::get_layers, which returns information about how the file-handle is configured:

    clob: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- byte
    conv: 74:68:69:73:20:69:73:20:73:74:72:69:6E:67:20:77:69:74:68:20:C2:B5:20:69:6E:20:69:74 -- utf8
    unix perlio
    clob: 'this is string with µ in it'
    conv: 'this is string with ▒ in it'
    unix perlio encoding(utf-8-strict) utf8
    clob: 'this is string with Âµ in it'
    conv: 'this is string with µ in it'

So now you're asking yourself, where the MUMBLE did the 'Â' come from. Well... $clob is a byte string, which as far as Perl is concerned contains two LATIN1 characters, 0xC2 and 0xB5. Now that Perl knows that STDOUT is UTF-8, it spots the 0xC2 and encodes it as its UTF-8 equivalent 0xC3:0x82, and it spots the 0xB5 and encodes it as 0xC2:0xB5. And yes, UTF-8 0xC3:0x82 is 'Â'.

The message is that you have to be consistent: either keep everything as "byte" strings and leave the file-handle without an encoding, or use "utf8" strings and give the file-handle a UTF-8 encoding.
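As a rough sketch of the two consistent combinations (this is not the code from the original post): the helper bytes_written is hypothetical, the in-memory file-handle stands in for STDOUT so that the bytes can be inspected, and building $conv with Encode::decode is only an assumption about how a "utf8" string might be produced.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $string to a fresh in-memory file-handle opened
# with $layer, and return the bytes that were written as colon-separated hex.
sub bytes_written {
    my ($string, $layer) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh) or die "close failed: $!";
    return join ':', map { sprintf '%02X', ord } split //, $buffer;
}

my $clob = "\xC2\xB5";                # "byte" string holding the UTF-8 for micron
my $conv = decode('UTF-8', $clob);    # "utf8" string: the single character U+00B5

# "byte" string to a handle with no encoding: the bytes pass through untouched.
print "byte string, raw handle:   ", bytes_written($clob, ':raw'), "\n";             # C2:B5

# "utf8" string to a UTF-8 handle: the layer encodes the character on output.
print "utf8 string, UTF-8 handle: ", bytes_written($conv, ':encoding(UTF-8)'), "\n"; # C2:B5
```

Both combinations put the same two bytes on the wire. For the real STDOUT, the second case corresponds to calling binmode(STDOUT, ':encoding(UTF-8)') before printing "utf8" strings.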
But if you try mixing the two, confusion will reign. See PerlIO::encoding, binmode, open and use open for more on encodings and file-handles, and perluniintro for more on Perl and Unicode.
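To see the mixing problem in terms of bytes, here is a small sketch under stated assumptions: bytes_written is a hypothetical helper, the in-memory file-handle takes the place of STDOUT, and $conv is built with Encode::decode. The byte sequences in the comments are the ones described in the explanation above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $string to a fresh in-memory file-handle opened
# with $layer, and return the bytes that were written as colon-separated hex.
sub bytes_written {
    my ($string, $layer) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh) or die "close failed: $!";
    return join ':', map { sprintf '%02X', ord } split //, $buffer;
}

my $clob = "\xC2\xB5";                # "byte" string holding the UTF-8 for micron
my $conv = decode('UTF-8', $clob);    # "utf8" string: the single character U+00B5

# Mix 1: "utf8" string to a handle with no encoding. Perl converts the
# character to its LATIN1 byte, which a UTF-8 screen cannot make sense of.
print "utf8 string, raw handle:   ", bytes_written($conv, ':raw'), "\n";             # B5

# Mix 2: "byte" string to a UTF-8 handle. Each byte is treated as a LATIN1
# character and encoded again, which is where the extra 'Â' comes from.
print "byte string, UTF-8 handle: ", bytes_written($clob, ':encoding(UTF-8)'), "\n"; # C3:82:C2:B5
```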