Re^5: Character encoding of microns

if i actually type a micron into the string using Alt-0181 then i get the following output...

Apparently, your editor is operating in ISO-Latin1 mode and is entering the micron as a single byte (181 decimal = B5 hex).

You're then telling Perl that this string is UTF-8 (i.e. the decode("utf8",$clob) statement from oshalla's code), which is incorrect. For this reason, the conversion fails silently: B5 is a UTF-8 continuation byte and cannot start a valid sequence, so it is replaced by the Unicode replacement character U+FFFD, which when encoded as UTF-8 produces the three-byte sequence EF BF BD.
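A minimal sketch of that round trip, assuming Encode's default behaviour (no CHECK argument, so malformed input is replaced rather than reported). The helper name hex_bytes is made up for illustration and is not from oshalla's code:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# A micron as the editor wrote it: one Latin-1 byte, 0xB5.
my $latin1_bytes = "\xB5";

# Wrongly decode it as UTF-8. Without a CHECK argument, Encode
# substitutes U+FFFD for the malformed byte instead of dying.
my $chars = decode('utf8', $latin1_bytes);
printf "code point: U+%04X\n", ord $chars;

# Encoding that replacement character as UTF-8 gives three bytes.
print 'bytes: ', hex_bytes(encode('UTF-8', $chars)), "\n";
```

This should report U+FFFD and the bytes EF BF BD, matching the description above.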

When you interpret/display those three bytes as ISO-Latin1 characters they appear as "ï¿½", i.e. ï = EF, ¿ = BF, ½ = BD. This is how I (and I suppose everyone else, too) see them in your post, because the PM site isn't unicode aware. If your terminal displays those same three characters, this just means it isn't unicode aware either...
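To see why three characters show up rather than one, here is a small sketch that reads the same three bytes both ways. It only assumes the Encode module; the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+FFFD, as raw bytes.
my $bytes = "\xEF\xBF\xBD";

# Read as UTF-8: a single character, the replacement character.
my $as_utf8 = decode('UTF-8', $bytes);
printf "as UTF-8:   %d character(s), U+%04X\n",
    length($as_utf8), ord($as_utf8);

# Read as Latin-1: every byte becomes its own character.
my $as_latin1 = decode('iso-8859-1', $bytes);
printf "as Latin-1: %d character(s): %s\n",
    length($as_latin1),
    join(' ', map { sprintf 'U+%04X', ord } split //, $as_latin1);
```

A display that is not Unicode aware effectively takes the second route, one character per byte.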

IOW, everything behaves as expected. :)

Re^6: Character encoding of microns by joec_ (Scribe) on Feb 12, 2009 at 09:27 UTC
Hi,

So, how would I get round the problem of question marks being displayed both in my terminal for microns and in any output that is written to a file? When I open my output file in a hex editor, a 3F is displayed for the question mark, indicating that an actual ? is written and it isn't a foreign character. No strange chars like above show up.

I think I'm hitting a brick wall with this.

Thanks

Joe

Eschew obfuscation, espouse eludication!
Re^7: Character encoding of microns by graff (Chancellor) on Feb 13, 2009 at 06:10 UTC
The ASCII question mark is typically what you get when something tries to convert some unicode character into some non-unicode character set that does not contain the character in question. For example, the following script will produce "foo??", because the string literal has unicode Cyrillic for the fourth and fifth characters, but perl is being told to convert it to iso-8859-1 (Latin-1), which does not contain any Cyrillic characters -- that is, the unicode code points for Cyrillic cannot be mapped into the single-byte character codes for Latin-1, so the conversion produces "?" instead.

`perl -MEncode -le '$_=encode("iso-8859-1","foo\x{041d}\x{0418}"); print'`

It's not just perl that does this. Anything/everything that supports conversion between unicode and other encodings will behave the same way when faced with the same inappropriate task.

To figure out where the question marks are coming from, figure out the last point where the data were in unicode, and what sort of bad assumption is being made at that point to convert the encoding to something else.
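One way to hunt for that point, sketched under the assumption that the suspect conversion goes through Perl's Encode module: pass Encode::FB_CROAK as the CHECK argument so that encode dies on an unmappable character instead of quietly substituting "?". The variable names here are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same string as in the one-liner: "foo" plus two Cyrillic letters.
my $text = "foo\x{041d}\x{0418}";

# Default behaviour: unmappable characters silently become "?".
my $lossy = encode('iso-8859-1', $text);
print "default: $lossy\n";

# With FB_CROAK the same call dies and reports the offending character.
# Encode may modify its input when CHECK is set, so pass a copy.
my $copy   = $text;
my $strict = eval { encode('iso-8859-1', $copy, Encode::FB_CROAK) };
print 'strict:  ', (defined $strict ? "$strict\n" : "died: $@");
```

Running the real data through a strict call like this at each conversion step should show which step is the lossy one.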


"be consistent"