PerlMonks  

Re^5: Character encoding of microns

by almut (Canon)
on Feb 10, 2009 at 14:56 UTC ( [id://742779]=note )


in reply to Re^4: Character encoding of microns
in thread Character encoding of microns

if i actually type a micron into the string using Alt-0181 then i get the following output...

Apparently, your editor is operating in ISO-Latin1 mode and is entering the micron as a single byte (181 decimal = B5 hex).
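To see the difference between the two encodings, here is a minimal sketch (the variable names are just for illustration) that encodes the micro sign, U+00B5, both ways and prints the resulting bytes in hex:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The micro sign as a Perl character string (code point U+00B5).
my $micro = "\x{00B5}";

# ISO-8859-1 stores it as the single byte B5, UTF-8 as the two bytes C2 B5.
my $latin1 = encode("iso-8859-1", $micro);
my $utf8   = encode("UTF-8",      $micro);

printf "Latin-1: %s\n", join " ", map { sprintf "%02X", ord } split //, $latin1;
printf "UTF-8:   %s\n", join " ", map { sprintf "%02X", ord } split //, $utf8;
```

An editor in Latin-1 mode writes the first form; a UTF-8 decoder expects the second.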

You're then telling Perl that this string is UTF-8 (i.e. the decode("utf8",$clob) statement from oshalla's code), which is incorrect. For this reason, the conversion (silently) fails, and the invalid part (B5 is a continuation byte, so it cannot start a valid UTF-8 sequence) is replaced by the Unicode replacement character U+FFFD, which when encoded as UTF-8 produces the three-byte sequence EF BF BD.
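The following sketch reproduces that sequence of events. It assumes the input is the lone Latin-1 byte B5 followed by an ASCII "m"; the variable names are made up for the example and are not taken from oshalla's code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the editor saved the micro sign as the single Latin-1 byte 0xB5.
my $bytes = "\xB5m";

# Wrongly declare the bytes to be UTF-8. With the default CHECK value,
# Encode substitutes U+FFFD for the malformed byte instead of dying.
my $chars = decode("utf8", $bytes);
printf "code points: %s\n",
    join " ", map { sprintf "U+%04X", ord } split //, $chars;

# Encoding the result back to UTF-8 yields EF BF BD for the replacement character.
my $utf8 = encode("UTF-8", $chars);
printf "UTF-8 bytes: %s\n",
    join " ", map { sprintf "%02X", ord } split //, $utf8;
```

This should print the code points U+FFFD U+006D and the bytes EF BF BD 6D. No warning or error is raised along the way, which is why the failure is easy to miss.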

When you interpret/display those three bytes as ISO-Latin1 characters they appear as "ï¿½", i.e. ï = EF, ¿ = BF, ½ = BD. This is how I (and I suppose everyone else, too) see them in your post, because the PM site isn't Unicode-aware. If your terminal displays those same three characters, this just means it isn't Unicode-aware either...
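A short sketch of that misreading: it takes the three bytes EF BF BD and decodes them as ISO-8859-1, one character per byte. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 encoding of U+FFFD, as raw bytes.
my $raw = "\xEF\xBF\xBD";

# Misread each byte as one ISO-8859-1 character.
my $misread = decode("iso-8859-1", $raw);
my @names   = map { sprintf "U+%04X", ord } split //, $misread;
print "@names\n";    # U+00EF U+00BF U+00BD

# Re-encode as UTF-8 so that a Unicode-aware terminal shows the three glyphs.
binmode STDOUT, ":raw";
print encode("UTF-8", $misread), "\n";
```

The three code points are exactly the characters ï, ¿ and ½.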

IOW, everything behaves as expected. :)

Re^6: Character encoding of microns
by joec_ (Scribe) on Feb 12, 2009 at 09:27 UTC
    hi,

    So, how would I get round the problem of question marks being displayed both in my terminal for microns and in any output that is written to a file? When I open my output file in a hex editor, a 3F is displayed for the question mark, indicating that an actual ? is written and it isn't a foreign character. No strange chars like above show up.

    I think I'm hitting a brick wall with this.

    Thanks

    Joe

    Eschew obfuscation, espouse eludication!
      The ASCII question mark is typically what you get when something tries to convert a Unicode character into a non-Unicode character set that does not contain the character in question.

      For example, the following script will produce "foo??", because the string literal has Unicode Cyrillic for the fourth and fifth characters, but perl is being told to convert it to iso-8859-1 (Latin-1), which does not contain any Cyrillic characters -- that is, the Unicode code points for Cyrillic cannot be mapped into the single-byte character codes for Latin-1, so the conversion produces "?" instead.

      perl -MEncode -le '$_=encode("iso-8859-1","foo\x{041d}\x{0418}"); print'
      It's not just perl that does this. Anything/everything that supports conversion between unicode and other encodings will behave the same way when faced with the same inappropriate task.

      To figure out where the question marks are coming from, figure out the last point where the data were in unicode, and what sort of bad assumption is being made at that point to convert the encoding to something else.
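One way to hunt for that point, sketched here under assumptions: make the suspect conversion fail loudly instead of substituting "?". The helper name strict_encode is hypothetical; Encode::FB_CROAK is the standard Encode CHECK value that makes encode() die on unmappable characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the encoded bytes, or undef plus the error
# message when $text contains characters that $encoding cannot represent.
sub strict_encode {
    my ($encoding, $text) = @_;
    my $copy  = $text;    # with a CHECK value set, encode() may modify its argument
    my $bytes = eval { encode($encoding, $copy, Encode::FB_CROAK()) };
    return defined $bytes ? ($bytes, undef) : (undef, $@);
}

# The micro sign exists in Latin-1; the Cyrillic letters do not.
my ($ok)        = strict_encode("iso-8859-1", "10 \x{00B5}m");
my ($bad, $err) = strict_encode("iso-8859-1", "foo\x{041d}\x{0418}");

printf "micro sign: %s\n", join " ", map { sprintf "%02X", ord } split //, $ok;
print  "Cyrillic:   ", (defined $bad ? "encoded\n" : "failed: $err");
```

Wrapping each conversion in the pipeline this way, one at a time, should show which step is being asked to produce a character its target encoding lacks.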
