PerlMonks  

Re^3: Is there some universal Unicode+UTF8 switch?

by haj (Vicar)
on Sep 02, 2019 at 12:08 UTC ( [id://11105436]=note )


in reply to Re^2: Is there some universal Unicode+UTF8 switch?
in thread Is there some universal Unicode+UTF8 switch?

> > If the JSON Cyrillic is not UTF-8, what encoding is it in?
>
> It is in Unicode. You want UTF-8 - call like this (see formatversion changed to 1): https://ru.wikipedia.org/w/api.php?action=query&format=json&formatversion=1&list=allusers&auactiveusers&aufrom=Б The major problem of Perl as I see it (see the module name question higher) that it thinks of UTF-8 and Unicode as something of the same kind while these are two completely different things. From here all its (de|en)coding oops. IMHO.

I'm sorry, but this is just plain wrong. Perl knows what UTF-8 is (a representation of Unicode code points as bytes) and what Unicode is (a mapping of characters to numbers). This is not a problem of Perl.

If you write that your JSON "is in Unicode", this makes sense for a Perl string which has been properly decoded. Text strings in Perl can contain Unicode characters, and for these strings the term "encoding" doesn't make any sense. Internally Perl might store them UTF-8 encoded, but this is irrelevant for Perl users and has occasionally led users of Devel::Peek to wrong conclusions about the nature of their data.

But you cannot store a file (source or data) "in Unicode", and you cannot get an HTTP response "in Unicode". Whenever data enters or leaves the Perl program or your source code editor, you need to decide on an encoding for Unicode strings. Whenever you "see" a Unicode character in an editor window, a console or a web page, some software had to do the mapping from encoded octets to the Unicode code point, and from there to the glyph which is displayed on your screen.
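To make the boundary idea concrete, here is a minimal sketch using the core Encode module. The octets are hard-coded as a stand-in for whatever a file or HTTP response would deliver (that is my assumption for illustration, not something from the thread); they are the UTF-8 encoding of the Cyrillic letter Б from the URL above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from a file or an HTTP response body
# (hard-coded here for illustration): the UTF-8 encoding of U+0411.
my $octets = "\xD0\x91";

# Decoding at the input boundary turns two octets into one character.
my $text = decode('UTF-8', $octets);

printf "octets: %d, characters: %d, code point: U+%04X\n",
    length($octets), length($text), ord($text);

# Encoding at the output boundary: the encoding is chosen explicitly.
my $out = encode('UTF-8', $text);
print "round trip ok\n" if $out eq $octets;
```

Inside the program, $text is just one character with the number 0x411. Only $octets and $out have an encoding.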

That said, some Microsoft programs offer to save a file "in Unicode" and then write the data in UTF-16LE encoding. This often leads to confusion, as does their use of "ANSI encoding" when they mean Windows code page 1252. There is no Perl pragma to tell Perl that source files are encoded in UTF-16 or Windows-1252.
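As a rough illustration of why "saved in Unicode" says nothing about the bytes on disk, the following sketch encodes one and the same character with the three encodings mentioned here. The sample character é (U+00E9) is my choice, picked because Windows-1252 can represent it; the encoding names are the ones the Encode module accepts.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One Unicode character: LATIN SMALL LETTER E WITH ACUTE.
my $text = "\x{E9}";

# Encode the same character three ways and keep the octets as hex.
my %hex;
for my $enc ('UTF-8', 'UTF-16LE', 'cp1252') {
    $hex{$enc} = unpack 'H*', encode($enc, $text);
    printf "%-8s %s\n", $enc, $hex{$enc};
}
```

The three lines show c3a9, e900 and e9: same character, three different byte sequences, so a reader of the data has to know which encoding was used.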
