PerlMonks  

Re^2: UTF-8 and Unicode the hard way

by Anonymous Monk
on May 09, 2022 at 16:49 UTC


in reply to Re: UTF-8 and Unicode the hard way
in thread UTF-8 and Unicode the hard way

Hmm. Well, that doesn't work either, though.

Using
$answer = encode("UCS-2BE", $answer);
results in \u0000 in front of EVERY character in output ...

But using
$answer = decode('UTF-8', $answer);
produces a "wide character in output" error.

Replies are listed 'Best First'.
Re^3: UTF-8 and Unicode the hard way
by haj (Vicar) on May 09, 2022 at 20:51 UTC

    Corion provided the correct answer, but you failed to verify it. If you print a decoded character with a code point above 0xFF to a filehandle that has no encoding layer, you get the "Wide character" warning. This is exactly what happens when you print the result of your own substitutions:

    $ perl -E "print qq(\x{100})"
    Wide character in print at -e line 1.
    Ā
    

    Printed output needs to be encoded into a byte stream which the receiving side is able to understand. In many cases like contemporary Unix terminals, UTF-8 is a good guess - which is the encoding your $answer came from.
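
A minimal sketch of the round trip described above, assuming the input really is UTF-8 bytes and the receiving side expects UTF-8. The variable names are made up for illustration; only decode() and encode() from the Encode module are real API.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a UTF-8 source: "ă" is the two bytes C4 83.
my $bytes = "\xc4\x83";

# decode() turns the UTF-8 bytes into a Perl character string
# holding a single character, U+0103.
my $chars = decode('UTF-8', $bytes);

# Encode again right before output, so print only ever sees bytes
# and has no reason to warn about wide characters.
my $out = encode('UTF-8', $chars);
print $out, "\n";
```

Work on $chars inside the program and encode only at the point of output.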

      (OP here -- sorry, I should have obtained a username before starting this)

      Sorry also for the delay, I stopped to do more tests based on the useful information you've all given me. The "wide character" error happens if I'm trying to decode a string that already has what I have been calling "unicode" extended characters in it. (Remember I said this is all a great mystery to me and I'd really like it to all go away forever? That includes my incorrect terminology.) That is, if it's already got characters such as \x{103}, trying to decode them will produce that error. This turns out to be because one of my data sources sends extended characters in one format and one in a different format (this is an API that has to merge data from several sources for a single output stream).

      Or, more concretely: One of my data sources sends lower-case-a-with-breve ă as \xc4\x83, which is the kind that does need translating for my purposes, and the other data source sends it as \x{103}, which for my purposes is already translated into the format I need. decode('UTF-8') works properly on the former and errors on the latter, which seems to be correct behavior based on what you've said. I didn't realize the two data sources were doing it differently (neither of them has any documentation of what they do, alas) and I picked the wrong horse for my previous test.

      The reason I was calling the former of those "UTF-8" and the latter "Unicode" was because of pages like https://www.utf8-chartable.de/unicode-utf8-table.pl?start=256&names=-&utf8=string-literal, where I look up characters for when I have to translate them by brute force ... I'm still not sure of the correct term for the longer Unicode encoding where ă is \x{103} (AKA U+0103).

      Anyway, thank you! It does look like decode() does the right thing, when its user isn't dumb.

        For your given example, this utility might prove illuminating. You can see that the character you describe has the code point U+0103 and is encoded in UTF-8 as the two bytes C4 83. Your second data source sends the code point (\x{103}) and your first sends the UTF-8 bytes (\xc4\x83). You will have to treat the two sources differently if you want to handle them both successfully.
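
One way to treat the two sources differently is sketched below. It assumes you know, per source, whether it delivers raw UTF-8 bytes or already-decoded characters. The helper name normalize_source and its flag argument are hypothetical, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: the caller states whether this source sends
# raw UTF-8 bytes (true) or already-decoded characters (false).
sub normalize_source {
    my ($text, $is_utf8_bytes) = @_;
    return $is_utf8_bytes ? decode('UTF-8', $text) : $text;
}

my $from_a = normalize_source("\xc4\x83", 1);  # source sending UTF-8 bytes
my $from_b = normalize_source("\x{103}",  0);  # source sending characters

# Both now hold the same single character.
printf "U+%04X U+%04X\n", ord($from_a), ord($from_b);
```

Deciding per source, rather than guessing from the string contents, avoids decoding the same data twice.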


        🦛

        The writeup tips do say <i> tags are allowed, so I have no idea why it tried to make that first italicized paragraph into a node link. Sorry about that.
Re^3: UTF-8 and Unicode the hard way
by Corion (Patriarch) on May 09, 2022 at 17:54 UTC

    You shouldn't get a wide character in output error from your call to decode(...). Can you please show the relevant code and data that produces that output?

      produces a "wide character in output" error.

      hmm, are you saying your string is a garbage mix of UTF-8 and Unicode Code Points? Please provide the output of

      use Data::Dumper; local $Data::Dumper::Useqq = 1; print(Dumper($s));
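
As a rough illustration of why that dump helps: with Useqq set, a string of UTF-8 bytes and a string holding the decoded character come out as visibly different escape sequences. The helper dump_string is made up for this sketch; the Data::Dumper settings are real.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: dump one string on a single line with escapes visible.
sub dump_string {
    my ($s) = @_;
    local $Data::Dumper::Useqq  = 1;  # escape non-printable and non-ASCII data
    local $Data::Dumper::Terse  = 1;  # omit the "$VAR1 =" prefix
    local $Data::Dumper::Indent = 0;  # keep the result on one line
    return Dumper($s);
}

my $dump_bytes = dump_string("\xc4\x83");  # two UTF-8 bytes
my $dump_chars = dump_string("\x{103}");   # one character, U+0103
print "bytes: $dump_bytes\n";
print "chars: $dump_chars\n";
```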

      $answer = encode("UCS-2BE", $answer); results in \u0000 in front of EVERY character in output

      encode does not add the 6-character string \u0000, but it should indeed add a zero byte in front of ASCII characters (and more).
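
A small sketch of that zero byte, using a made-up two-character string. UCS-2BE stores every character as two bytes, high byte first, so an ASCII letter gets a leading 00.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "A" is U+0041 and "ă" is U+0103; each becomes two bytes in UCS-2BE.
my $ucs2 = encode('UCS-2BE', "A\x{103}");

# Show the resulting bytes as hex digits.
my $hex = unpack 'H*', $ucs2;
print "$hex\n";   # prints 00410103
```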


      [Sorry, this was meant to be a reply to the OP, not Corion]

Re^3: UTF-8 and Unicode the hard way
by Anonymous Monk on May 10, 2022 at 07:26 UTC

    What do you mean by "unicode", then? By itself, Unicode is a correspondence between characters (e.g. "Ы", CYRILLIC CAPITAL LETTER YERU) and integers (in this case, 0x042B or 1067), except unlike ASCII or other single-byte encodings, the integers aren't limited to the range of [0, 255]. Do you need to transform the UTF-8-encoded text into an array of integers? In order to represent Unicode code points as bytes, you need a Unicode encoding; UTF-8, UTF-16, UCS-2BE are all examples of those.

    Perl has a native representation for Unicode code points, which it calls wide characters: for Perl programs, they are like bytes, but have values above 255 and are interpreted according to Unicode rules. Since files and input/output streams contain bytes, not Unicode wide characters, there needs to be an additional encoding layer between them and Unicode. Which is why it's important to read at least perlunitut before attempting to work with Unicode in Perl.
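
The encoding layer mentioned above can be sketched with an in-memory filehandle, so the bytes that would reach a file are easy to inspect. The variable names are illustrative; on a real stream the same layer could be applied with binmode(STDOUT, ':encoding(UTF-8)').

```perl
use strict;
use warnings;

# One wide character, U+0103, as Perl holds it internally.
my $chars = "\x{103}";

# Open an in-memory filehandle with a UTF-8 encoding layer;
# whatever is printed to it is encoded on the way out.
open my $fh, '>:encoding(UTF-8)', \my $buffer or die "open failed: $!";
print {$fh} $chars;
close $fh;

# The buffer now holds bytes, not characters.
print unpack('H*', $buffer), "\n";   # prints c483
```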
