PerlMonks  

Re^2: Perl strings questions

by bliako (Monsignor)
on Jun 02, 2021 at 12:29 UTC ( [id://11133417]=note )


in reply to Re: Perl strings questions
in thread Perl strings questions

I have updated my post re: mixing strings

Re^3: Perl strings questions
by haj (Vicar) on Jun 02, 2021 at 14:25 UTC

    Thanks, that clarifies some things. Yet, the Python code does not mix text and binary.

    As far as I read the code, the binary stuff is BASE64 encoded. Well, yes, unfortunately "encoding" is used for a lot of things.

    Let me try to explain the difference:

    • UTF-8 is an encoding to map Unicode characters to bytes. Unicode characters are identified by their code point. For ASCII characters and control characters like LINE FEED, the code point is equal to the "traditional" byte value, and also to the UTF-8 mapping. Perl's interface identifies characters by their code point, so you get a lowercase Greek alpha by chr(945) or by "\x{3b1}". You can also use the names as in choroba's example: "\N{GREEK SMALL LETTER ALPHA}".
    • BASE64 is an encoding to map a stream of bytes, each in the range 0..255 ("binary data"), to a stream of bytes, each representing an ASCII character. The result happens to be valid UTF-8 (see above).
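To make the two bullets concrete, here is a minimal sketch using the core modules Encode and MIME::Base64. The sample bytes and variable names are just for illustration; treat it as one way to see the difference, not the only one.

```perl
use strict;
use warnings;
use charnames ':full';               # needed for \N{...} on perls older than 5.16
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Three spellings of the same character, GREEK SMALL LETTER ALPHA (U+03B1).
my $alpha = chr(945);
my $same  = ( $alpha eq "\x{3b1}" && $alpha eq "\N{GREEK SMALL LETTER ALPHA}" ) ? 1 : 0;

# UTF-8 maps characters to bytes: this one character becomes two bytes.
my $utf8_hex = unpack 'H*', encode( 'UTF-8', $alpha );

# For ASCII and LINE FEED, code point, byte value and UTF-8 encoding coincide.
my $ascii_hex = unpack 'H*', encode( 'UTF-8', "A\n" );

# BASE64 maps arbitrary bytes to ASCII characters (second argument: no line ending).
my $b64 = encode_base64( "\x00\xff\x80", '' );

print "$same $utf8_hex $ascii_hex $b64\n";    # prints "1 ceb1 410a AP+A"
```

Note that the BASE64 output consists of ASCII characters only, which is why it passes through a UTF-8 layer unchanged.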

    Binary data will in most cases contain bytes in the range 128..255. Their UTF-8 encoding is not equal to their byte value. If you encode such bytes in UTF-8, Perl interprets their byte values as code points: Unicode has code points in that range with (not so) surprising similarity to ISO-8859-1. The code point for ö is U+00F6, but its UTF-8 encoding has two bytes, X'C3B6'. So, if you encode binary data in UTF-8, the result differs from the input, but the process is deterministic and reversible.
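The ö example can be checked with a short round trip. This is only a sketch with the core Encode module; the single-byte input is my own choice for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One byte of "binary data" in the range 128..255.
# Perl treats its value as the code point U+00F6.
my $binary = "\xF6";

# UTF-8 encoding turns that single byte into the two bytes C3 B6.
my $encoded = encode( 'UTF-8', $binary );
my $hex     = unpack 'H*', $encoded;

# Decoding gives back a string that compares equal to the original.
my $restored = decode( 'UTF-8', $encoded );

printf "%d byte -> %s (%d bytes), round trip %s\n",
    length($binary), $hex, length($encoded),
    ( $restored eq $binary ? 'ok' : 'FAILED' );
# prints "1 byte -> c3b6 (2 bytes), round trip ok"
```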

    However, it is up to the receiving side to decode a UTF-8 stream into binary data and not into a Unicode string. Perl happens to do that (because, as you wrote, it makes no difference), but not many other languages do. In general, you cannot decode a UTF-8 stream into binary if it contains one or more characters with a code point greater than 255.
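The limit at code point 255 can be shown with utf8::downgrade, which refuses to turn a string into bytes if any character is too wide. The helper name utf8_to_binary is hypothetical, made up for this sketch.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a UTF-8 byte stream and return the result as
# binary data, or undef if any character has a code point above 255.
sub utf8_to_binary {
    my ($stream) = @_;
    my $chars = decode( 'UTF-8', $stream );

    # With a true second argument, utf8::downgrade returns false
    # instead of dying when a character does not fit into one byte.
    return utf8::downgrade( $chars, 1 ) ? $chars : undef;
}

my $ok  = utf8_to_binary("\xC3\xB6");    # U+00F6 fits into a byte
my $bad = utf8_to_binary("\xCE\xB1");    # U+03B1 does not

print defined $ok  ? "ok: "  . unpack( 'H*', $ok ) . "\n" : "ok: not binary\n";
print defined $bad ? "bad: " . unpack( 'H*', $bad ) . "\n" : "bad: not binary\n";
```

The first call yields the single byte F6, the second yields undef because alpha has no one-byte representation.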

      Re: Python does not mix binary and string: well, I thought that hashlib.sha256(encoded).digest() was a binary hash. I am positive that I printed it to check, but I have no time right now, so I will update tomorrow. The rest is very useful and I will read it tomorrow.

      UPDATE: the digest() prints out b'+\xbd\xd0Z\xda;\x05\xbb\x80\x058(.' etc
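For comparison on the Perl side, here is a sketch with the core Digest::SHA module. The input "abc" is my own stand-in, not the data from the Python code. The raw digest is 32 binary bytes, and BASE64 turns it into plain ASCII.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256 sha256_hex);
use MIME::Base64 qw(encode_base64);

my $input = 'abc';    # stand-in input, an assumption for this sketch

# sha256() returns the raw digest: 32 bytes, each in the range 0..255.
my $digest = sha256($input);

# Count the bytes that are outside the ASCII range.
my $high_bytes = () = $digest =~ /[\x80-\xff]/g;

# BASE64 maps the binary digest to ASCII characters only.
my $b64 = encode_base64( $digest, '' );

printf "%d bytes, %d of them above 127\n", length($digest), $high_bytes;
print "hex:    ", sha256_hex($input), "\n";
print "base64: $b64\n";
```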
