PerlMonks  

Re: Perl strings questions

by haj (Vicar)
on Jun 02, 2021 at 11:05 UTC ( [id://11133410]=note )


in reply to Perl strings questions

Corion and choroba have given good advice, so just some extras:
my $unicode_string = "&#945;&#946;&#947;&#945;abc123";
# unicode chars mixed with lower-ascii

This is not a unicode string. (choroba notes below that it originally was, but PerlMonks mangles these within code blocks.) It is an HTML encoding of a unicode string in plain ASCII. To get a unicode string from this, do:

use HTML::Entities;
my $html_string = "&#945;&#946;&#947;&#945;abc123";
my $unicode_string = decode_entities($html_string);
2) Also, I have a question about why I need to encode in "UTF-8". Does that make sure that the "double-bytes" and possible "single-bytes" are all becoming a stream of "single-bytes"?

After UTF-8 encoding, you end up with a stream of bytes. Each ASCII character becomes one byte, and every other character becomes a sequence of two to four bytes.
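A minimal sketch of that point, assuming the core Encode module. The sample string is the one from the question, built from code points so that the snippet does not depend on how the source file itself is encoded:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string: alpha, beta, gamma followed by six ASCII characters.
my $text  = "\x{3b1}\x{3b2}\x{3b3}abc123";

# Byte string: the same text as a stream of UTF-8 bytes.
my $bytes = encode('UTF-8', $text);

my $char_count = length($text);    # 9 characters
my $byte_count = length($bytes);   # 12 bytes: 2 per Greek letter, 1 per ASCII character

printf "%d characters, %d bytes\n", $char_count, $byte_count;
```

The two lengths differ because length counts characters in the first string and bytes in the second.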

However, you do not strictly need to encode in UTF-8. The text encoding is an agreement between the sender and the receiver. It can be established by explicit specification (example: Content-Type: text/html; charset=utf-8), by some standard or default (examples: XML defaults to UTF-8, HTML historically defaulted to ISO-8859-1), or just by the developers talking to each other over a beer (not recommended). These days, UTF-8 is highly recommended because it is able to represent any unicode character in a consistent way.
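To illustrate why the agreement matters, here is a small sketch (again assuming the core Encode module) in which the receiver decodes the same two bytes under two different assumptions. The variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes as they might arrive from a sender.
my $wire = "\xC3\xB6";

# Receiver assumes UTF-8: one character, U+00F6.
my $as_utf8   = decode('UTF-8', $wire);

# Receiver assumes ISO-8859-1: two characters, U+00C3 and U+00B6.
my $as_latin1 = decode('ISO-8859-1', $wire);

printf "UTF-8:      %d character(s), first is U+%04X\n",
    length($as_utf8), ord($as_utf8);
printf "ISO-8859-1: %d character(s), first is U+%04X\n",
    length($as_latin1), ord($as_latin1);
```

Both decodings succeed, which is why a wrong assumption tends to show up as garbled text rather than as an error.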

I am trying to find the safe way to do things when strings are mixed,
My recommendation: Just don't do that. Text and binary data don't mix well in a simple string. Finding the borders is hard to do in a safe way since binary data may occasionally look like text.

Replies are listed 'Best First'.
Re^2: Perl strings questions
by choroba (Cardinal) on Jun 02, 2021 at 11:10 UTC
    > This is not a unicode string

    It was, but PerlMonks can't display it in a <code> block :-(

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      Thanks for the info! I've added it to my article (a use utf8; would then be appropriate).

      So when I write &#945;&#946;&#947;&#945;abc123 inline, then it displays as αβγαabc123 - and when I write αβγαabc123 in a code block it displays as &#945;&#946;&#947;&#945;abc123. I didn't know that.

Re^2: Perl strings questions
by bliako (Monsignor) on Jun 02, 2021 at 12:29 UTC

    I have updated my post re: mixing strings

      Thanks, that clarifies some things. Yet, the python code does not mix text and binary.

      As far as I read the code, the binary stuff is BASE64 encoded. Well, yes, unfortunately "encoding" is used for a lot of things.

      Let me try to explain the difference:

      • UTF-8 is an encoding to map unicode characters to bytes. Unicode characters are identified by their code point. For ASCII characters and control characters like LINE FEED, the code point is equal to the "traditional" byte value, and also to the UTF-8 mapping. Perl's interface identifies characters by their code points, so you get a lowercase Greek alpha by chr(945) or by "\x{3b1}". You can also use the names as in choroba's example: "\N{GREEK SMALL LETTER ALPHA}".
      • BASE64 is an encoding to map a stream of bytes, each of which is in the range 0..255 ("binary data"), to a stream of bytes, each of which represents an ASCII character. The result happens to be valid UTF-8 (see above).
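Both bullets can be checked with a short sketch, assuming the core modules charnames and MIME::Base64. The variable names are made up for this example:

```perl
use strict;
use warnings;
use charnames ':full';               # loaded automatically by newer Perls
use MIME::Base64 qw(encode_base64);

# Three spellings of the same character, GREEK SMALL LETTER ALPHA (U+03B1).
my $by_chr  = chr(945);
my $by_hex  = "\x{3b1}";
my $by_name = "\N{GREEK SMALL LETTER ALPHA}";
my $same    = ($by_chr eq $by_hex && $by_hex eq $by_name) ? 1 : 0;

# BASE64 maps arbitrary bytes (here every value 0..255) to ASCII characters.
my $binary = pack('C*', 0 .. 255);
my $b64    = encode_base64($binary, '');   # '' suppresses the line breaks
my $ascii  = ($b64 =~ m{\A[A-Za-z0-9+/=]+\z}) ? 1 : 0;

print "same character: $same, ASCII only: $ascii\n";
```

Since the BASE64 output contains only ASCII characters, it reads the same whether the receiver treats it as ASCII, ISO-8859-1 or UTF-8.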

      Binary data will in most cases contain bytes in the range 128..255. Their UTF-8 encoding is not equal to their byte value. If you encode such bytes in UTF-8, Perl interprets their byte values as code points: Unicode has code points in that range with (not so) surprising similarity to ISO-8859-1. The code point for ö is U+00F6, but its UTF-8 encoding has two bytes, X'C3B6'. So, if you encode binary data in UTF-8, the result differs from the input, but the process is deterministic and reversible.
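The ö example can be reproduced in a few lines. This is only a sketch assuming the core Encode module, with a one-byte string standing in for real binary data:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "Binary data": a single byte with the value 0xF6 (246).
my $binary  = "\xF6";

# Encoding treats the byte value as a code point (U+00F6) and yields two bytes.
my $encoded = encode('UTF-8', $binary);
my $hex     = unpack('H*', $encoded);      # "c3b6"

# Decoding reverses the step and gives the original byte back.
my $decoded = decode('UTF-8', $encoded);
my $round_trip_ok = ($decoded eq $binary) ? 1 : 0;

print "encoded as $hex, round trip ok: $round_trip_ok\n";
```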

      However, it depends on the receiving side to decode a UTF-8 stream into binary data and not into a unicode string. Perl happens to do that (because, as you wrote, it makes no difference), but not many other languages do. In general, you cannot decode a UTF-8 stream into binary if it contains one or more characters with a code point greater than 255.
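One way to make that limit explicit on the receiving side is sketched below. The helper name utf8_stream_to_binary is hypothetical; the sketch assumes the core Encode module and the built-in utf8::downgrade, which returns false when its second argument is true and the string holds a code point above 255:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: turn a UTF-8 stream back into binary data,
# or return undef if the stream carries a character above code point 255.
sub utf8_stream_to_binary {
    my ($stream) = @_;
    my $chars = decode('UTF-8', $stream);
    return utf8::downgrade($chars, 1) ? $chars : undef;
}

# Bytes 0xF6 0x00 0xFF survive the trip through UTF-8.
my $ok  = utf8_stream_to_binary(encode('UTF-8', "\xF6\x00\xFF"));

# A Greek alpha (U+03B1) has no single-byte value, so this yields undef.
my $bad = utf8_stream_to_binary(encode('UTF-8', "\x{3b1}"));

print defined $ok  ? "binary recovered\n" : "not binary\n";
print defined $bad ? "binary recovered\n" : "not binary\n";
```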

        Re: python does not mix binary and string, well I thought this hashlib.sha256(encoded).digest() was a binary hash. I am positive that I printed it to see, but right now I have no time, so I will update tomorrow. The rest is very useful and I will read it tomorrow.

        UPDATE: the digest() prints out b'+\xbd\xd0Z\xda;\x05\xbb\x80\x058(.' etc
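For comparison, a rough Perl counterpart of that Python call might look like the sketch below. It assumes the core modules Digest::SHA, Encode and MIME::Base64; the input text is just the sample string from this thread, not the data from the Python code:

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256);
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

my $text      = "\x{3b1}\x{3b2}\x{3b3}abc123";
my $encoded   = encode('UTF-8', $text);       # the hash function wants bytes
my $digest    = sha256($encoded);             # 32 raw bytes, i.e. binary data
my $printable = encode_base64($digest, '');   # 44 ASCII characters

printf "digest: %d bytes, base64: %d characters\n",
    length($digest), length($printable);
```

The raw digest is binary, and the BASE64 step is what makes it safe to put into a text string.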
