Perl strings questions

bliako has asked for the wisdom of the Perl Monks concerning the following question:

Wise Monks,

I know that in a Perl string the 0x0 character has no special meaning (unlike C) and so one can mix binary data (range 0-255, i.e. it may contain 0x0) with a "normal printed" string (i.e. with chars in 1-127) and unicode strings. So, the following will work(?):

my $str = "hello"; # no special chars, 1-127
my $sha256 = Digest::SHA::sha256("abc123"); # bytes 0-255
my $hmac = Digest::SHA::hmac_sha512("a message", "a key"); # bytes 0-255
# edit, the following is mangled by PM's editor, imagine greek letters here:
my $unicode_string = "&#945;&#946;&#947;&#945;abc123"; # unicode chars mixed with lower-ascii
my $buffer = ""; # buffer to concatenate above into and POST them
for($str, $sha256, $hmac, $unicode_string){
  $buffer .= Encode::encode("UTF-8", $_);
}
my $b64 = base64_encode($buffer);
my $HTTPheader = "ABC: $b64";
$ua->POST("aurl" ... $b64 ... $HTTPheader ...);
$ua->GET("aurl" ... $b64 ... $HTTPheader ...);
$ua->POST("aurl" ... $buffer ... $HTTPheader ...);
$ua->GET("aurl" ... $buffer ... $HTTPheader ...);
$ua->GET("aurl" ... $unicode_string ... $HTTPheader ...);

1) Is this the correct way to do this?

2) Also, I have a question about why I need to encode in "UTF-8". Does that make sure that the "double-bytes" and possible "single-bytes" are all becoming a stream of "single-bytes"?

3) How do I treat the $buffer, in Perl, before I do a POST and GET, assuming the receiver is liberal in what it accepts? Obviously a base64 is safe but under what conditions can I send $unicode_string as is. Will sending $buffer (as is, after treated with Encode) work?

4) Is Encode::encode("UTF-8", $sha256) altering my binary data? Is it harmful on strings with binary data?

(Please feel free to correct my terminology; I have tried to avoid encodings for too long. Note also that I am trying to find the safe way to do things when strings are mixed; I don't have a particular requirement.)

EDIT: Actually, I am trying to translate some python code into Perl (see https://docs.kraken.com/rest/#section/Authentication/Headers-and-Signature):

    postdata = urllib.parse.urlencode(data)
    encoded = (str(data['nonce']) + postdata).encode()
    message = urlpath.encode() + hashlib.sha256(encoded).digest()

And (@haj) I wanted to mix binary and non-binary strings like they do in message. I ended up with:

my $postdata = Encode::encode('UTF-8', "x=1&y=2&z=greektext"); # for example,
my $p1 = "$nonsense".'&'.$postdata; # yes & needed
my $p1_utf8 = Encode::encode_utf8($p1);
my $api_sha256 = Digest::SHA::sha256($p1_utf8);
my $message = Encode::encode_utf8($api_path)
  . Encode::encode_utf8($api_method)
  . $api_sha256; #<< last one is binary

# ... and post after some more massaging
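For reference, a literal Perl rendering of the three Python lines quoted in the EDIT could look like the sketch below. The nonce, POST data, and URL path are made-up placeholder values, the POST data is assumed to be URL-encoded already, and the remaining signing steps are not covered. Python's `str.encode()` defaults to UTF-8, so only the text pieces are encoded here, while the raw SHA-256 digest is appended untouched.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Digest::SHA qw(sha256);

# Placeholder values for illustration only (not from the original post).
my $nonce    = '1616492376594';
my $postdata = "nonce=$nonce&ordertype=limit";   # assumed to be URL-encoded already
my $urlpath  = '/0/private/AddOrder';

# encoded = (str(data['nonce']) + postdata).encode()
my $encoded = encode('UTF-8', $nonce . $postdata);

# message = urlpath.encode() + hashlib.sha256(encoded).digest()
# Only the text part is encoded; the 32 raw digest bytes are appended as they are.
my $message = encode('UTF-8', $urlpath) . sha256($encoded);

printf "message is %d bytes long\n", length $message;
```

Encoding before hashing also matters on the Perl side: as far as I know, the `Digest::SHA` functions refuse strings that contain characters above 255, so text has to be turned into bytes first.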

bw, bliako


Replies are listed 'Best First'.
Re: Perl strings questions by Corion (Patriarch) on Jun 02, 2021 at 08:47 UTC
> 1) Is this the correct way to do this?

No, not entirely - you should only `encode` things that are strings, not binary data.

> 3) How do I treat the $buffer, in Perl, before I do a POST and GET, assuming the receiver is liberal in what it accepts? Obviously a base64 is safe but under what conditions can I send $unicode_string as is. Will sending $buffer (as is, after treated with Encode) work?

You should encode your data for the request manually unless you are sure that the UserAgent you're using does the encoding and adds the proper `Content-Type` headers.

> 4) Is Encode::encode("UTF-8", $sha256) altering my binary data? Is it harmful on strings with binary data?

Yes, it is harmful unless the receiving end expects your data to be UTF-8-encoded.
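To make the answer to 4) concrete: every byte in the range 128..255 becomes two bytes when it is run through `Encode::encode`. A minimal sketch, using a hand-picked four-byte string next to the digest from the question:

```perl
use strict;
use warnings;
use Encode qw(encode);
use Digest::SHA qw(sha256);

# Four raw bytes; two of them are above 127.
my $bytes   = "\x00\x7f\x80\xff";
my $encoded = encode('UTF-8', $bytes);

printf "before: %s\n", unpack('H*', $bytes);     # 007f80ff
printf "after:  %s\n", unpack('H*', $encoded);   # 007fc280c3bf

# The same happens to a real digest: it is 32 bytes long before encoding,
# and longer afterwards if it contains any byte above 127.
my $digest  = sha256('abc123');
my $mangled = encode('UTF-8', $digest);
printf "digest: %d bytes, after encode: %d bytes\n",
    length $digest, length $mangled;
```

A receiver that expects the raw 32-byte digest would see different bytes and a different length.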
Re: Perl strings questions by choroba (Cardinal) on Jun 02, 2021 at 08:54 UTC
There's too much hand-waving (or ellipses) to answer with code. But try it yourself.

server.pl

    #! /usr/bin/perl
    use warnings;
    use strict;

    use Plack::Builder;
    use Plack::Request;
    use Data::Dumper;

    my $APP = sub {
        my ($env) = @_;
        my $req = 'Plack::Request'->new($env);
        [200, [], [Dumper($req)]]
    };

    builder {
        mount '/' => $APP;
    };

client.pl

    #!/usr/bin/perl
    use warnings;
    use strict;

    use Encode;
    use Digest::SHA;
    use MIME::Base64 qw{ encode_base64 };
    use LWP::UserAgent;

    my $ua = 'LWP::UserAgent'->new;
    my $url = 'http://localhost:5000';

    my $str = 'hello'; # no special chars, 1-127
    my $sha256 = Digest::SHA::sha256('abc123'); # bytes 0-255
    my $hmac = Digest::SHA::hmac_sha512('a message', 'a key'); # bytes 0-255
    my $unicode_string = "\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}ab123"; # unicode chars mixed with lower-ascii
    my $buffer = ""; # buffer to concatenate above into and POST them
    for($str, $sha256, $hmac, $unicode_string){
        $buffer .= Encode::encode('UTF-8', $_);
    }
    my $b64 = encode_base64($buffer);
    my @HTTPHeader = (ABC => $b64);
    my $response = $ua->post($url, @HTTPHeader, Content => $b64); # <-- Change to your liking.

    use Data::Dumper;
    print Dumper $response->content;
    ...

Now, run the server in one terminal and the client in another.

`map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
Re: Perl strings questions by kikuchiyo (Hermit) on Jun 02, 2021 at 09:54 UTC
It helps if you think about it like this: There is the inside of your program, and there is the outside world, and there is (or at least, should be) one clean and definite interface between the two.

In the inside of your program you use strings that consist of (potentially wide) characters. The outside world deals with a stream of octets. When you read something from the outside world into your program, you convert (decode) from octets to a string of characters, and when you emit something from your program to the outside world, you convert (encode) the characters into a stream of octets.

The precise mechanics of those encode/decode steps depend on the application and the kind of protocol used. There isn't (and cannot be) a single answer whether you should or should not use `Encode::encode("UTF-8", ...)`.

In your HTTP example, the need to encode depends on what the other side requires. Again, we can't tell in general. Presumably, it's documented by the service you're trying to POST to. The encoding you use should match the contents specified in the Content-Type header: if it's `text/html; charset=UTF-8`, then you must encode as UTF-8, if it's e.g. `image/png`, you must not.
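The inside/outside model can be sketched in a few lines. This is only an illustration under assumed inputs: the incoming octets are hard-coded here instead of being read from a socket or file, and UTF-8 is assumed to be the agreed encoding on both sides.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from outside: UTF-8 for Greek alpha, beta.
my $incoming = "\xce\xb1\xce\xb2";

# Inbound boundary: octets -> characters.
my $text = decode('UTF-8', $incoming);
printf "%d octets in, %d characters inside\n", length $incoming, length $text;

# Inside the program, work on characters (here: uppercase the Greek letters).
my $upper = uc $text;

# Outbound boundary: characters -> octets.
my $outgoing = encode('UTF-8', $upper);
printf "outgoing octets: %s\n", unpack 'H*', $outgoing;   # ce91ce92
```

Operations such as `length` and `uc` only give sensible answers for text on the character side of the boundary.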
Re: Perl strings questions by haj (Vicar) on Jun 02, 2021 at 11:05 UTC
Corion and choroba have given good advice, so just some extras:

    my $unicode_string = "&#945;&#946;&#947;&#945;abc123"; # unicode chars mixed with lower-ascii

This is not a unicode string. (choroba notes below that it originally was, but PerlMonks mangles these within code blocks.) It is an HTML encoding of a unicode string in plain ASCII. To get a unicode string from this, do:

    use HTML::Entities;
    my $html_string = "&#945;&#946;&#947;&#945;abc123";
    my $unicode_string = decode_entities($html_string);

> 2) Also, I have a question about why I need to encode in "UTF-8". Does that make sure that the "double-bytes" and possible "single-bytes" are all becoming a stream of "single-bytes"?

After UTF-8 encoding, you end up with a stream of single-byte characters (octets). However, you do not strictly need to encode in UTF-8. The text encoding is an agreement between the sender and the receiver. This can be done by explicit specification (example: `Content-Type: text/html; charset=UTF-8`), by some standard or defaults (examples: XML defaults to UTF-8, HTML historically defaulted to ISO-8859-1), or just by the developers talking to each other over a beer (not recommended). These days, UTF-8 is highly recommended because it is able to represent any unicode character in a consistent way.

> I am trying to find the safe way to do things when strings are mixed,

My recommendation: Just don't do that. Text and binary data don't mix well in a simple string. Finding the borders is hard to do in a safe way since binary data may occasionally look like text.
Re^2: Perl strings questions by choroba (Cardinal) on Jun 02, 2021 at 11:10 UTC
> This is not a unicode string

It was, but PerlMonks can't display it in a `<code>` block :-(
Re^3: Perl strings questions by haj (Vicar) on Jun 02, 2021 at 11:23 UTC
Thanks for the info! I've added it to my article (a `use utf8;` would then be appropriate). So when I write `&#945;&#946;&#947;&#945;abc123` inline, it displays as αβγαabc123, and when I write αβγαabc123 in a code block, it displays as `&#945;&#946;&#947;&#945;abc123`. I didn't know that.
Re^2: Perl strings questions by bliako (Monsignor) on Jun 02, 2021 at 12:29 UTC
I have updated my post re: mixing strings.
Re^3: Perl strings questions by haj (Vicar) on Jun 02, 2021 at 14:25 UTC
Thanks, that clarifies some things. Yet, the python code does not mix text and binary. As far as I read the code, the binary stuff is BASE64 encoded.

Well, yes, unfortunately "encoding" is used for a lot of things. Let me try to explain the difference:

UTF-8 is an encoding to map unicode characters to bytes. Unicode characters are identified by their code point. For ASCII characters and control characters like `LINE FEED`, their code point is equal to their "traditional" byte value, and also to their UTF-8 mapping. Perl's interface identifies characters by their code point, so you get a lowercase greek alpha by `chr(945)` or by `"\x{3b1}"`. You can also use the names as in choroba's example: `"\N{GREEK SMALL LETTER ALPHA}"`.

BASE64 is an encoding to map a stream of bytes, each of which is in the range 0..255 ("binary data"), to a stream of bytes, each of which represents an ASCII character. The result happens to be valid UTF-8 (see above).

Binary data will in most cases contain bytes in the range 128..255. Their UTF-8 encoding is not equal to their byte value. If you encode such bytes in UTF-8, it is like Perl interpreting their byte values as code points: Unicode has code points in that range with (not so) surprising similarity to ISO-8859-1. The code point for ö is `U+00F6`, but its UTF-8 encoding has two bytes `X'C3B6'`.

So, if you encode binary data in UTF-8, the result is different, but the process is deterministic and reversible. However, it depends on the receiving side to do a decoding of a UTF-8 stream into binary data and not into a unicode string. Perl happens to do that (because, as you wrote, it makes no difference), but not many other languages do. In general, you cannot decode a UTF-8 stream into binary if it contains one or more characters with a code point greater than 255.
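The claims in this reply are easy to check from Perl itself. The following is a small sketch using only core modules; the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use MIME::Base64 qw(encode_base64);

# Two spellings of the same character, GREEK SMALL LETTER ALPHA.
my $alpha = chr(945);
print "same character\n" if $alpha eq "\x{3b1}";

# U+00F6 (o with diaeresis) is one character but two bytes in UTF-8.
my $o_hex = unpack 'H*', encode('UTF-8', chr(0xF6));
print "U+00F6 as UTF-8: $o_hex\n";    # c3b6

# All 256 byte values survive an encode/decode round trip in Perl.
my $binary    = join '', map { chr } 0 .. 255;
my $roundtrip = decode('UTF-8', encode('UTF-8', $binary));
print "round trip ok\n" if $roundtrip eq $binary;

# Base64 output consists of ASCII characters only.
my $b64 = encode_base64($binary, '');    # '' suppresses the line breaks
print "base64 is ASCII\n" if $b64 =~ /\A[A-Za-z0-9+\/=]+\z/;
```

The round trip only works because every character in `$binary` has a code point of at most 255; a string containing `$alpha` could not be mapped back to single bytes this way.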
Re^4: Perl strings questions by bliako (Monsignor) on Jun 02, 2021 at 17:32 UTC
Re: Perl strings questions by kikuchiyo (Hermit) on Jun 02, 2021 at 14:55 UTC
Re: code in the update: This part is almost certainly wrong:

    my $postdata = Encode::encode('UTF-8', "x=1&y=2&z=greektext"); # for example,
    my $p1 = "$nonsense".'&'.$postdata; # yes & needed
    my $p1_utf8 = Encode::encode_utf8($p1);

You're encoding your text twice.
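The effect of encoding twice only shows up once the text contains non-ASCII characters; for a pure ASCII string the second pass changes nothing. A minimal sketch with a single Greek letter standing in for the "greektext" value:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "z=" followed by GREEK SMALL LETTER ALPHA, as a character string.
my $text = 'z=' . chr(0x3b1);

my $once  = encode('UTF-8', $text);   # correct: characters -> bytes
my $twice = encode('UTF-8', $once);   # wrong: bytes treated as characters again

printf "once:  %s\n", unpack 'H*', $once;    # 7a3dceb1
printf "twice: %s\n", unpack 'H*', $twice;   # 7a3dc38ec2b1
```

Any digest or signature computed over `$twice` would differ from one computed over `$once`.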
Re^2: Perl strings questions by bliako (Monsignor) on Jun 02, 2021 at 17:29 UTC
Thanks, that should have been urlencode() then!
Re: Perl strings questions by Anonymous Monk on Jun 03, 2021 at 00:41 UTC
I don't see much mention here of Perl's UTF-flag, even though it is discussed in the perldoc for Encode. The essence of UTF-encoding is that, if(!) you know to treat the string as "UTF-encoded," it provides a way to encode Unicode code-points (characters ...) in a byte-stream. But Perl is much older than UTF, so it might encounter what are intended to be byte-streams which coincidentally contain "UTF indicator" bytes. Perl implemented a hidden flag to indicate whether `eq` should or should not use Unicode-aware comparisons against the values.
Re^2: Perl strings questions by choroba (Cardinal) on Jun 03, 2021 at 14:47 UTC
Using the flag in Perl code is a code smell. You can set the flag on any string, and you can clear it on any string. The flag doesn't know where the value comes from or what encoding it originally used. The function is_utf8 is also named incorrectly, as it in fact tells you whether the value uses the wide-character representation internally. See #131685 for a related discussion.
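A small sketch of why the flag says nothing about the content: the two strings below hold the same four characters and compare equal, yet `utf8::is_utf8` gives different answers because only the internal storage differs. The variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $bytes_repr = "caf\xe9";       # four characters, stored one byte each
my $wide_repr  = $bytes_repr;
utf8::upgrade($wide_repr);        # same characters, different internal storage

printf "flag: %d vs %d\n",
    utf8::is_utf8($bytes_repr) ? 1 : 0,
    utf8::is_utf8($wide_repr)  ? 1 : 0;
print "strings are equal\n" if $bytes_repr eq $wide_repr;
```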
Re^2: Perl strings questions by Your Mother (Archbishop) on Jun 03, 2021 at 01:22 UTC
Side-note for the side-show: Perl and Unicode were both born in 1987.

