What does utf8::upgrade actually do.

syphilis has asked for the wisdom of the Perl Monks concerning the following question:

Re: What does utf8::upgrade actually do. by dave_the_m (Monsignor) on Feb 17, 2021 at 09:04 UTC
Perl's strings are formally lists of codepoints. So the following applies:

`@codepoints = map ord($_), split //, $s1;
$s2 = join '', map chr($_), @codepoints;
ok($s1 eq $s2);
ok(length($s1) == length($s2));`

How perl internally stores those codepoints is up to perl, and perl-level code mostly needn't care about the difference. XS code, on the other hand, needs to know about it if it's going to start rummaging around accessing the individual bytes making up the string's storage.

Currently perl uses two storage formats: the traditional one byte per codepoint, and the utf8 variable-length encoding. The encoding is indicated by the SVf_UTF8 flag. You can't guarantee which encoding will be used; that's up to perl. For example, at the moment:

`$s1 = "abc\x80";
$s2 = $s1;          # currently SVf_UTF8 not set; string uses 4 bytes of storage
$s2 .= "\x{100}";   # currently perl upgrades to SVf_UTF8 and converts the 0x80 and 0x100 into multi-byte representations
chop($s2);          # currently perl doesn't downgrade; the 0x80 codepoint still stored as 2 bytes
ok($s1 eq $s2);
ok(length($s1) == length($s2));`

utf8::upgrade() and utf8::downgrade() are just ways of forcing the encoding format of the internal representation, which is useful for demonstrating bugs in modules that make assumptions about it. Note that they don't change the semantics of the string: perl thinks it still has the same length and codepoints. To continue the example above:

`utf8::downgrade($s2);   # the 0x80 codepoint now stored as 1 byte
ok(length($s1) == length($s2));
ok($s1 eq $s2);`

What an XS module does when it wants to process a list of bytes is of course up to the XS module's author. However, just using the current bytes of the internal representation is probably a poor choice: two strings which are semantically identical at the perl level but have different internal representations will give different results (e.g. the $s1 above initially and the $s2 after the chop()).

If there is no sensible interpretation of the meaning of codepoints > 0xff, then I would suggest the XS code should check the SVf_UTF8 flag and, if it is set, try to downgrade the string, and croak if that is not possible.

Dave.
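The downgrade-or-croak suggestion can be approximated at the Perl level, before the string ever reaches XS. The following is only a minimal sketch: the helper name `bytes_or_croak` is hypothetical and not part of any module discussed here. It relies on the documented behaviour of `utf8::downgrade($s, 1)`, which returns false instead of dying when a codepoint above 0xff is present.

```perl
use strict;
use warnings;
use Carp qw( croak );

# Hypothetical helper (an assumption for illustration): return a copy of $str
# stored one byte per codepoint, or croak if that is impossible.
sub bytes_or_croak {
    my ($str) = @_;                 # work on a copy, leave the caller's string alone
    utf8::downgrade($str, 1)        # second argument: fail quietly instead of dying
        or croak "string contains a codepoint above 0xff";
    return $str;
}

my $s = "abc\x80";
utf8::upgrade($s);                  # force the variable-length storage format

my $bytes = bytes_or_croak($s);
print utf8::is_utf8($bytes) ? "upgraded\n" : "downgraded\n";   # downgraded
print length($bytes), "\n";                                    # 4
```

An XS author would do the equivalent check in C; this Perl version only shows the decision logic.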
Re^2: What does utf8::upgrade actually do. by syphilis (Archbishop) on Feb 17, 2021 at 12:34 UTC
"What an XS module does when it wants to process a list of bytes is of course up to the XS module's author"

Ok, but I guess the module author (me) should probably document the procedure that the module takes. (The lack of any such documentation seems to have been a part of ribasushi's objection, and I think that's fair enough.)

For simplicity, let's stick to a single-byte string:

`use Math::GMPz qw(:mpz);
my $z = Math::GMPz->new();
my $v = 255;
my $str = chr($v);
Rmpz_import($z, 1, 1, 1, 0, 0, $str);
print $z;   # prints the value assigned to $v (ie 255).`

But let's say the user instead does a utf8::upgrade of the string, as per the following:

`use Math::GMPz qw(:mpz);
my $z = Math::GMPz->new();
my $v = 255;
my $str = chr($v);
utf8::upgrade($str);
Rmpz_import($z, 1, 1, 1, 0, 0, $str);
print $z;   # now prints 195.`

The crux of the issue is: what do I (the module author) conclude regarding the expectation of the user that wrote that second block of code? As I see it, I have only 3 choices:

a) conclude that the user's expected result is to see an output of "255";
b) conclude that the user's expected result is to see an output of "195";
c) conclude that I have insufficient information to know what output the user expects (except that the user will be expecting either "255" or "195").

Which is the correct conclusion for me to reach? I can accommodate either 'a)', 'b)', or 'c)', and I think the answer is probably 'c)', but I'd just like an informed opinion on that.

Cheers,
Rob
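Where the 195 comes from can be seen without Math::GMPz at all. This sketch compares what Perl considers the string to contain with what sits in the internal buffer after `utf8::upgrade`. The `bytes` pragma is used here purely as a debugging aid to peek at the storage; it is generally discouraged in ordinary code.

```perl
use strict;
use warnings;

my $str = chr(255);
utf8::upgrade($str);

# What Perl sees: a single character with codepoint 255.
my @codepoints = map { ord } split //, $str;

# What code reading the raw buffer sees: the bytes of the utf8 encoding.
my @raw = do {
    use bytes;                                    # byte semantics for length/substr
    map { ord substr($str, $_, 1) } 0 .. length($str) - 1;
};

print "codepoints: @codepoints\n";   # codepoints: 255
print "raw bytes:  @raw\n";          # raw bytes:  195 191
```

A routine told to import one byte from the raw buffer therefore picks up 195 (0xC3), the first byte of the two-byte encoding of U+00FF.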
Re^3: What does utf8::upgrade actually do. by dave_the_m (Monsignor) on Feb 17, 2021 at 13:58 UTC
I would very strongly suggest that the user should expect Rmpz_import() to process the series of base-256 "digits" obtained by (map ord($_), split //, $str), regardless of the internal encoding of the string. So (a) is the correct result. (b) is just horrible, and is repeating the broken Unicode model that appeared in perl 5.6 and was (mostly) fixed by perl 5.8.

Your only real decision needs to be what to do for a codepoint > 0xff. Three obvious choices are: croak; treat each codepoint modulo 256; or carry the overflow into the next digit. With the carry option, the string "\x40\x{150}\x60" would yield the integer value 0x615040. (I haven't looked at what endianness the function works to, but that should give you the general idea of what I mean.)

Dave.
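The three choices for codepoints above 0xff can be sketched in pure Perl. This is an illustration only, not how Rmpz_import works: the function name `digits_to_int` and its `$policy` argument are hypothetical, digits are assumed to be least significant first (the post leaves byte order open), and Math::BigInt stands in for GMP.

```perl
use strict;
use warnings;
use Carp qw( croak );
use Math::BigInt;

# Hypothetical stand-in for an import routine: treat each codepoint of $str as
# a base-256 digit, least significant first. $policy is 'croak', 'modulo' or
# 'carry' and decides what happens to a codepoint above 0xff.
sub digits_to_int {
    my ($str, $policy) = @_;
    my $value = Math::BigInt->new(0);
    my $place = Math::BigInt->new(1);            # 256 ** position
    for my $cp (map { ord } split //, $str) {
        my $digit = $cp;
        if ($digit > 0xff) {
            croak sprintf "codepoint 0x%x does not fit in a byte", $digit
                if $policy eq 'croak';
            $digit %= 256 if $policy eq 'modulo';
            # 'carry': keep the full value; the excess spills into the next digit
        }
        $value += $place * $digit;
        $place *= 256;
    }
    return $value;
}

my $s = "\x40\x{150}\x60";
printf "carry:  %s\n", digits_to_int($s, 'carry')->as_hex;    # carry:  0x615040
printf "modulo: %s\n", digits_to_int($s, 'modulo')->as_hex;   # modulo: 0x605040
```

Because the routine only ever looks at `ord` of each character, an upgraded and a downgraded copy of the same string give the same integer.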
Re^4: What does utf8::upgrade actually do. by syphilis (Archbishop) on Feb 18, 2021 at 14:25 UTC
Re^3: What does utf8::upgrade actually do. by roboticus (Chancellor) on Feb 17, 2021 at 17:20 UTC
syphilis:

I'd expect to see 255, but I wouldn't object to seeing a warning if the SVf_UTF8 flag was set on the input variable. The GMP manual gives enough information for an experienced programmer to see that GMP is expecting a binary vector of fixed-length words to process, and the internal UTF codepoints of perl are clearly not that. So handing a UTF encoded string to that function is at least suspicious.

I think I'd add a chunk to the module's POD to tell users how to handle UTF strings, and make the module issue a warning if it's presented with a UTF string, so they'd be directed to look at that part of the documentation. You might also modify the $order and/or $endian parameters to give a combination that would let them indicate that you should do the decode for them if they see the UTF string.

My reasoning is essentially that the GMP documentation for import clearly indicates that we should be treating the data as a vector of fixed-length words, and UTF encoding is not that. If we see a UTF flag on a string, I'd expect that some conversion happened somewhere (whether intentional or unintentional) such that the oft-assumed¹ bytes == characters assumption does not necessarily hold true.

I often wish that we had a flag on the variables that would let us specify that the buffer holds an exact representation of the bytes that came from the data source, so we could tell when the data was munged. But of course, I have no idea how to define appropriate semantics, as there's no way to get people to agree on the set of cases where we could change the string without turning that flag off (chop, chomp, s///, tr, ....), and/or how to create a string with the flag set appropriately without too much fuss and bother.

Note 1: Sure it's a bad assumption in many contexts, but many perl-mongers (myself included) do much more binary-processing than processing involving unicode.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
Re: What does utf8::upgrade actually do. by Tux (Canon) on Feb 18, 2021 at 09:21 UTC
Also keep in mind that a perl "string" does not need to be a single encoding for all of its content. Think XML and CSV where parts can be real binary and parts can be encoded. Upgrading/downgrading the complete string before processing (either in pure perl or in XS) will cause data-corruption.

One more thing to keep in mind with codepoints is that Unicode allows a lot. e.g. U+1E2F (LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE) can be encoded in UTF-8 as `e1 b8 af`, `c3 af cc 81`, `c3 ad cc 88`, `69 cc 81 cc 88`, or `69 cc 88 cc 81`, all representing the same glyph. At the moment of writing, perl does not alter any of that, but when Unicode Normalization rules would apply on a semantic level, your world view changes:

(update: I meanwhile learned that the order of the diacriticals can be meaningful, which is why two of the examples below do not normalize to U+1E2F)

#!/usr/bin/perl

use 5.18.2;
use warnings;

use Data::Peek;
use Unicode::Normalize qw( normalize );
use Encode qw( encode decode );
use charnames qw(:full);

sub dp {
    my ($tag, $dta) = @_;
    my $dp = DPeek ($dta);
    printf "%-6s: %-52s", $tag, $dp =~ s{^(\S+)\K}{" " x (26 - length $1)}er;
    utf8::is_utf8 ($dta) and
        print join " + " => map { charnames::viacode (ord) } split // => $dta;
    say "";
    } # dp

$| = 1;
foreach my $bytes (
        "\xe1\xb8\xaf",
        "\xc3\xaf\xcc\x81",
        "\xc3\xad\xcc\x88",
        "\x69\xcc\x81\xcc\x88",
        "\x69\xcc\x88\xcc\x81",
        ) {
    my $u = decode ("utf-8", $bytes);
    dp ("Bytes", $bytes);
    dp ("UTF-8", $u);
    dp ("NF$_", normalize ($_, $u)) for qw( D C KD KC );
    say "";
    }

->

Bytes : PV("\341\270\257"\0)
UTF-8 : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Bytes : PV("\303\257\314\201"\0)
UTF-8 : PV("\303\257\314\201"\0)   [UTF8 "\x{ef}\x{301}"]   LATIN SMALL LETTER I WITH DIAERESIS + COMBINING ACUTE ACCENT
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Bytes : PV("\303\255\314\210"\0)
UTF-8 : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFD   : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFC   : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFKD  : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFKC  : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

Bytes : PV("i\314\201\314\210"\0)
UTF-8 : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFD   : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFC   : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFKD  : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFKC  : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

Bytes : PV("i\314\210\314\201"\0)
UTF-8 : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Enjoy, Have FUN! H.Merijn
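The point about the order of the diacriticals can be checked with a much smaller script that uses only the core Unicode::Normalize module. This is a reduced sketch of the same comparison, with the variable names chosen here for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

my $precomposed     = "\x{1e2f}";          # LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
my $diaeresis_first = "i\x{308}\x{301}";   # i + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
my $acute_first     = "i\x{301}\x{308}";   # i + COMBINING ACUTE ACCENT + COMBINING DIAERESIS

# Composing works only when the marks are in the order the precomposed
# character decomposes to.
print NFC($diaeresis_first) eq $precomposed     ? "same\n" : "different\n";   # same
print NFC($acute_first)     eq $precomposed     ? "same\n" : "different\n";   # different
print NFD($precomposed)     eq $diaeresis_first ? "same\n" : "different\n";   # same
```

So any code that compares or counts codepoints has to decide whether it works on the string as given or on a normalized form.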
Re: What does utf8::upgrade actually do. by ikegami (Patriarch) on Feb 19, 2021 at 18:56 UTC
Perl has three internal storage formats for numbers: signed integer, unsigned integer and floating point number. Similarly, Perl has two internal storage formats for strings (described below). `utf8::is_utf8` identifies the format used, and `utf8::upgrade` and `utf8::downgrade` convert how a string is stored internally.

`use feature qw( say );
use Devel::Peek qw( Dump );

my $s = chr(0xE9);
say length($s);               # 1
say $s eq "\xE9" ?1:0;        # 1
say utf8::is_utf8($s) ?1:0;   # 0
Dump($s);                     # PV contains E9

utf8::upgrade($s);
say length($s);               # 1  The string hasn't changed
say $s eq "\xE9" ?1:0;        # 1
say utf8::is_utf8($s) ?1:0;   # 1  But it's now stored differently.
Dump($s);                     # PV contains C3 A9

utf8::downgrade($s);
say length($s);               # 1
say $s eq "\xE9" ?1:0;        # 1
say utf8::is_utf8($s) ?1:0;   # 0
Dump($s);                     # PV contains E9`

"Downgraded" format

Identified by the `SVf_UTF8` flag (returned by `utf8::is_utf8($sv)` in Perl and `SvUTF8(sv)` in C) being clear.

Each character (string element) is capable of storing an 8-bit value. Great for bytes. Not so good for text.

Each character is stored as a single byte. This allows very efficient access of arbitrary characters and very efficient access of the length of the string (both O(1)).

"Upgraded" format

Identified by the `SVf_UTF8` flag (returned by `utf8::is_utf8($sv)` in Perl and `SvUTF8(sv)` in C) being set.

Each character (string element) is capable of storing a 72-bit value (in theory), a 64-bit value (on builds with `uvsize` of 8) or a 32-bit value (on builds with `uvsize` of 4). This is more than enough to store any Unicode Code Point.

Each character is stored as its utf8 encoding. utf8 is a proprietary extension of UTF-8. As a variable-length encoding, both accessing arbitrary characters and accessing the length of the string are very inefficient (O(N)), though Perl does attach the length of the string to the scalar when it becomes known, and it even attaches some character positions in some situations.
The Unicode Bug

Notice how I didn't say format X is used to store Y. That's because Perl imparts no semantics on the choice of storage format. Just like three stored as a signed integer and three stored as a floating point number both refer to the same number, strings consisting of the same characters but stored in different formats are still considered the same string (i.e. `eq` will return true).

However, some code (particularly XS modules, but even some builtin operators) intentionally or inadvertently imparts meaning on the choice of internal storage format of strings. Code that does this is said to be suffering from The Unicode Bug. `utf8::upgrade` and `utf8::downgrade` are useful when working with such buggy code.

`Rmpz_import` is such a function. Without knowing the details, switching to `SvPVbyte` is a sensible solution. (This would mean you can't receive strings with characters larger than 255, though.) Other options include upgrading the string (`SvPVutf8`) and handling both formats (by checking `SvUTF8(sv)`).

Seeking work! You can reach me at ikegami@adaelis.com
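One way to use `utf8::upgrade` and `utf8::downgrade` for spotting this kind of bug is to feed the same string to a function in both storage formats and compare the results. The sketch below is only an illustration: `same_for_both_formats` is a hypothetical helper, and the "buggy" function is deliberately contrived with the `bytes` pragma rather than taken from any real module.

```perl
use strict;
use warnings;

# Hypothetical checker: call $code with the same string stored both ways and
# report whether the two results agree.
sub same_for_both_formats {
    my ($code, $str) = @_;
    my $down = $str;
    utf8::downgrade($down, 1) or return 1;   # has a codepoint > 0xff: nothing to compare
    my $up = $str;
    utf8::upgrade($up);
    return $code->($down) eq $code->($up);
}

# Deliberately buggy: measures the internal buffer instead of the characters.
my $buggy   = sub { my ($s) = @_; use bytes; length($s) };
# Well behaved: only looks at the characters.
my $correct = sub { my ($s) = @_; length($s) };

print same_for_both_formats($buggy,   "\xE9") ? "ok\n" : "Unicode Bug\n";   # Unicode Bug
print same_for_both_formats($correct, "\xE9") ? "ok\n" : "Unicode Bug\n";   # ok
```

The same two-format comparison could be wrapped around a call into an XS function when testing a module.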
Re^2: What does utf8::upgrade actually do. by ikegami (Patriarch) on Feb 19, 2021 at 19:27 UTC
Added to my answer (parent post). In particular, tied it back to `Rmpz_import`.