[SOLVED] Unicode strings internals

vsespb has asked for the wisdom of the Perl Monks concerning the following question:

I have two example files

poc1.pl:

use strict;
use warnings;
use Devel::Peek;
use Encode;
use utf8;

my $string = "123\x{444}\x{444}\x{444}\x{444}";
binmode STDOUT, ":utf8";

Dump $string;
print "UTF IS ON\n" if utf8::is_utf8($string);
print "LENGTH DIFFERS\n" if length($string) != bytes::length($string);

open my $f, ">", "test1";
binmode $f;
syswrite $f, $string or die;
print "ALL OK\n";
__END__
SV = PV(0x258cb78) at 0x25b7bb0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x25aad60 "123\321\204\321\204\321\204\321\204"\0 [UTF8 "123\x{444}\x{444}\x{444}\x{444}"]
  CUR = 11
  LEN = 16
UTF IS ON
LENGTH DIFFERS
Wide character in syswrite at poc1.pl line 16.

poc2.pl:

use strict;
use warnings;
use Devel::Peek;
use Encode;
use utf8;

my $utfstring = "123 \x{439}\x{439}\x{439}\x{439}";
my ($ascii_but_utf, undef) = split ' ', $utfstring;

my $bytestring = encode ("UTF-8", "\x{444}\x{444}\x{444}\x{444}");
my $mixedstring = "$ascii_but_utf$bytestring"; # simulate The Unicode Bug here
binmode STDOUT, ":utf8";

Dump $mixedstring;
print "UTF IS ON\n" if utf8::is_utf8($mixedstring);
print "LENGTH DIFFERS\n" if length($mixedstring) != bytes::length($mixedstring);

open my $f, ">", "test2";
binmode $f;
syswrite $f, $mixedstring or die;
print "ALL OK\n";

__END__
SV = PV(0x1d6eb48) at 0x1c7fab8
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1d79820 "123\303\221\302\204\303\221\302\204\303\221\302\204\303\221\302\204"\0 [UTF8 "123\x{d1}\x{84}\x{d1}\x{84}\x{d1}\x{84}\x{d1}\x{84}"]
  CUR = 19
  LEN = 24
UTF IS ON
LENGTH DIFFERS
ALL OK

After the __END__ of each file I have appended the program's output.

Let's ignore for the moment the fact that those strings are completely different and contain different characters, and the fact that at some point one of the strings was interpreted as Latin-1, etc.

So, in both cases the strings have the UTF-8 flag set. They both contain non-ASCII (high-bit) octets. For both, length() and bytes::length() differ. So I expect those strings to behave the same way.

The question is: why was the string treated as a 'wide character' string in one case, so that syswrite terminated the program, while in the other case everything worked fine?

P.S. Reproduced on perl 5.10 and perl 5.14 (Linux).

UPD: escaped the UTF characters in the source code, as PerlMonks eats them.

UPD: SOLVED: http://www.perlmonks.org/?node_id=1032996 http://www.perlmonks.org/?node_id=1033006


Replies are listed 'Best First'.
Re: Unicode strings internals by kennethk (Abbot) on May 10, 2013 at 15:49 UTC
If I'm reading your code correctly, the issue is that in your first case you have a properly formed Perl character string that contains wide characters, but in the second you have a UTF-8 byte string, not a character string. The difference is discussed a bit in perluniintro and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Because you explicitly invoke `encode ("UTF-8", ...`, the mixed string contains bytes with the high bit set, but no characters above 0xFF. Outputting a byte string as binary is natural, but a Perl string that contains wide characters has no byte mapping unless you specify an encoding.

Does this clarify? If you describe the task you are trying to accomplish, we can probably help with the appropriate set of I/O specifications.

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
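As a rough illustration of the point about specifying an encoding, here is a minimal sketch that encodes a character string to octets before handing it to syswrite. The temporary file and the variable names are assumptions made for the example, not part of the original poc1.pl.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# A character string: three ASCII characters plus four U+0444 characters.
my $chars = "123\x{444}\x{444}\x{444}\x{444}";

# The same text as UTF-8 octets; every element is now in the range 0..255.
my $octets = encode("UTF-8", $chars);

printf "characters: %d\n", length $chars;
printf "octets:     %d\n", length $octets;

# Write the octets to a raw handle. Passing $chars here instead would
# trigger the "Wide character in syswrite" error from poc1.pl.
my ($fh, $path) = tempfile(UNLINK => 1);   # illustrative scratch file
binmode $fh;
my $written = syswrite($fh, $octets);
defined $written or die "syswrite failed: $!";
print "bytes written: $written\n";
```

The character string has 7 characters and the encoded form has 11 octets, which matches the CUR = 11 shown in the Dump output for poc1.pl.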
Re^2: Unicode strings internals by vsespb (Chaplain) on May 10, 2013 at 17:03 UTC
"in your first case you have a properly formatted Perl string that contains UTF characters"

Yes

"but in the second you have a UTF-8 byte string, not a character string."

No. The second case does look like a UTF-8 character string, because it prints "UTF IS ON" and "LENGTH DIFFERS".
Re^3: Unicode strings internals by kennethk (Abbot) on May 10, 2013 at 17:39 UTC
Note that if you modify line 8 to

my $ascii_but_utf = '123';

the output changes to

SV = PV(0x22ae1d0) at 0x2300b20
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x22d21f0 "123\321\204\321\204\321\204\321\204"\0
  CUR = 11
  LEN = 16
ALL OK

This is because "UTF IS ON" is just a historical artifact of your initialization. If we take a look at the two output files generated by these two cases, you'll note that both contain 11 bytes, despite the fact that a byte dump of the UTF-upgraded case would have output 19 bytes. This is because the internal representation of high-bit, 1-byte characters under Perl's implementation of UTF is multi-byte, even though they cleanly map to 1-byte characters on output. You wouldn't expect these 1-byte characters to output a wide-character warning any more than you'd expect an ASCII character to.

"Second case does look like UTF-8 character string,"

You're thinking of that wrong; it could be a UTF character string, or a UTF-8 byte string. When dealing with non-ASCII characters in Perl, rare is the case when you should actually be thinking about Perl's internal representation.

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
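One way to see the distinction described above is to ask whether a string can be downgraded to one byte per character. The following is a minimal sketch, not the thread's accepted answer: `fits_in_bytes` is a hypothetical helper name, and `utf8::upgrade` stands in for the concatenation in poc2.pl that turned the flag on.

```perl
use strict;
use warnings;
use bytes ();                 # loads bytes::length without enabling byte semantics
use Encode qw(encode);

# Hypothetical helper: true if every character has a code point below 256,
# so the string can be written to a raw handle one byte per character.
sub fits_in_bytes {
    my ($string) = @_;        # $string is a copy; utf8::downgrade works in place
    return utf8::downgrade($string, 1) ? 1 : 0;   # 1 = return false instead of dying
}

# As in poc1.pl: four characters with code point 0x444.
my $wide = "123\x{444}\x{444}\x{444}\x{444}";

# As in poc2.pl: eight characters in the range 0x80..0xFF, stored with the UTF8 flag on.
my $mixed = "123" . encode("UTF-8", "\x{444}\x{444}\x{444}\x{444}");
utf8::upgrade($mixed);        # stands in for the concatenation with a flagged string

for my $case ([wide => $wide], [mixed => $mixed]) {
    my ($name, $string) = @$case;
    printf "%-5s flag=%d length=%2d bytes::length=%2d fits_in_bytes=%d\n",
        $name, utf8::is_utf8($string) ? 1 : 0,
        length($string), bytes::length($string), fits_in_bytes($string);
}
```

Both strings report the flag as on and a length that differs from bytes::length, yet only the second can be downgraded, which is consistent with syswrite accepting it and writing 11 bytes.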
Re^4: Unicode strings internals by vsespb (Chaplain) on May 10, 2013 at 18:49 UTC
Re^5: Unicode strings internals by kennethk (Abbot) on May 10, 2013 at 21:06 UTC
Re: Unicode strings internals by Krambambuli (Curate) on May 10, 2013 at 16:49 UTC
Have a look at the results of

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Peek;
use Encode;
#use utf8;

binmode STDOUT, ":utf8";

my $string1 = "123\x{444}\x{444}\x{444}\x{444}";
_display ($string1, 'STRING1' );

my $utfstring = "123 \x{439}\x{439}\x{439}\x{439}";
_display ($utfstring, 'UTF_STRING' );

my ($ascii_but_utf, undef) = split ' ', $utfstring;
_display ($ascii_but_utf, 'ASCII_BUT_UTF' );

#my $bytestring = encode ("UTF-8", "\x{444}\x{444}\x{444}\x{444}");
my $bytestring = "\x{444}\x{444}\x{444}\x{444}";
_display ($bytestring, 'BYTESTRING' );

my $mixedstring = "$ascii_but_utf$bytestring"; # simulate The Unicode Bug here
_display ($mixedstring, 'MIXEDSTRING' );

print "MIXEDSTRING and STRING1 are supposed to be identical...\n";

exit;

###############
sub _display {
    my ($string, $name) = @_;

    print "$name:\n";
    Dump $string;
    my $l1 = length($string);
    my $l2 = bytes::length($string);
    if ($l1 != $l2) {
        print "LENGTHs DIFFERS: length: $l1, bytes: $l2\n"
    }
    print "UTF IS ON\n" if utf8::is_utf8($string);
    print "\n";
}

and then check the difference you see for BYTESTRING when running

my $bytestring = encode ("UTF-8", "\x{444}\x{444}\x{444}\x{444}");

versus

my $bytestring = "\x{444}\x{444}\x{444}\x{444}";

The Encode documentation has a caveat about it:

CAVEAT: When you run "$octets = encode("utf8", $string)", then $octets might not be equal to $string. Though both contain the same data, the UTF8 flag for $octets is always off. When you encode anything, the UTF8 flag on the result is always off, even when it contains a completely valid utf8 string. See "The UTF8 flag" below.
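The caveat quoted above, and what happens when such octets are joined to a flagged string, can be checked with a short sketch like the one below. The variable names are illustrative, and `utf8::upgrade` is used here only as an assumed stand-in for the `split` on a flagged string in poc2.pl.

```perl
use strict;
use warnings;
use Encode qw(encode);

# encode() returns octets; the UTF8 flag on the result is off.
my $octets = encode("UTF-8", "\x{444}");    # two octets: 0xD1 0x84
print "flag after encode: ", (utf8::is_utf8($octets) ? "on" : "off"), "\n";

# An ASCII-only string that nevertheless carries the UTF8 flag.
my $flagged_ascii = "123";
utf8::upgrade($flagged_ascii);

# Concatenation upgrades the octets, treating each one as a single
# character in the range 0x80..0xFF rather than decoding them as UTF-8.
my $joined = $flagged_ascii . $octets;
print "flag after concat: ", (utf8::is_utf8($joined) ? "on" : "off"), "\n";
print "length: ", length($joined), "\n";

# The characters are unchanged: 0xD1 and 0x84, not U+0444.
my @codes = map { sprintf("%02X", ord($_)) } split //, $joined;
print "code points: @codes\n";
```

The joined string ends up flagged, but its last two characters are still 0xD1 and 0x84, which is the same situation as the `[UTF8 "123\x{d1}\x{84}..."]` shown in the Dump output for poc2.pl.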
Re^2: Unicode strings internals by vsespb (Chaplain) on May 10, 2013 at 17:12 UTC
Yes, I understand that the result of encode("utf8", ... ) is a byte string with the UTF-8 flag off. But that does not answer the question in my post. In my example, both poc1.pl and poc2.pl print strings with the UTF-8 flag on and with length() != bytes::length(), but those strings behave differently. Why?

