PerlMonks  

Re^4: performance of length() in utf-8

by seki (Monk)
on Mar 11, 2016 at 13:23 UTC [id://1157430]


in reply to Re^3: performance of length() in utf-8
in thread performance of length() in utf-8

Here is a quite long answer to try to be specific on my understanding of that case...
Does that help?
In some way, but not completely. :op
I am quite familiar with encodings (at least iso-8859-1 & 15, Win1252, "DOS" 437 & 850, utf-8 and utf-16), but I have not figured out the data flow in Perl yet.

I think I did not get what part of the "magic" is done
  • at the (windows CMD) terminal level
  • by the xml parsing / decoding (if any?)
  • at the Perl internal level

    chcp
    Active code page: 1252

    perl -e "print chr 199"
    Ç

    perl -e "print join ' ', map {sprintf '%02x', $_} unpack 'C*', chr 199"
    c7
I am in Win1252 and the code 199 (= 0xc7) corresponds to the upper-case c-cedilla character. Okay.
    perl -MEncode -e "print Encode::encode_utf8 chr 199"
    Ç

    perl -MEncode -e "print join ' ', map {sprintf '%02x', $_} unpack 'C*', Encode::encode_utf8 chr 199"
    c3 87
So if I encode the byte 199 to utf-8 (I seem to understand "from the current console codepage"), I get the values c3 87 that correspond to the U+00C7 unicode "LATIN CAPITAL LETTER C WITH CEDILLA". I still follow.
    perl -MEncode -e "print Encode::decode_utf8 \"\xc3\x87\""
    Ç
If I decode a raw "c3 87" I get back my "Ç", so everything is how I suppose it to be.
Now, your part:
Encoding can be a challenge to get one's head around. When you read the strings in from your XML parsing, Perl pulls them in as a series of UTF-8 characters, and the string that contains them has the UTF-8 flag set to true. In order to determine the length of the string, each byte must be queried to figure out how many characters are represented, thus the slow length.
Well... Not sure: Here is a simple utf8-1.xml file:
    <?xml version="1.0" encoding="utf-8"?>
    <root>Ç foo</root>
(to be sure, if hex-editing the file, we see actually C3 87 in the place of the char 199)
With a little sax parser:
    use strict;
    use warnings;
    use feature 'say';
    #~ use utf8;
    use XML::SAX::ParserFactory;

    $|++;

    #to force one kind of parser for ParserFactory->parser()
    #~ $XML::SAX::ParserPackage = "XML::SAX::PurePerl";
    #~ $XML::SAX::ParserPackage = "XML::SAX::Expat"; #no xml_decl
    #~ $XML::SAX::ParserPackage = "XML::SAX::ExpatXS";
    #~ $XML::SAX::ParserPackage = "XML::LibXML::SAX";
    $XML::SAX::ParserPackage = "XML::LibXML::SAX::Parser";

    {
        package MySax;
        use feature 'say';
        use Devel::Peek;

        sub new {
            my $class = shift;
            return bless {}, $class;
        }

        sub hexprint {
            my ($self, $data) = @_;
            join ' ', map { sprintf '%02X', $_ } unpack 'C*', $data;
        }

        sub characters {
            my ($self, $data) = @_;
            my $content = $data->{Data};
            say "characters for elt: ". $content;
            say "bytes for elt: ". $self->hexprint($content);
            Dump($content);
        }
    }

    my $handler = new MySax;
    my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
    say "parser is " . ref $parser;
    say "file: " . $ARGV[0] if $ARGV[0];
    $parser->parse_file($ARGV[0] // *DATA);

    __DATA__
    <empty/>
I can see:
    perl sax_utf.pl utf8-1.xml
    parser is XML::LibXML::SAX::Parser
    file: utf8-1.xml
    characters for elt: Ç foo
    bytes for elt: C7 20 66 6F 6F
    SV = PV(0x288c658) at 0x233d2e8
      REFCNT = 1
      FLAGS = (PADMY,POK,IsCOW,pPOK,UTF8)
      PV = 0x2b28228 "\303\207 foo"\0 [UTF8 "\x{c7} foo"]
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
Can I assume the following:
  • the 199 / 0xC7 character was decoded by libxml, as I see that its byte is "C7"
  • but the string is flagged as utf-8?
  • and internally, the byte flow is actually some utf-8, as shown by the \303\207 octal values = C3 87 (an unusual notation, except in my Emacs editor)

So 1) I can't understand the difference between the unpack and Devel::Peek dumps,
and 2) I cannot see why one would do the following:
Invoking Encode::encode_utf8($data) returns the UTF-8 string transformed into the equivalent byte stream. Essentially, from Perl's perspective, it breaks the logical connection between the bytes, and leaves it as some combination of high bit and low bit characters. Now, since every record in the string is exactly 1 byte wide, the byte count requires no introspection.
If the string is already in utf-8, why process it with encode_utf8?
If I patch the sub characters like this:
    sub characters {
        use Encode;
        my ($self, $data) = @_;
        my $content = Encode::encode_utf8 $data->{Data};
        say "characters for elt: ". $content;
        say "bytes for elt: ". $self->hexprint($content);
        Dump($content);
    }

Now I see (still in a Windows console in cp1252) :
    characters for elt: Ç foo
    bytes for elt: C3 87 20 66 6F 6F
    SV = PV(0x28ba328) at 0x236d2b8
      REFCNT = 1
      FLAGS = (PADMY,POK,IsCOW,pPOK)
      PV = 0x2b548b8 "\303\207 foo"\0
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
So unpacking the string shows the expected C3 87 bytes for the char 199, confirmed by the octal dump, but the UTF8 flag has vanished? I'm puzzled!

Now an additional challenge: I make a copy of the first xml, to add the euro sign into the data ("Ç foo €") so the hex-editing of the file shows C3 87 20 66 6F 6F 20 E2 82 AC.
Without forcing the string through encode_utf8, it shows this in the console:
    parser is XML::LibXML::SAX::Parser
    file: utf8-2.xml
    Wide character in say at sax_utf.pl line 36.
    characters for elt: Ç foo €
    bytes for elt: C7 20 66 6F 6F 20 20AC
    SV = PV(0x2a61748) at 0x250ade8
      REFCNT = 1
      FLAGS = (PADMY,POK,IsCOW,pPOK,UTF8)
      PV = 0x2cefc98 "\303\207 foo \342\202\254"\0 [UTF8 "\x{c7} foo \x{20ac}"]
      CUR = 10
      LEN = 12
      COW_REFCNT = 1
Now I am not sure of the byte representation:
  • it could be some Win1252, for the C7, but the euro char is 80 in 1252, while the 20AC seems to be the U+20AC unicode char and not the E2 82 AC utf-8, and why 20AC while unpack should show bytes?
  • the "Ç foo" part is not displayed identically with that additional character

Forcing the data with encode_utf8 seems less surprising:
    parser is XML::LibXML::SAX::Parser
    file: utf8-2.xml
    characters for elt: Ç foo €
    bytes for elt: C3 87 20 66 6F 6F 20 E2 82 AC
    SV = PV(0x2991768) at 0x243ade8
      REFCNT = 1
      FLAGS = (PADMY,POK,IsCOW,pPOK)
      PV = 0x2c1fc98 "\303\207 foo \342\202\254"\0
      CUR = 10
      LEN = 12
      COW_REFCNT = 1
While I still do not understand the missing UTF8 flag...

Replies are listed 'Best First'.
Re^5: performance of length() in utf-8
by kennethk (Abbot) on Mar 11, 2016 at 21:40 UTC
    So, as with everything in Perl, operating in a Microsoft context complicates things. I also was a little technically sloppy with my description for the sake of some simplified high order concept, which I should know is just a recipe for confusion. So I apologize.

    A read through of Unicode Support in perlguts as well as perluniintro, perlunitut, and perlunicode might be helpful for further clarifications.

    If Perl always used UTF-8 for internal operation, things would be slow (as per the OP). So for strings that are representable via the system's codepage, it doesn't. Specifically (from perluniintro):

    Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl uses the native eight-bit character set. Otherwise, it uses UTF-8.
    So until Perl encounters a reason, it will not flip the UTF8 flag you are seeing via Devel::Peek. In your scenario, your XML parser sees the UTF-8 encoding declared at the top of the file, and so the flag gets set. Note that if you run
        perl -MDevel::Peek -E "Dump chr 199"
    you get something like
        SV = PV(0x15ceba8) at 0x15ed468
          REFCNT = 1
          FLAGS = (PADTMP,POK,READONLY,pPOK)
          PV = 0x16359c0 "\307"\0
          CUR = 1
          LEN = 12
    The UTF8 flag is not set, and the same character is being stored according to the local code page.
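
A minimal sketch of the same point without Devel::Peek, assuming only the core utf8:: functions (the variable names are purely illustrative): the flag describes how Perl stores the string internally, not what the string means, so flipping the storage with utf8::upgrade should leave the length and the comparison unchanged.

```perl
use strict;
use warnings;

# chr 199 starts life as a one-byte string in the native 8-bit set (flag off).
my $native = chr 199;

# Copy it and switch the copy to the internal UTF-8 storage (flag on).
# utf8::upgrade changes the representation only, never the characters.
my $upgraded = $native;
utf8::upgrade($upgraded);

printf "native:   flag=%d length=%d\n",
    (utf8::is_utf8($native)   ? 1 : 0), length($native);
printf "upgraded: flag=%d length=%d\n",
    (utf8::is_utf8($upgraded) ? 1 : 0), length($upgraded);

# Both hold the single character U+00C7, so they compare equal.
print "same string\n" if $native eq $upgraded;
```

Both strings are one character long and compare equal; only the flag differs.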

    The thing that seems to be missing from your thinking is serialization. When you feed a string through encode_utf8, you are saying take this logical object, and encode it for communication via a channel that expects UTF-8, much like you might have a channel that expects JSON or a channel that expects little-endian. The resultant bit stream is the encoded stream, and none of the characters it contains are UTF-8 - logically, it contains no wide characters, though it may have a number of high-bit characters. You need to decode the stream in order for it to make sense logically. Now, if Perl is hooked up to a UTF-8 terminal, it'll look right, and if it's hooked up to a 1252 terminal, you'll get junk.
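
To make the serialization idea concrete, here is a small sketch using the two characters from the test files (U+00C7 and U+20AC). The variable names are made up for the illustration; encode_utf8 and decode_utf8 are the functions from the Encode module already used in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# The logical string: two characters, C-cedilla and the euro sign.
my $chars = "\x{c7}\x{20ac}";

# Serialize it for a channel that expects UTF-8: the result is a plain
# byte string in which every element is one octet.
my $octets = encode_utf8($chars);

printf "characters: %d\n", length($chars);    # counts code points
printf "octets:     %d\n", length($octets);   # counts bytes
printf "hex:        %s\n",
    join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;

# Decoding the byte stream gives back the logical string.
my $back = decode_utf8($octets);
print "round trip ok\n" if $back eq $chars;
```

The encoded value has no UTF8 flag because, as far as Perl is concerned, it is just five octets (C3 87 E2 82 AC); it only becomes two characters again once it is decoded.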

    Hopefully this helps?


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      So, as with everything in Perl, operating in a Microsoft context complicates things.

      s/Perl/life/;
