PerlMonks |
Encoding can be a challenge to get one's head around. When you read the strings in from your XML parsing, Perl stores them as character strings: each string holds UTF-8 encoded characters and has its UTF-8 flag set. To determine the length of such a string, Perl has to walk the bytes to work out how many characters they represent, hence the slow length.

Invoking Encode::encode_utf8($data) returns the UTF-8 string transformed into the equivalent byte stream. Essentially, from Perl's perspective, it breaks the logical connection between the bytes and leaves the string as some combination of high-bit and low-bit single-byte characters. Now, since every character in the string is exactly 1 byte wide, the byte count requires no introspection.

So: for a single character that takes two bytes in UTF-8, length of the character string outputs 1, while length of the encoded string outputs 2. Similarly, if you print the ordinal values of the character string you output 199, while the encoded string outputs 195, 135. However, if your terminal is set to display UTF-8, printing both of those strings will typically look the same, because the underlying series of bytes is unaffected. Does that help?

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

In reply to Re^3: performance of length() in utf-8
by kennethk
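
The snippets behind the "outputs 1 ... outputs 2" and "199 ... 195, 135" comparison did not survive in this copy of the post. As a rough sketch of what that comparison could look like, assuming the character involved was U+00C7 (code point 199, which matches the numbers quoted): the variable names are made up for illustration, and utf8::upgrade is used only to mimic a string that arrived from a parser with its UTF-8 flag on.

```perl
use strict;
use warnings;
use Encode ();

# Assumption: code point 199 (U+00C7) stands in for the character used in
# the original example; the original code is not available.
my $chars = "\x{C7}";
utf8::upgrade($chars);    # same text, but stored internally as UTF-8 (flag on)

# encode_utf8 returns a new string of bytes; $chars itself is left unchanged.
my $bytes = Encode::encode_utf8($chars);

# Ordinal value of every element in each string.
my @char_codes = map { ord($_) } split //, $chars;
my @byte_codes = map { ord($_) } split //, $bytes;

print length($chars), "\n";             # 1
print length($bytes), "\n";             # 2
print join(', ', @char_codes), "\n";    # 199
print join(', ', @byte_codes), "\n";    # 195, 135

# The flag is what tells Perl whether it must decode while counting.
print utf8::is_utf8($chars) ? "flag on" : "flag off", "\n";    # flag on
print utf8::is_utf8($bytes) ? "flag on" : "flag off", "\n";    # flag off
```

The encoded copy is only appropriate where you really want bytes (byte counts, writing to a raw filehandle); for text operations such as regexes or substr you would normally keep working with the character string.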