
Encoding can be a challenge to get one's head around. When you read the strings in from your XML parsing, Perl stores them as sequences of characters held internally as UTF-8, and each such string has its UTF-8 flag set. Because a single character may occupy more than one byte, determining the length of the string means scanning its bytes to work out how many characters they represent, hence the slow length.

Invoking Encode::encode_utf8($data) returns the character string transformed into the equivalent byte string. Essentially, from Perl's perspective, it breaks the logical connection between the bytes that made up each multi-byte character, and leaves a string in which every byte, whether its high bit is set or not, counts as a character in its own right. Now, since every element of the string is exactly 1 byte wide, the count requires no introspection; note that it is a count of bytes rather than of the original characters.

So:

print length chr 199;

outputs 1 while

use Encode; 
print length Encode::encode_utf8(chr 199);

outputs 2. Similarly, if you run

use feature 'say';
say join ",", map ord, split //, chr 199;

you output 199, while

use feature 'say';
use Encode;
say join ",", map ord, split //, Encode::encode_utf8(chr 199);

outputs 195,135.
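To tie the flag, the character count and the byte count together, here is a minimal sketch. The sample string and the helper name describe are my own choices for illustration, not anything from your code:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: returns [ utf8_flag, length ] for a string.
sub describe {
    my ($str) = @_;
    return [ utf8::is_utf8($str) ? 1 : 0, length $str ];
}

# Six characters; the last one (U+263A) is above 255, so Perl
# must store this string with the UTF-8 flag on.
my $chars = "caf\x{e9} \x{263A}";

# The same text as a plain byte string.
my $bytes = encode_utf8($chars);

printf "chars: flag=%d length=%d\n", @{ describe($chars) };   # flag=1 length=6
printf "bytes: flag=%d length=%d\n", @{ describe($bytes) };   # flag=0 length=9

# decode_utf8 goes the other way and restores the character string.
print "round trip ok\n" if decode_utf8($bytes) eq $chars;
```

The byte string is longer (9 versus 6) because U+00E9 takes two bytes and U+263A takes three, which is exactly the bookkeeping length has to do on the flagged string.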

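However, if your terminal is set to display UTF-8, the encoded string printed as-is and the character string printed through a UTF-8 output layer (for example after binmode STDOUT, ':encoding(UTF-8)') will look the same, because the same series of bytes reaches the terminal. Without such a layer, chr 199 is written as the single byte 0xC7; only strings holding characters above 255 fall back to their internal UTF-8 bytes, with a "Wide character" warning.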
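If you want to check which bytes are really written, one way is to print to an in-memory filehandle and dump the result. This is only a sketch: written_bytes is a hypothetical helper, and the in-memory handle stands in for your terminal.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: print $str to an in-memory filehandle opened
# with $mode and return the bytes written as space-separated hex.
sub written_bytes {
    my ($mode, $str) = @_;
    my $buf = '';
    open my $fh, $mode, \$buf or die "open failed: $!";
    print {$fh} $str;
    close $fh;
    return join ' ', map { sprintf('%02X', ord $_) } split //, $buf;
}

my $char = chr 199;

# Character string, no encoding layer on the handle.
print "chars, no layer:    ", written_bytes('>', $char), "\n";                  # C7

# Character string through a UTF-8 output layer.
print "chars, UTF-8 layer: ", written_bytes('>:encoding(UTF-8)', $char), "\n";  # C3 87

# Already-encoded byte string, no encoding layer.
print "bytes, no layer:    ", written_bytes('>', encode_utf8($char)), "\n";     # C3 87
```

The last two lines agree, so a UTF-8 terminal shows the same glyph for both; pushing the already-encoded string through the layer as well would encode it a second time.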

Does that help?

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

In reply to Re^3: performance of length() in utf-8 by kennethk
in thread performance of length() in utf-8 by seki
