
Many thanks for your valuable answer. I reproduced the same performance gains on my system, though I do not yet grasp why.

I was told that, since UTF-8 string handling was integrated natively into the Perl core, every string carries an internal flag telling whether it is UTF-8 or not.
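A minimal sketch of how that flag can be inspected, assuming the core `Encode` module and the built-in `utf8::is_utf8` function; the sample string and variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw octets, as they would come from a file opened without an
# encoding layer: "caf" followed by the two UTF-8 bytes of e-acute.
my $bytes = "caf\xC3\xA9";

# Decoding turns the octets into a character string.
my $chars = decode('UTF-8', $bytes);

# utf8::is_utf8 reports the internal flag (no "use utf8" needed).
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;
my $chars_flag = utf8::is_utf8($chars) ? 1 : 0;

print "octet string:     flag=$bytes_flag\n";
print "character string: flag=$chars_flag\n";
```

Running something like this on a string just received from the SAX parser would show whether it arrives decoded or not.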

When parsing an XML file declared with encoding="utf-8", are the strings delivered by the XML SAX parser not already in UTF-8? (I did not notice, because I do not display the processed data; if a string is handed over undecoded, I guess it is written out as-is, but I should double-check that.)
If I understand correctly, the SAX writer may query the data size many times, so the overhead of encoding the data is offset by the many calls to a faster length().
But I do not see why length() would behave differently depending on the string encoding: if the string is not UTF-8, length() should return a byte count (1 byte per character), whereas for UTF-8 each byte must be examined to tell whether it is a single-byte character, the leading byte of a multi-byte character, or a continuation byte. I would have thought that processing a UTF-8 string performs worse than processing a plain one...
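To make the two meanings of length() concrete, here is a small sketch under the same assumptions (core `Encode` module, made-up sample string). It only shows that the result differs between an octet string and a decoded string; it does not measure speed:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Five octets: c, a, f, then the two-byte UTF-8 sequence for e-acute.
my $bytes = "caf\xC3\xA9";

# The same data decoded into four characters.
my $chars = decode('UTF-8', $bytes);

# On an octet string, length() counts bytes;
# on a decoded string, it counts characters.
my $byte_len = length($bytes);
my $char_len = length($chars);

print "length of octet string:     $byte_len\n";   # 5
print "length of character string: $char_len\n";   # 4
```

The timing difference itself could be checked with the core `Benchmark` module on real data from the parser.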

Note that your print/tell solution did the same kind of accounting, reporting bytes instead of characters.

Yes, but that is not a problem: I am asked to split the XML on a file-size basis (into chunks of 30, 100 or 200 MB), so counting bytes is OK.
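For illustration only, a sketch of that kind of byte accounting, assuming the strings are already decoded and are measured through `Encode::encode_utf8`. The limit of 10 bytes, the sample pieces and all variable names are hypothetical stand-ins, not my actual splitter:

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical limit in bytes; a real run would use 30, 100 or 200 MB.
my $limit = 10;

# Hypothetical decoded pieces of output (5, 6 and 3 bytes in UTF-8).
my @pieces = map { decode('UTF-8', $_) }
             ("caf\xC3\xA9", "na\xC3\xAFve", "abc");

my ($chunk, $written) = (1, 0);
my @assignment;    # chunk number chosen for each piece

for my $piece (@pieces) {
    # Size on disk is the length of the encoded octets,
    # not the number of characters.
    my $size = length(encode_utf8($piece));

    # Start a new chunk when this piece would overflow the current one.
    if ($written > 0 && $written + $size > $limit) {
        $chunk++;
        $written = 0;
    }
    $written += $size;
    push @assignment, $chunk;
}

print "chunk per piece: @assignment\n";   # 1 2 2
```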

In reply to Re^2: performance of length() in utf-8 by seki
in thread performance of length() in utf-8 by seki



	PerlMonks