PerlMonks
Many thanks for your valuable answer. I reproduced the same performance gains on my system, though I do not yet grasp why.
I was told that since UTF-8 string handling was integrated natively into the Perl core, a string carries an internal flag that tells whether it is UTF-8 or not. When parsing an XML file declared with encoding="utf-8", are the strings delivered by the XML SAX parser not already in UTF-8? (I did not notice because I do not display the processed data, so if a string is delivered undecoded I guess it is written as-is, but I should double-check that.)

I think I understand that the SAX writer may query the data size many times, so the overhead of encoding the data is offset by the many calls to a length() that performs better. But I do not see why length() behaves differently depending on the string encoding: if the string is not UTF-8, length() should return a byte count (1 byte per character), while for UTF-8 each byte must be examined to tell whether it is a single-byte character, the leading byte of a multi-byte character, or a continuation byte of a multi-byte character. I would have thought that processing a UTF-8 string performs worse than processing a plain string...

"Note that your print/tell solution did the same kind of accounting, reporting bytes instead of characters."

Yes, but that is not a problem, as I am asked to split the XML by file size (into chunks of 30, 100 or 200 MB), so counting bytes is OK.

In reply to Re^2: performance of length() in utf-8
by seki
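
For illustration only, here is a minimal sketch (not taken from the thread) of what length() reports for raw octets versus a decoded character string, and how the internal UTF8 flag differs between the two. The sample string and variable names are assumptions chosen for the example. It says nothing about the performance question itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw octets as they would arrive from a UTF-8 file read without decoding:
# "e acute" is the two bytes 0xC3 0xA9, so this string holds 5 octets.
my $octets = "caf\xC3\xA9";

# Decoding turns the octets into a character string (4 characters).
my $chars = decode('UTF-8', $octets);

my $octet_len = length $octets;                  # counts octets
my $char_len  = length $chars;                   # counts characters
my $byte_len  = length encode('UTF-8', $chars);  # size once re-encoded for output

printf "octets: %d, characters: %d, encoded size: %d\n",
    $octet_len, $char_len, $byte_len;

# utf8::is_utf8() exposes the internal flag mentioned above.
printf "UTF8 flag on octets: %s, on characters: %s\n",
    (utf8::is_utf8($octets) ? 'on' : 'off'),
    (utf8::is_utf8($chars)  ? 'on' : 'off');
```

When the goal is to split output by file size, the octet count (or the encoded size of a decoded string) is the figure that matches what ends up on disk.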