performance of length() in utf-8
by seki (Scribe) on Mar 03, 2016 at 16:38 UTC
seki has asked for the wisdom of the Perl Monks concerning the following question:
An XML::SAX reader needs one or more "consumers" that receive the data parsed by the XML::SAX::Reader. A consumer can be a predefined accumulator based on a string, an array, or code, or you can define your own.
While working on my SAX-based XML splitter, I needed my own consumer (a descendant of ConsumerInterface) to store some data, with the added ability to query the current size and to reset the data. My first implementation used the .= operator to concatenate and length() to return the length of the data. After testing on files larger than several KB, it became clear that the performance was poor.
I hacked another implementation based on a file to store the temp data, with print() to store and tell() to get the size. It radically improved the performance, but I was wondering about the origin of the problem.
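For readers who want to picture the file-based variant, here is a minimal sketch, not my production code. The package name TempFileStore and the method names output, size and reset_data are made up for the illustration, and it assumes the data should be written as UTF-8. Note that tell() reports a byte offset, so the size it returns is in bytes, not characters.

```perl
use strict;
use warnings;

package TempFileStore;    # hypothetical name, not part of XML::SAX

use Encode     qw(encode_utf8);
use File::Temp qw(tempfile);

sub new {
    my ($class) = @_;
    # tempfile() returns an open read/write handle; UNLINK removes the file at exit.
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;    # raw bytes: the data is encoded explicitly in output()
    return bless { fh => $fh, path => $path }, $class;
}

# Append a chunk of character data, stored as UTF-8 bytes.
sub output {
    my ($self, $data) = @_;
    print { $self->{fh} } encode_utf8($data);
    return;
}

# Current size in bytes: tell() reports the file position, not a character count.
sub size {
    my ($self) = @_;
    return tell($self->{fh});
}

# Discard everything stored so far.
sub reset_data {
    my ($self) = @_;
    my $fh = $self->{fh};
    seek($fh, 0, 0);
    truncate($fh, 0);
    return;
}

package main;

my $store = TempFileStore->new;
$store->output("h\x{e9}llo");    # 5 characters, 6 bytes once encoded as UTF-8
print $store->size, "\n";        # prints 6
$store->reset_data;
print $store->size, "\n";        # prints 0
```

The size query costs the same however much data has been written, because the file position is tracked by the I/O layer rather than recomputed from the contents.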
On my Win7 box, I noticed that the perl.exe process was spending most of its time in the Perl_utf8_length function. After some tests, I could confirm that the calls to length() on utf-8 strings are to blame, not the .= concatenation.
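A rough way to check that claim is sketched below. It is a separate illustration, not the test program from this post, and the helper name time_appends and the sample data are invented. It runs the same append-then-length() loop twice: once on a character string that Perl stores internally as UTF-8, and once on the same data encoded to plain bytes. As far as I understand it, counting characters in a UTF-8 buffer requires scanning it, and any cached count is lost when the string is modified, whereas the byte string's length is simply stored with the scalar.

```perl
use strict;
use warnings;
use Encode      qw(encode_utf8);
use Time::HiRes qw(time);

# Append $chunk to a buffer $n times, calling length() after every append.
# Returns the elapsed wall-clock seconds and the final length.
sub time_appends {
    my ($chunk, $n) = @_;
    my $buf   = '';
    my $len   = 0;
    my $start = time;
    for (1 .. $n) {
        $buf .= $chunk;
        $len = length $buf;
    }
    return (time - $start, $len);
}

# \x{263a} is above 0xFF, so this string is stored internally as UTF-8.
my $chars = "caf\x{e9} \x{263a} " x 10;    # 70 characters
my $bytes = encode_utf8($chars);           # the same data as 100 plain bytes

my $iterations = 5_000;
my ($t_chars, $n_chars) = time_appends($chars, $iterations);
my ($t_bytes, $n_bytes) = time_appends($bytes, $iterations);

printf "character string: %d chars in %.3fs\n", $n_chars, $t_chars;
printf "byte string:      %d bytes in %.3fs\n", $n_bytes, $t_bytes;
```

If the character-string run gets disproportionately slower as the iteration count grows while the byte-string run does not, that points at length() rather than at .= as the cost.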
Here is my test program that mimics my SAX parser's custom data store. It shows the time taken to concatenate a utf-8 string in a loop and get its size, measured in chunks of 1000 iterations.
I have implemented 2 objects:
In my real use case, the file-based code and the stored-length code can each process a 25 MB XML file in 60s, while the same code using the naive length() approach takes about 25 minutes on the same data!
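The stored-length idea can be sketched as follows. This is an illustration only: the package name CountingStore and its methods are hypothetical and are not the objects from my test program. The store keeps a running character count and calls length() only on each small incoming chunk, never on the whole accumulated buffer.

```perl
use strict;
use warnings;

package CountingStore;    # hypothetical name, for illustration only

sub new {
    my ($class) = @_;
    return bless { buf => '', len => 0 }, $class;
}

# Append a chunk and update the running count.
# length() is only ever applied to the new chunk, which stays small.
sub output {
    my ($self, $data) = @_;
    $self->{buf} .= $data;
    $self->{len} += length $data;
    return;
}

# Size in characters, read from the counter instead of recomputed.
sub size { return $_[0]{len} }

# The accumulated data.
sub data { return $_[0]{buf} }

# Discard everything stored so far.
sub reset_data {
    my ($self) = @_;
    $self->{buf} = '';
    $self->{len} = 0;
    return;
}

package main;

my $store = CountingStore->new;
$store->output("caf\x{e9} ");
$store->output("\x{263a}");
print $store->size, "\n";    # prints 6
```

Unlike the file-based store, this one reports characters rather than bytes, and it keeps all the data in memory.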
Can you confirm my analysis, and tell me whether my workaround is suitable?
In the production code, I will probably keep the file storage to please my boss and to limit the memory load (it may temporarily hold up to 200 MB of data, or more depending on the settings, during processing), but that may turn out to be a bad idea that only looks good...
Here is my test code:
The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian