http://qs321.pair.com?node_id=1156744

seki has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
An XML::SAX reader needs to use one or more "consumers" that will receive the data parsed by the XML::SAX reader. The consumer can be a predefined accumulator based on a string, an array, or code, or you can define your own.
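For illustration, a stripped-down custom consumer looks like this. This is not my splitter code, just a minimal sketch: the method names (new/output/finalize) follow the XML::SAX::Writer::ConsumerInterface convention, and the size/reset helpers are my own hypothetical additions:

```perl
use strict;
use warnings;

# Minimal custom consumer sketch: accumulates chunks in a blessed
# scalar ref. new/output/finalize are the methods a consumer is
# expected to provide; size/reset are extra helpers for this example.
package MyStringConsumer;

sub new {
    my $class = shift;
    my $data  = '';
    return bless \$data, $class;
}

sub output {                         # called with each produced chunk
    my ($self, $data) = @_;
    $$self .= $data;
}

sub finalize { 1 }                   # nothing to flush for a string

sub size  { length ${ $_[0] } }      # query the current size
sub reset { ${ $_[0] } = '' }        # drop the accumulated data

package main;

my $c = MyStringConsumer->new;
$c->output('<doc/>');
print $c->size, "\n";   # 6
```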

While working on my SAX-based XML splitter, I needed my own consumer (a descendant of ConsumerInterface) to store some data, with the added possibility of querying the current size or resetting the data. My first implementation used the .= operator to concatenate and length() to return the size of the data, and after testing on files larger than several KB it appeared to me that performance regressed linearly (not exponentially, as readers pointed out) as the data grew.

I hacked another implementation that uses a file to store the temporary data, with print() to append and tell() to get the size. It radically improved the performance, but I was left wondering about the origin of the problem.
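The file-backed variant boils down to something like the sketch below (names are hypothetical, not my production code). Note that tell() reports a byte offset, not a character count, so its notion of "size" differs from length() on a character string:

```perl
use strict;
use warnings;
use File::Temp ();

# Sketch of a file-backed consumer: chunks are appended with print(),
# and the current size is read back with tell(), which is a cheap
# offset lookup rather than a scan of a growing UTF-8 string.
package MyFileConsumer;

sub new {
    my $class = shift;
    my $fh = File::Temp->new(UNLINK => 1);   # temp file, removed on exit
    binmode $fh, ':encoding(UTF-8)';
    $fh->autoflush(1);                       # keep tell() in step with writes
    return bless { fh => $fh }, $class;
}

sub output {
    my ($self, $data) = @_;
    print { $self->{fh} } $data;
}

sub size {
    my $self = shift;
    return tell $self->{fh};    # bytes stored so far
}

sub finalize { 1 }

package main;

my $c = MyFileConsumer->new;
$c->output('hello');
print $c->size, "\n";   # 5
```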

On my Win7 box, I noticed that the perl.exe process was spending a lot of time in the Perl_utf8_length function. After some tests, I could confirm that it is the calls to length() in a UTF-8 context, rather than .=, that are to blame.
Here is my test program, which mimics my SAX parser's custom data store. It concatenates a UTF-8 string in a loop and queries its size, reporting the elapsed time every 1000 iterations.
I have implemented two objects. It seems to me that the bigger the string is, the longer it takes to get its size (exponentially? linearly?). Here are the times I get:
with length()
        1000 L=    256000 t=0.510602
        2000 L=    512000 t=1.544809
        3000 L=    768000 t=2.558511
        4000 L=   1024000 t=3.598220
        5000 L=   1280000 t=4.622924
        6000 L=   1536000 t=5.666133
        7000 L=   1792000 t=6.634827
        8000 L=   2048000 t=7.653030
        9000 L=   2304000 t=8.687737
       10000 L=   2560000 t=9.727445
       11000 L=   2816000 t=10.728646
       12000 L=   3072000 t=11.764352
       13000 L=   3328000 t=12.804560
       14000 L=   3584000 t=13.783257
       15000 L=   3840000 t=14.836966

with a scalar
        1000 L=    256000 t=0.003712
        2000 L=    512000 t=0.003200
        3000 L=    768000 t=0.003433
        4000 L=   1024000 t=0.003232
        5000 L=   1280000 t=0.003398
        6000 L=   1536000 t=0.003669
        7000 L=   1792000 t=0.004407
        8000 L=   2048000 t=0.002218
        9000 L=   2304000 t=0.004507
       10000 L=   2560000 t=0.002192
       11000 L=   2816000 t=0.005269
       12000 L=   3072000 t=0.002203
       13000 L=   3328000 t=0.006128
       14000 L=   3584000 t=0.002311
       15000 L=   3840000 t=0.002231
In my real use case, the file-based (or stored-length) code can process a 25 MB XML file in 60 s, while the same code using the naive length()-based store spends about 25 minutes on the same data!
Can you confirm my analysis, and tell me whether my workaround is suitable?

In the production code, I will probably keep the file storage to please my boss and to limit the memory load (it may temporarily hold up to 200 MB of data, or more depending on the settings, during processing), but it may be a false good idea...

Here is my test code:
use strict;
use warnings;
use feature 'say';
use feature 'state';
use utf8;
use Time::HiRes;

$|++;

my $chunk = '€' x 256;    # 256 multi-byte characters
my $td = Time::HiRes::time;
my $tf;
my $l;

say "with length()";
my $str = LenTestA->new;
for my $n (1 .. 15_000) {
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;
    $count++;
    if ($count % 1000 == 0) {
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf - $td;
        $td = $tf;
    }
}

$td = Time::HiRes::time;
say "\nwith a scalar";
$str = LenTestB->new;
for my $n (1 .. 15_000) {
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;
    $count++;
    if ($count % 1000 == 0) {
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf - $td;
        $td = $tf;
    }
}

{
    # LenTestA: ask Perl for the character length on every call
    package LenTestA;

    sub new {
        my $class = shift;
        my $self  = '';
        return bless \$self, $class;
    }

    sub add {
        my ($self, $data) = @_;
        $$self .= $data;
    }

    sub len {
        my $self = shift;
        return length $$self;
    }
}

{
    # LenTestB: same storage, but track the length incrementally
    package LenTestB;

    my $len;    # note: package-level, so shared by all LenTestB instances

    sub new {
        my $class = shift;
        my $self  = '';
        return bless \$self, $class;
    }

    sub add {
        my ($self, $data) = @_;
        $$self .= $data;
        $len += length($data);
    }

    sub len {
        my $self = shift;
        return $len;
    }
}
The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian