PerlMonks  

performance of length() in utf-8

by seki (Scribe)
on Mar 03, 2016 at 16:38 UTC ( #1156744=perlquestion )

seki has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
An XML::SAX writer needs one or more "consumers" that receive the data parsed by the XML::SAX reader. A consumer can be a predefined accumulator based on a string, an array, or a code reference, or you can define your own.
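For illustration, a custom consumer of that shape might look like the sketch below. The class name MySizedConsumer and the size/reset methods are made up for this example; the output/finalize method names are assumed from the XML::SAX::Writer::ConsumerInterface convention.

```perl
use strict;
use warnings;

# Hypothetical consumer that stores data and tracks its size.
# output()/finalize() follow the XML::SAX::Writer consumer convention;
# size() and reset() are the extra query/reset methods described above.
package MySizedConsumer;

sub new {
    my $class = shift;
    return bless { data => '', size => 0 }, $class;
}

sub output {
    my ($self, $data) = @_;
    $self->{data} .= $data;
    $self->{size} += length $data;   # track size incrementally
}

sub finalize { return $_[0]->{data} }

sub size  { return $_[0]->{size} }
sub reset { $_[0]->{data} = ''; $_[0]->{size} = 0 }

1;
```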

While working on my SAX-based XML splitter I need to use my own consumer (a descendant of ConsumerInterface) to store some data, with the added possibility to query the current size or reset the data. My first implementation used the .= operator to concatenate and length() to return the length of the data, and after testing on files larger than several KB it appeared that the performance regressed linearly (as stated by readers; I first thought exponentially).

I hacked another implementation that uses a file to store the temporary data, with print() to store it and tell() to get the size. It radically improved the performance, but I was wondering about the origin of the problem.

On my Win7 box, I noticed that the perl.exe process was spending a lot of time in the Perl_utf8_length function. And after some tests, I could confirm that the calls to length() in a UTF-8 context, rather than .=, are to blame.
Here is my test program, which mimics my SAX parser's custom data store. It shows the time taken to concatenate a UTF-8 string in a loop and to get its size, in chunks of 1000 iterations.
I have implemented 2 objects:
  • one is using length on the resulting string to return its size
  • the second gets the length of the added data and keeps the size in a scalar that is returned on demand
It seems to me that the bigger the string, the longer it takes to get its size (exponentially? linearly). Here are the times I get:
with length()
        1000 L=    256000 t=0.510602
        2000 L=    512000 t=1.544809
        3000 L=    768000 t=2.558511
        4000 L=   1024000 t=3.598220
        5000 L=   1280000 t=4.622924
        6000 L=   1536000 t=5.666133
        7000 L=   1792000 t=6.634827
        8000 L=   2048000 t=7.653030
        9000 L=   2304000 t=8.687737
       10000 L=   2560000 t=9.727445
       11000 L=   2816000 t=10.728646
       12000 L=   3072000 t=11.764352
       13000 L=   3328000 t=12.804560
       14000 L=   3584000 t=13.783257
       15000 L=   3840000 t=14.836966

with a scalar
        1000 L=    256000 t=0.003712
        2000 L=    512000 t=0.003200
        3000 L=    768000 t=0.003433
        4000 L=   1024000 t=0.003232
        5000 L=   1280000 t=0.003398
        6000 L=   1536000 t=0.003669
        7000 L=   1792000 t=0.004407
        8000 L=   2048000 t=0.002218
        9000 L=   2304000 t=0.004507
       10000 L=   2560000 t=0.002192
       11000 L=   2816000 t=0.005269
       12000 L=   3072000 t=0.002203
       13000 L=   3328000 t=0.006128
       14000 L=   3584000 t=0.002311
       15000 L=   3840000 t=0.002231
In my real use case, the file-based or stored-length code can process a 25 MB XML file in 60 s, while the same code just using the naive length()-based store spends about 25 minutes on the same data!
Can you confirm my analysis, and tell me if my workaround is suitable?

In the production code, I will probably keep the file storage to please my boss and to limit memory usage (it may temporarily hold up to 200 MB of data, or more depending on the settings, during processing), but it may turn out to be a false good idea...

Here is my test code:
use strict;
use warnings;
use feature 'say';
use feature 'state';
use utf8;
use Time::HiRes;

$|++;

# 256-character chunk; a codepoint above 255 guarantees the string is
# internally UTF-8 (the character in the original post's literal was lost)
my $chunk = "\x{263A}" x 256;

my $td = Time::HiRes::time;
my $tf;
my $l;

say "with length()";
my $str = LenTestA->new;
for my $n (1..15_000){
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;
    $count++;
    if ($count % 1000 == 0){
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
        $td = $tf;
    }
}

$td = Time::HiRes::time;
say "\nwith a scalar";
$str = LenTestB->new;
for my $n (1..15_000){
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;
    $count++;
    if ($count % 1000 == 0){
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
        $td = $tf;
    }
}

{
    package LenTestA;
    sub new {
        my $class = shift;
        my $self = '';
        return bless \$self, $class;
    }
    sub add {
        my ($self, $data) = @_;
        $$self .= $data;
    }
    sub len {
        my $self = shift;
        return length $$self;
    }
}

{
    package LenTestB;
    my $len;
    sub new {
        my $class = shift;
        my $self = '';
        $len = 0;    # reset the tracked size for a fresh object
        return bless \$self, $class;
    }
    sub add {
        my ($self, $data) = @_;
        $$self .= $data;
        $len += length($data);
    }
    sub len {
        my $self = shift;
        return $len;
    }
}
The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

Re: performance of length() in utf-8
by kennethk (Abbot) on Mar 03, 2016 at 17:20 UTC
    An easier solution is to dodge the UTF-8 problem. The slowness of length is because (from length)
    length() normally deals in logical characters, not physical bytes
    Essentially, in order to know how many characters are in the string, length has to interrogate every byte to see whether it is part of a multi-byte character. (Incidentally, your timings look linear, not exponential, to me.) You can avoid this cost if, instead of storing the string as you encounter it, you store it encoded:
    {
        package LenTestC;
        use Encode;
        sub new {
            my $class = shift;
            my $self = '';
            return bless \$self, $class;
        }
        sub add {
            my ($self, $data) = @_;
            $$self .= Encode::encode_utf8($data);
        }
        sub len {
            my $self = shift;
            return length $$self;
        }
    }
    The target string never gets upgraded to UTF-8, and thus the fast length algorithm can be used.

    Note that your print/tell solution did the same kind of accounting, reporting bytes instead of characters.


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Many thanks for your valuable answer. I reproduced the same performance gains on my system, though without grasping why.

      I was told that since UTF-8 string handling was natively integrated into the Perl core, a string carries an internal flag telling whether it is UTF-8 or not.
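That internal flag can be inspected with core modules; a small sketch (utf8::is_utf8 reports the flag, and Devel::Peek::Dump prints the SV's flags to STDERR):

```perl
use strict;
use warnings;
use Devel::Peek;

my $bytes = "abc";               # plain byte string, no UTF8 flag
my $chars = "smile \x{263A}";    # a codepoint above 255 forces the UTF8 flag

print utf8::is_utf8($bytes) ? "flagged\n" : "not flagged\n";  # not flagged
print utf8::is_utf8($chars) ? "flagged\n" : "not flagged\n";  # flagged

Dump($chars);   # Devel::Peek shows FLAGS = (POK,pPOK,UTF8) on STDERR
```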

      When parsing an XML file declared with encoding="utf-8", are the strings produced by the XML SAX parser not given in UTF-8? (I had not noticed, because I do not display the processed data; if the string is given undecoded, I guess it is written out as-is, but I should double-check that.)
      I seem to understand that the SAX writer may query the data size many times, so the overhead of encoding the data is outweighed by the many calls to a length() that performs better.
      But I do not see how length() differs depending on the string encoding: if the string is not UTF-8, length should return a byte count (1 byte per character), while for UTF-8 we must inspect each byte to know whether it is a single-byte character, the starting byte of a multi-byte character, or a continuation byte. I would have thought that processing a UTF-8 string always performs worse than a plain string...
      Note that your print/tell solution did the same kind of accounting, reporting bytes instead of characters.
      Yes, but that is not a problem, as I am asked to split the XML on a file-size basis (into 30, 100 or 200 MB chunks), so counting bytes is fine.
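For that byte-oriented accounting, the character count and the byte count of the same string can be compared directly; a sketch, assuming the data are Perl character strings that will be written out as UTF-8:

```perl
use strict;
use warnings;
use Encode;

my $s = "caf\x{e9} \x{263A}";   # 6 characters: 'é' is 2 UTF-8 bytes, the smiley 3

my $chars = length $s;                       # 6 -- logical characters
my $bytes = length Encode::encode_utf8($s);  # 9 -- bytes once encoded as UTF-8

print "$chars characters, $bytes bytes\n";   # "6 characters, 9 bytes"
```

Note that encode_utf8 copies the whole string each time, so for a splitter it is still cheaper to add up the byte length of each chunk as it arrives, as in the stored-length approach above.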

        Encoding can be a challenge to get one's head around. When you read the strings in from your XML parsing, Perl pulls them in as a series of UTF-8 characters, and the string that contains them has the UTF-8 flag set to true. In order to determine the length of the string, each byte must be examined to figure out how many characters are represented, thus the slow length.

        Invoking Encode::encode_utf8($data) returns the UTF-8 string transformed into the equivalent byte stream. Essentially, from Perl's perspective, it breaks the logical connection between the bytes and leaves them as some combination of high-bit and low-bit characters. Now, since every record in the string is exactly 1 byte wide, the byte count requires no inspection.

        So:

        print length chr 199;
        outputs 1 while
        use Encode; print length Encode::encode_utf8(chr 199);
        outputs 2. Similarly, if you run
        say join ",", map ord, split //, chr 199;
        you output 199, while
        use Encode; say join ",", map ord, split //, Encode::encode_utf8(chr 199);
        outputs 195, 135.

        However, if your terminal is set to display UTF-8, printing both of those strings will output the same because the series of bits is unaffected.

        Does that help?


        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: performance of length() in utf-8
by poj (Abbot) on Mar 03, 2016 at 17:38 UTC
    Can you confirm my analysis

    5-20 times slower is mentioned here Speed

Re: performance of length() in utf-8
by Anonymous Monk on Mar 03, 2016 at 19:46 UTC
    It seems to me that bigger is the string, longer is the time to get its size, (exponentialy?).
    Linearly. Perl_utf8_length is basically:

        while (start_ptr < end_ptr) {
            start_ptr += UTF8SKIP(start_ptr);
            len++;
        }

    and UTF8SKIP just returns the number of bytes in a UTF-8-encoded codepoint, given its starting byte. So, basically, Perl_utf8_length is O(N), where N is the number of codepoints in the (SvUTF8) string.

    Your LenTestA is exponential because it basically does

    PL_utf8_length($chunk)                     # first iteration
    PL_utf8_length($chunk . $chunk)            # second iteration
    PL_utf8_length($chunk . $chunk . $chunk)   # third iteration
    ...and so on...
    while LenTestB is
    PL_utf8_length($chunk)   # 1st
    PL_utf8_length($chunk)   # 2nd
    PL_utf8_length($chunk)   # 3rd
    ...

      Your LenTestA is exponential
      That would be quadratic growth: the cost of each length() call grows linearly with the string length, and the number of invocations grows linearly as well, so the total work is O(N**2).

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        Maybe, maybe, I don't speak English very well. Basically, it's O(N**2).

Node Type: perlquestion [id://1156744]
Approved by hippo
Front-paged by Discipulus