PerlMonks  

performance of length() in utf-8

by seki (Scribe)
on Mar 03, 2016 at 16:38 UTC ( #1156744=perlquestion )

seki has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
An XML::SAX writer needs one or more "consumers" that receive the data parsed by the XML::SAX reader. A consumer can be a predefined accumulator based on a string, an array, or a code reference, or you can define your own.
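For illustration, a custom consumer of that shape might look like the sketch below. The class name MySizedConsumer and the size/reset methods are made up for this example; the output/finalize method names are assumed from the XML::SAX::Writer::ConsumerInterface convention.

```perl
use strict;
use warnings;

# Hypothetical consumer that stores data and tracks its size.
# output()/finalize() follow the XML::SAX::Writer consumer convention;
# size() and reset() are the extra query/reset methods described above.
package MySizedConsumer;

sub new {
    my $class = shift;
    return bless { data => '', size => 0 }, $class;
}

sub output {
    my ($self, $data) = @_;
    $self->{data} .= $data;
    $self->{size} += length $data;   # track size incrementally
}

sub finalize { return $_[0]->{data} }

sub size  { return $_[0]->{size} }
sub reset { $_[0]->{data} = ''; $_[0]->{size} = 0 }

1;
```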

While working on my SAX-based XML splitter I need to use my own consumer (a descendant of ConsumerInterface) to store some data, with the added possibility to query the current size or reset the data. My first implementation used the .= operator to concatenate and length() to return the length of the data, and after testing on files larger than several KB it appeared that the performance regressed linearly (as stated by readers; I first thought exponentially).

I hacked another implementation that uses a file to store the temporary data, with print() to store it and tell() to get the size. It radically improved the performance, but I was wondering about the origin of the problem.

On my Win7 box, I noticed that the perl.exe process was spending a lot of time in the Perl_utf8_length function. And after some tests, I could confirm that the calls to length() in a UTF-8 context, rather than .=, are to blame.
Here is my test program, which mimics my SAX parser's custom data store. It shows the time taken to concatenate a UTF-8 string in a loop and to get its size, in chunks of 1000 iterations.
I have implemented 2 objects:
  • one is using length on the resulting string to return its size
  • the second gets the length of the added data and keeps the size in a scalar that is returned on demand
It seems to me that the bigger the string, the longer it takes to get its size (exponentially? linearly). Here are the times I get:
with length()
        1000 L=    256000 t=0.510602
        2000 L=    512000 t=1.544809
        3000 L=    768000 t=2.558511
        4000 L=   1024000 t=3.598220
        5000 L=   1280000 t=4.622924
        6000 L=   1536000 t=5.666133
        7000 L=   1792000 t=6.634827
        8000 L=   2048000 t=7.653030
        9000 L=   2304000 t=8.687737
       10000 L=   2560000 t=9.727445
       11000 L=   2816000 t=10.728646
       12000 L=   3072000 t=11.764352
       13000 L=   3328000 t=12.804560
       14000 L=   3584000 t=13.783257
       15000 L=   3840000 t=14.836966

with a scalar
        1000 L=    256000 t=0.003712
        2000 L=    512000 t=0.003200
        3000 L=    768000 t=0.003433
        4000 L=   1024000 t=0.003232
        5000 L=   1280000 t=0.003398
        6000 L=   1536000 t=0.003669
        7000 L=   1792000 t=0.004407
        8000 L=   2048000 t=0.002218
        9000 L=   2304000 t=0.004507
       10000 L=   2560000 t=0.002192
       11000 L=   2816000 t=0.005269
       12000 L=   3072000 t=0.002203
       13000 L=   3328000 t=0.006128
       14000 L=   3584000 t=0.002311
       15000 L=   3840000 t=0.002231
In my real use case, the file-based or stored-length code can process a 25 MB XML file in 60 s, while the same code just using the naive length()-based store spends about 25 minutes on the same data!
Can you confirm my analysis, and tell me if my workaround is suitable?

In the production code, I will probably keep the file storage to please my boss and to limit memory usage (it may temporarily hold up to 200 MB of data, or more depending on the settings, during processing), but it may turn out to be a false good idea...

Here is my test code:
use strict;
use warnings;
use feature 'say';
use feature 'state';
use utf8;
use Time::HiRes;

$|++;

# 256-character chunk; a codepoint above 255 guarantees the string is
# internally UTF-8 (the character in the original post's literal was lost)
my $chunk = "\x{263A}" x 256;

my $td = Time::HiRes::time;
my $tf;
my $l;

say "with length()";
my $str = LenTestA->new;
for my $n (1..15_000){
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;
    $count++;
    if ($count % 1000 == 0){
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
        $td = $tf;
    }
}

$td = Time::HiRes::time;
say "\nwith a scalar";
$str = LenTestB->new;
for my $n (1..15_000){
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;
    $count++;
    if ($count % 1000 == 0){
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
        $td = $tf;
    }
}

{
    package LenTestA;
    sub new {
        my $class = shift;
        my $self = '';
        return bless \$self, $class;
    }
    sub add {
        my ($self, $data) = @_;
        $$self .= $data;
    }
    sub len {
        my $self = shift;
        return length $$self;
    }
}

{
    package LenTestB;
    my $len;
    sub new {
        my $class = shift;
        my $self = '';
        $len = 0;    # reset the tracked size for a fresh object
        return bless \$self, $class;
    }
    sub add {
        my ($self, $data) = @_;
        $$self .= $data;
        $len += length($data);
    }
    sub len {
        my $self = shift;
        return $len;
    }
}
The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

Re: performance of length() in utf-8
by kennethk (Abbot) on Mar 03, 2016 at 17:20 UTC
    An easier solution is to dodge the UTF-8 problem. The slowness of length is because (from length)
    length() normally deals in logical characters, not physical bytes
    Essentially, in order to know how many characters are in the string, length has to interrogate every byte to see whether it is part of a multi-byte character. (Incidentally, your timings look linear, not exponential, to me.) You can avoid this cost if, instead of storing the string as you encounter it, you store it encoded:
    {
        package LenTestC;
        use Encode;
        sub new {
            my $class = shift;
            my $self = '';
            return bless \$self, $class;
        }
        sub add {
            my ($self, $data) = @_;
            $$self .= Encode::encode_utf8($data);
        }
        sub len {
            my $self = shift;
            return length $$self;
        }
    }
    The target string never gets upgraded to UTF-8, and thus the fast length algorithm can be used.

    Note that your print/tell solution did the same kind of accounting, reporting bytes instead of characters.


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Many thanks for your valuable answer. I reproduced the same performance gains on my system, though without grasping why.

      I was told that since UTF-8 string handling was natively integrated into the Perl core, a string carries an internal flag telling whether it is UTF-8 or not.
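That internal flag can be inspected with core modules; a small sketch (utf8::is_utf8 reports the flag, and Devel::Peek::Dump prints the SV's flags to STDERR):

```perl
use strict;
use warnings;
use Devel::Peek;

my $bytes = "abc";               # plain byte string, no UTF8 flag
my $chars = "smile \x{263A}";    # a codepoint above 255 forces the UTF8 flag

print utf8::is_utf8($bytes) ? "flagged\n" : "not flagged\n";  # not flagged
print utf8::is_utf8($chars) ? "flagged\n" : "not flagged\n";  # flagged

Dump($chars);   # Devel::Peek shows FLAGS = (POK,pPOK,UTF8) on STDERR
```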

      When parsing an XML file declared with encoding="utf-8", are the strings produced by the XML SAX parser not given in UTF-8? (I had not noticed, because I do not display the processed data; if the string is given undecoded, I guess it is written out as-is, but I should double-check that.)
      I seem to understand that the SAX writer may query the data size many times, so the overhead of encoding the data is outweighed by the many calls to a length() that performs better.
      But I do not see how length() differs depending on the string encoding: if the string is not UTF-8, length should return a byte count (1 byte per character), while for UTF-8 we must inspect each byte to know whether it is a single-byte character, the starting byte of a multi-byte character, or a continuation byte. I would have thought that processing a UTF-8 string always performs worse than a plain string...
      Note that your print/tell solution did the same kind of accounting, reporting bytes instead of characters.
      Yes, but that is not a problem, as I am asked to split the XML on a file-size basis (into 30, 100 or 200 MB chunks), so counting bytes is fine.
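For that byte-oriented accounting, the character count and the byte count of the same string can be compared directly; a sketch, assuming the data are Perl character strings that will be written out as UTF-8:

```perl
use strict;
use warnings;
use Encode;

my $s = "caf\x{e9} \x{263A}";   # 6 characters: 'é' is 2 UTF-8 bytes, the smiley 3

my $chars = length $s;                       # 6 -- logical characters
my $bytes = length Encode::encode_utf8($s);  # 9 -- bytes once encoded as UTF-8

print "$chars characters, $bytes bytes\n";   # "6 characters, 9 bytes"
```

Note that encode_utf8 copies the whole string each time, so for a splitter it is still cheaper to add up the byte length of each chunk as it arrives, as in the stored-length approach above.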

        Encoding can be a challenge to get one's head around. When you read the strings in from your XML parsing, Perl pulls them in as a series of UTF-8 characters, and the string that contains them has the UTF-8 flag set to true. In order to determine the length of the string, each byte must be examined to figure out how many characters are represented, thus the slow length.

        Invoking Encode::encode_utf8($data) returns the UTF-8 string transformed into the equivalent byte stream. Essentially, from Perl's perspective, it breaks the logical connection between the bytes and leaves them as some combination of high-bit and low-bit characters. Now, since every record in the string is exactly 1 byte wide, the byte count requires no inspection.

        So:

        print length chr 199;
        outputs 1 while
        use Encode; print length Encode::encode_utf8(chr 199);
        outputs 2. Similarly, if you run
        say join ",", map ord, split //, chr 199;
        you output 199, while
        use Encode; say join ",", map ord, split //, Encode::encode_utf8(chr 199);
        outputs 195, 135.

        However, if your terminal is set to display UTF-8, printing both of those strings will output the same because the series of bits is unaffected.

        Does that help?


        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: performance of length() in utf-8
by poj (Abbot) on Mar 03, 2016 at 17:38 UTC
    Can you confirm my analysis

    5-20 times slower is mentioned here Speed

Re: performance of length() in utf-8
by Anonymous Monk on Mar 03, 2016 at 19:46 UTC
    It seems to me that bigger is the string, longer is the time to get its size, (exponentialy?).
    Linearly. Perl_utf8_length is basically:

        while (start_ptr < end_ptr) {
            start_ptr += UTF8SKIP(start_ptr);
            len++;
        }

    and UTF8SKIP just returns the number of bytes in a UTF-8-encoded codepoint, given its starting byte. So, basically, Perl_utf8_length is O(N), where N is the number of codepoints in the (SvUTF8) string.

    Your LenTestA is exponential because it basically does

    PL_utf8_length($chunk)                     # first iteration
    PL_utf8_length($chunk . $chunk)            # second iteration
    PL_utf8_length($chunk . $chunk . $chunk)   # third iteration
    ...and so on...
    while LenTestB is
    PL_utf8_length($chunk)   # 1st
    PL_utf8_length($chunk)   # 2nd
    PL_utf8_length($chunk)   # 3rd
    ...

      Your LenTestA is exponential
      That would be quadratic growth: the cost of each length() call grows linearly with the string length, and the number of invocations grows linearly as well, so the total work is O(N**2).

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        Maybe, maybe, I don't speak English very well. Basically, it's O(N**2).

Node Type: perlquestion [id://1156744]
Approved by hippo
Front-paged by Discipulus