PerlMonks  

Splitting a string to chunks

by spurperl (Priest)
on Nov 29, 2006 at 13:33 UTC ( [id://586695]=perlquestion )

spurperl has asked for the wisdom of the Perl Monks concerning the following question:

Something that comes up fairly often is the need to split a string into equal-sized chunks. For instance, given the string "abcdefgh12345678", splitting it into 4-character chunks would produce ("abcd", "efgh", "1234", "5678"). Looking around the monastery, I found at least a couple of posts on the subject.
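To make the task concrete, here is a minimal sketch that produces the four chunks from the example. The helper name chunk_string is hypothetical (it does not come from any of the posts), and substr is just one of several possible techniques.

```perl
use strict;
use warnings;

# chunk_string is a hypothetical helper name used only for illustration.
# Returns consecutive pieces of $str, each $size characters long; the last
# piece is shorter when the length is not a multiple of $size.
sub chunk_string {
    my ($str, $size) = @_;
    die "chunk size must be a positive integer\n"
        unless defined $size && $size =~ /^[1-9]\d*$/;
    my @chunks;
    for (my $i = 0; $i < length $str; $i += $size) {
        push @chunks, substr($str, $i, $size);
    }
    return @chunks;
}

my @parts = chunk_string("abcdefgh12345678", 4);
print join(", ", @parts), "\n";   # prints: abcd, efgh, 1234, 5678
```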

I tried to time some different techniques against each other:

    use Benchmark qw(cmpthese);

    my $str    = "abcdefgh12345678" x 20;
    my $strlen = length $str;

    cmpthese(50000, {
        'grep_split' => sub {
            my @arr = grep {$_} split /(.{8})/, $str;
        },
        'split_pos' => sub {
            my @arr = split /(?(?{pos() % 8})(?!))/, $str;
        },
        'substr_map' => sub {
            my $len = length $str;
            my @arr = map {substr($str, $_ * 8, 8)} (0 .. $strlen / 8 - 1);
        },
        'substr_loop' => sub {
            my @arr;
            my $len = length $str;
            for (my $i = 0; $i < $len; $i += 8) {
                push(@arr, substr($str, $i, 8));
            }
        },
        'unpack' => sub {
            my @arr = unpack('(A8)*', $str);
        }
    });

And the results are quite surprising:

                   Rate
    split_pos    3203/s
    grep_split   6425/s
    substr_map   8889/s
    unpack      11348/s
    substr_loop 15097/s

Contrary to what I expected (that built-in functions should be faster than loops), the looping solution is the swiftest. It beats unpack by a margin ranging from 15 to 50 percent, depending on the length of the string and of the chunks.

Any way to make it faster ?

Replies are listed 'Best First'.
Re: Splitting a string to chunks
by Limbic~Region (Chancellor) on Nov 29, 2006 at 13:50 UTC
    spurperl,
    I was surprised to see that unpack wasn't the fastest so I changed it just a bit.
    my @arr = unpack('A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8', $str);
    Not only is that compatible with older perls, it is now the fastest. I might play a bit more to see if I can get an even faster version, but to be fair, that really should have been:
    # No longer wins but is still faster than '(A8)*'
    my @arr = unpack((join '', ('A8' x ($strlen / 8))), $str);
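A possible way to package the generated-template idea is sketched below. The helper name unpack_chunks is hypothetical, and rounding the count up (so a short tail is kept) is an assumption that goes beyond the post. Note that a literal template covers only as many chunks as it has items: twenty A8 items consume 160 characters, so the 320-character benchmark string needs forty. Also keep in mind that the A format strips trailing spaces and NULs from each chunk.

```perl
use strict;
use warnings;

# unpack_chunks is a hypothetical helper; it builds one "A<size>" item per chunk.
sub unpack_chunks {
    my ($str, $size) = @_;
    # Round up so that a final, shorter chunk still gets a template item.
    my $count    = int((length($str) + $size - 1) / $size);
    my $template = "A$size" x $count;
    return unpack($template, $str);
}

my $str = "abcdefgh12345678" x 20;      # 320 characters, as in the original benchmark
my @arr = unpack_chunks($str, 8);
print scalar(@arr), " chunks\n";        # prints: 40 chunks
```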

    Update: I wanted to see what would happen if the benchmark focused more on the functions themselves by removing some of the intermediate calculations. Note also that I changed x 20 to x 200.

    Cheers - L~R

Re: Splitting a string to chunks
by duff (Parson) on Nov 29, 2006 at 13:52 UTC

    On my system, I get a different result:

                   Rate   split_pos  grep_split  substr_map substr_loop      unpack
    split_pos    4596/s          --        -53%        -71%        -79%        -82%
    grep_split   9843/s        114%          --        -37%        -54%        -61%
    substr_map  15674/s        241%         59%          --        -27%        -38%
    substr_loop 21459/s        367%        118%         37%          --        -15%
    unpack      25381/s        452%        158%         62%         18%          --
    
    Your performance characteristics depend on all sorts of things relating to your CPU, its cache, bus speed, memory, etc.

    But as far as ways to make it faster, you might want to use an idiomatic for loop instead of the C-style loop.
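One reading of the "idiomatic for loop" suggestion is a foreach over a range of chunk indexes, sketched below. The name substr_foreach is hypothetical, and whether this is actually faster than the C-style loop is not measured here; that would need to be benchmarked on the machine in question.

```perl
use strict;
use warnings;

# substr_foreach is a hypothetical name for a range-based variant of 'substr_loop'.
sub substr_foreach {
    my ($str, $size) = @_;
    return () unless length $str;                 # nothing to split
    my $last = int((length($str) - 1) / $size);   # index of the final (possibly short) chunk
    my @arr;
    for my $i (0 .. $last) {
        push @arr, substr($str, $i * $size, $size);
    }
    return @arr;
}

my @arr = substr_foreach("abcdefgh12345678", 8);
print join("|", @arr), "\n";                      # prints: abcdefgh|12345678
```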

Re: Splitting a string to chunks
by Fengor (Pilgrim) on Nov 29, 2006 at 13:48 UTC
    what about
    'regex' => sub { my @arr = $str =~ /(........)/g }
    didn't time it though.

    --
    "WHAT CAN THE HARVEST HOPE FOR IF NOT THE CARE OF THE REAPER MAN"
    -- Terry Pratchett, "Reaper Man"

      Hi,

      I added another version, one that also handles a string whose last chunk is shorter. I also added /o to improve performance (it can be used if you have several lines to split).

      Added this to the benchmark:

      'regex'  => sub { my @arr = $string =~ /(........)/g; },
      'regexo' => sub { my @arr = $string =~ /(.{1,8})/og; },
      The results:
                       Rate split_pos  split grep_split substr_map substr_loop unpack  regex regexo
      split_pos      7295/s        --   -57%       -60%       -68%        -77%   -78%  -100%  -100%
      split         16900/s      132%     --        -7%       -26%        -47%   -50%  -100%  -100%
      grep_split    18241/s      150%     8%         --       -20%        -43%   -46%  -100%  -100%
      substr_map    22883/s      214%    35%        25%         --        -29%   -32%   -99%  -100%
      substr_loop   32139/s      341%    90%        76%        40%          --    -4%   -99%   -99%
      unpack        33495/s      359%    98%        84%        46%          4%     --   -99%   -99%
      regex       4342185/s    59421% 25593%     23705%     18876%      13411% 12864%     --    -6%
      regexo      4596612/s    62909% 27098%     25099%     19988%      14202% 13623%     6%     --
        Umhmm, you got my typo. I accidentally used $string instead of $str in my first post. That explains the high rates for the regex solution. Here is the timing with the typo corrected:
                         Rate split_pos grep_split substr_map regexo  regex substr_loop unpack
        split_pos      5587/s        --       -65%       -69%   -76%   -77%        -79%   -81%
        grep_split    15974/s      186%         --       -12%   -32%   -34%        -40%   -45%
        substr_map    18051/s      223%        13%         --   -23%   -26%        -32%   -38%
        regexo        23474/s      320%        47%        30%     --    -3%        -12%   -20%
        regex         24272/s      334%        52%        34%     3%     --         -9%   -17%
        substr_loop   26596/s      376%        66%        47%    13%    10%          --    -9%
        unpack        29240/s      423%        83%        62%    25%    20%         10%     --


        themage,
        Your benchmark disagrees with mine (with x 20 and x 200). Additionally, I think you should re-read perlre with regards to what /o does.

        I am sure diotalevi will improve upon my explanation but in a nutshell, /o is an old optimization predating qr//. If you needed to interpolate a variable inside a regex such as /$regex/ but knew that $regex would never change, the flag would tell perl to only compile the regex once. In fact, if you broke your promise and changed $regex then it would still not recompile it leading to buggy code. Then came along qr// and improved things greatly (see /o is dead, long live qr//!).

        Since you are not using a variable in your interpolation - the /o is having no effect.
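For the case where the pattern does contain a variable, a minimal sketch of the qr// approach might look like this. The names $size, $chunk_re and regex_chunks are assumptions made for illustration; the /s flag is added so that "." also matches newlines.

```perl
use strict;
use warnings;

my $size     = 8;                       # assumed chunk size, for illustration only
my $chunk_re = qr/(.{1,$size})/s;       # compiled once here, then reused below

# regex_chunks is a hypothetical helper name.
sub regex_chunks {
    my ($str) = @_;
    return $str =~ /$chunk_re/g;        # in list context: every captured chunk
}

my @arr = regex_chunks("abcdefgh12345678abc");
print join("|", @arr), "\n";            # prints: abcdefgh|12345678|abc
```

Because the compiled pattern lives in an ordinary variable, changing the chunk size means building a new qr// object, rather than silently reusing a stale pattern as can happen with /o.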

        See also this regarding how current perls optimize regex compiling. Unfortunately, I couldn't find this in any perldelta from 5.6.1 to 5.9.4, which makes me suspicious, so I posted Questions concerning /o regex modifier.

        Cheers - L~R

        Without having run your benchmark, the huge disparity between your solutions and the others makes me very suspicious that your code is not producing the same results as the others. Have you checked?
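One way to perform such a check before benchmarking is sketched below: run every candidate once and compare its output with a reference. The three candidates are adapted from the thread and the hash layout is an assumption. With use strict in effect, an undeclared variable such as $string is rejected at compile time instead of silently matching against an undefined value.

```perl
use strict;
use warnings;

my $str = "abcdefgh12345678" x 20;

# Candidates adapted from the thread; each returns its list of chunks.
my %candidates = (
    'substr_loop' => sub {
        my @arr;
        for (my $i = 0; $i < length $str; $i += 8) {
            push @arr, substr($str, $i, 8);
        }
        return @arr;
    },
    'unpack' => sub { return unpack('(A8)*', $str) },
    'regex'  => sub { return $str =~ /(.{1,8})/gs },
);

# Use one candidate as the reference and compare the others against it.
my @expected = $candidates{'substr_loop'}->();
my %agrees;
for my $name (sort keys %candidates) {
    my @got = $candidates{$name}->();
    $agrees{$name} = (@got == @expected
        && join("\0", @got) eq join("\0", @expected)) ? 1 : 0;
    print "$name: ", ($agrees{$name} ? "ok" : "MISMATCH"), "\n";
}
```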


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Splitting a string to chunks
by Not_a_Number (Prior) on Nov 29, 2006 at 15:15 UTC
    Any way to make it faster ?

    On my machine, this is slightly faster still:

    'substr_loop2' => sub { my @arr; my $s = $str; push @arr, substr $s, 0, 8, '' while $s; },
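A small caveat about the loop condition, shown as a sketch: "while $s" stops as soon as the remaining string is false, which includes a lone "0". Testing the length avoids that. The name substr_consume is hypothetical.

```perl
use strict;
use warnings;

# substr_consume is a hypothetical name for a variant of 'substr_loop2'.
sub substr_consume {
    my ($s, $size) = @_;        # $s is a copy, so the caller's string survives
    my @arr;
    # 4-argument substr removes and returns the leading $size characters.
    push @arr, substr($s, 0, $size, '') while length $s;
    return @arr;
}

my @arr = substr_consume("abcdefgh0", 8);
print join("|", @arr), "\n";    # prints: abcdefgh|0
```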

    More seriously, though, not all the subs in your OP are equivalent: 'substr_map' will truncate any string at a multiple of eight characters, while the others will include the extra characters in the final element of the array (Fengor's my @arr = $str =~ /(........)/g has the same problem).
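The difference can be seen with a string whose length is not a multiple of eight. The sketch below uses an assumed 19-character input and four of the techniques from the thread.

```perl
use strict;
use warnings;

my $str = "abcdefgh12345678xyz";    # 19 characters: two full chunks plus a 3-character tail

# Range-based map, as in 'substr_map': the range stops before the partial chunk.
my @by_map = map { substr($str, $_ * 8, 8) } (0 .. length($str) / 8 - 1);

# Fixed-width regex: only complete 8-character matches are returned.
my @by_regex = $str =~ /(........)/g;

# C-style substr loop: the final substr returns whatever is left.
my @by_loop;
for (my $i = 0; $i < length $str; $i += 8) {
    push @by_loop, substr($str, $i, 8);
}

# unpack with a repeated group also returns the short tail.
my @by_unpack = unpack('(A8)*', $str);

print "map:    ", scalar(@by_map),    " chunks\n";   # 2
print "regex:  ", scalar(@by_regex),  " chunks\n";   # 2
print "loop:   ", scalar(@by_loop),   " chunks\n";   # 3
print "unpack: ", scalar(@by_unpack), " chunks\n";   # 3
```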

      thx for pointing out. what about
      'regexpad' => sub {
          # padding the string
          my $padding = 8 - length($str%8) if length($str%8); #has to be 8 - modulo not modulo, thx johngg
          $str .= "x" x $padding;
          # dividing the string in parts of 8 chars
          my @arr = $str =~ /(........)/g;
          #remove padding
          $arr[-1] = substr($arr[-1],-$padding);
      }
      although it's a bit slower than the other 2 regex solutions:
                       Rate split_pos grep_split substr_map regexpad regexo  regex substr_loop unpack
      split_pos      5841/s        --       -64%       -70%     -75%   -76%   -76%        -79%   -80%
      grep_split    16129/s      176%         --       -17%     -30%   -33%   -34%        -42%   -45%
      substr_map    19531/s      234%        21%         --     -15%   -19%   -20%        -30%   -33%
      regexpad      22936/s      293%        42%        17%       --    -5%    -6%        -17%   -22%
      regexo        24038/s      312%        49%        23%       5%     --    -1%        -13%   -18%
      regex         24272/s      316%        50%        24%       6%     1%     --        -13%   -17%
      substr_loop   27778/s      376%        72%        42%      21%    16%    14%          --    -5%
      unpack        29240/s      401%        81%        50%      27%    22%    20%          5%     --
      Edit: corrected padding


        I think your "padding" algorithm might be a bit wonky. Given a string of length, say, 19 characters, you would arrive at a $padding value of 3, thus padding your $str with three "x"s to end up with a length of 22, not 24 as I think you wanted. This should work (not tested):

        my $padding = 8 - ($str % 8);

        The remove padding part would be something like (again, not tested)

        substr $arr[-1], -$padding, $padding, q{} if $padding;

        Cheers,

        JohnGG

        Update: I must have been half-asleep; where's the length call? Line should be

        my $padding = 8 - (length($str) % 8);

        You can't do modulo on a string :)

        $ perl -le '$str = q{abc}; $pad = $str % 8; print $pad;'
        0
        $ perl -le '$str = q{abcdefghijkl}; $pad = $str % 8; print $pad;'
        0
        $
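Putting the corrections from this sub-thread together, a padded-regex version might look like the sketch below. The name regexpad_chunks is hypothetical, and the guard for lengths that are already a multiple of 8 is an added assumption: without it, 8 - (length($str) % 8) evaluates to 8 and a whole chunk of padding would be appended.

```perl
use strict;
use warnings;

# regexpad_chunks is a hypothetical name; the "x" pad character follows the post above.
sub regexpad_chunks {
    my ($str) = @_;                        # work on a copy so the caller's string is untouched
    my $rest    = length($str) % 8;
    my $padding = $rest ? 8 - $rest : 0;   # 0 when the length is already a multiple of 8
    $str .= "x" x $padding;
    my @arr = $str =~ /(........)/gs;
    # Strip the pad characters from the end of the last chunk.
    substr($arr[-1], -$padding, $padding, q{}) if $padding;
    return @arr;
}

my @arr = regexpad_chunks("abcdefgh12345678xyz");   # 19 characters
print join("|", @arr), "\n";                        # prints: abcdefgh|12345678|xyz
```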

Node Type: perlquestion [id://586695]
Approved by Limbic~Region