haukex has asked for the wisdom of the Perl Monks concerning the following question:
I think I've found an interesting optimization, but even after some digging here on PerlMonks, the Perl docs,
and looking through the Camel I haven't yet found it documented*. Of course, it's also possible that I made a mistake
in my benchmark, or that this optimization is common knowledge, in which case I would be happy to be enlightened :-)
<update>
* Because it isn't there :-) It appears that the answer is that while split is quite fast, it still splits the string
into an array before iterating over that array (that's what the memory consumption seems to show). The filehandle method
proposed by Laurent_R seems to be the best way to go about my task instead, assuming what you're splitting
on is a fixed string. See the all the replies below for more details.
</update>
I had a multiline string and wanted to iterate through the lines, and became curious what the fastest way to do that was.
I was pleasantly surprised that in my benchmark, on v5.24.1, for (split /\n/, $string) was fastest, despite my original
worry that it might split the string into a long list before iterating over it.
I know that foreach (1..1000000) has been optimized to use an iterator instead of building a huge list
since 5.005, and I found some discussion of the optimization of
split // in this thread.
I also found several references to @x = split ... being optimized, which I think might be the reason that
for (split ...) is so fast. I don't know how long this optimization has been present, but I found, for example,
commit e4e95921cd0fd0, which seems to indicate it's been
present since Perl 3. If anyone knows more and wants to set the record straight, please do so! :-)
use warnings;
use strict;
use Data::Dump qw/dd pp/;
use Benchmark qw/cmpthese/;
# example output:
# 5.024001
# Rate regex index split
# regex 9.64/s -- -8% -41%
# index 10.5/s 9% -- -36%
# split 16.4/s 70% 56% --
my $str = "\nFoo\n\nBar Quz\nBaz\nx" x 50000;
use constant TEST => 0;
my $expect = join "\0", split /\n/, $str;
$expect=~s/o/i/g;
#dd [split /\n/, $str], $expect;
dd $];
cmpthese(-2, {
split => sub {
my @lines;
my @x = split /\n/, $str;
#@x = map {$_} @x; # significant slowdown
#for my $line (map {$_} split /\n/, $str) { # still fairly fas
+t
for my $line (@x) {
$line=~s/o/i/g;
push @lines, $line;
}
if (TEST) { die pp(@lines) unless $expect eq join "\0", @lines
+ }
},
regex => sub {
my @lines;
pos($str)=0;
#while ($str=~/^(.*)$/mgc) { # slower
while ($str=~/\G(?|(.*?)\n|(.+)\z)/gc) {
my $line = $1;
$line=~s/o/i/g;
push @lines, $line;
}
if (TEST) {
die unless pos($str)==length($str);
die pp(@lines) unless $expect eq join "\0", @lines;
}
},
index => sub {
my @lines;
for ( my ($pos,$nextpos) = (0); $pos<length($str);
$pos=$nextpos+1 ) {
$nextpos = do { my $i=index($str,"\n",$pos);
$i<0?length($str):$i };
my $line = substr $str, $pos, $nextpos-$pos;
$line=~s/o/i/g;
push @lines, $line
}
if (TEST) { die pp(@lines) unless $expect eq join "\0", @lines
+ }
},
});
Re: Is foreach split Optimized?
by Laurent_R (Canon) on Jul 09, 2017 at 11:56 UTC
|
filehandle => sub {
my @lines;
open my $str_fh, "<", \$str or die "cannot open fh $!";
while (<$str_fh>) {
chomp;
s/o/i/g;
push @lines, $_;
}
},
This appears to be faster than your other subs, including the one with split (note that I had to change the count value for the result to be meaningful):
$ perl bench_split.pl
s/iter regex index split filehandle
regex 3.44 -- -9% -36% -48%
index 3.13 10% -- -29% -43%
split 2.21 56% 42% -- -19%
filehandle 1.78 93% 76% 24% --
| [reply] [d/l] [select] |
|
5.024001
Rate regex index filehandle split
regex 9.65/s -- -7% -34% -39%
index 10.4/s 8% -- -29% -34%
filehandle 14.6/s 51% 40% -- -7%
split 15.7/s 63% 51% 8% --
I haven't yet gotten around to experimenting with other versions of Perl. | [reply] [d/l] |
|
This was with version 5.14, using Cygwin. I will try with a more recent version under Windows.
Update This very strange: running the exact same program on the same computer but under a pure Windows setting and perl 5, version 24, subversion 1 (v5.24.1) and from a bash console, I obtain awfully bad results for the file handle solution:
s/iter filehandle regex index split
filehandle 100 -- -97% -97% -98%
regex 3.50 2758% -- -14% -42%
index 3.02 3217% 16% -- -33%
split 2.02 4868% 74% 50% --
I can only suspect that there might be something wrong with the management of Windows end-of-line pairs of characters.
Still under Windows, same program, from a dos windows, with version v5.16.3:
s/iter regex index split filehandle
regex 3.69 -- -2% -29% -39%
index 3.63 2% -- -27% -38%
split 2.63 40% 38% -- -15%
filehandle 2.25 64% 61% 17% --
I also tried (again on Cygwin with Perl version v5.14.4) this additional sub:
filehandle2 => sub {
open my $str_fh, "<", \$str or die "cannot open fh $!";
my @lines = <$str_fh>;
for (@lines) {
chomp;
s/o/i/g;
;
}
},
but the result is not so good as the first filehandle solution (but still better than the others):
$ perl bench_split.pl
s/iter regex index split filehandle2 fi
+lehandle
regex 3.44 -- -10% -35% -45%
+ -48%
index 3.11 11% -- -28% -39%
+ -42%
split 2.24 53% 39% -- -16%
+ -20%
filehandle2 1.89 82% 65% 19% --
+ -5%
filehandle 1.80 92% 73% 25% 5%
+ --
Finally, I also tried this solution with a map:
filehandle3 => sub {
open my $str_fh, "<", \$str or die "cannot open fh $!";
my @lines = map { chomp; s/o/i/g } <$str_fh>;
},
and thought it might be faster, but this turns out to be slower than all the other solutions:
$ perl bench_split.pl
s/iter filehandle3 regex index split fi
+lehandle
filehandle3 3.64 -- -7% -16% -39%
+ -51%
regex 3.38 8% -- -9% -34%
+ -47%
index 3.07 18% 10% -- -28%
+ -42%
split 2.22 64% 52% 38% --
+ -20%
filehandle 1.77 105% 90% 73% 25%
+ --
| [reply] [d/l] [select] |
|
perlbrew exec bench_script.pl
perl-5.10.1
==========
Rate regex index split filehandle
regex 3.95/s -- -7% -40% -53%
index 4.27/s 8% -- -35% -50%
split 6.54/s 66% 53% -- -23%
filehandle 8.46/s 114% 98% 29% --
perl-5.12.5
==========
Rate index regex split filehandle
index 3.96/s -- -10% -39% -49%
regex 4.41/s 11% -- -32% -43%
split 6.50/s 64% 47% -- -16%
filehandle 7.69/s 94% 74% 18% --
perl-5.14.4
==========
Rate index regex split filehandle
index 3.67/s -- -14% -47% -49%
regex 4.25/s 16% -- -38% -41%
split 6.86/s 87% 62% -- -5%
filehandle 7.25/s 97% 71% 6% --
perl-5.16.3
==========
Rate index regex filehandle split
index 3.23/s -- -19% -45% -50%
regex 3.98/s 23% -- -32% -38%
filehandle 5.83/s 81% 46% -- -9%
split 6.44/s 100% 62% 10% --
perl-5.18.4
==========
Rate index regex split filehandle
index 3.62/s -- -1% -45% -46%
regex 3.65/s 1% -- -44% -46%
split 6.57/s 82% 80% -- -3%
filehandle 6.76/s 87% 85% 3% --
perl-5.20.3
==========
Rate index regex split filehandle
index 3.50/s -- -10% -44% -45%
regex 3.90/s 11% -- -38% -38%
split 6.25/s 79% 60% -- -1%
filehandle 6.31/s 80% 62% 1% --
perl-5.22.3
==========
Rate index regex split filehandle
index 3.62/s -- -0% -38% -48%
regex 3.64/s 0% -- -38% -47%
split 5.88/s 63% 62% -- -15%
filehandle 6.90/s 91% 90% 17% --
perl-5.24.1
==========
Rate regex index filehandle split
regex 2.97/s -- -21% -42% -59%
index 3.77/s 27% -- -26% -47%
filehandle 5.08/s 71% 35% -- -29%
split 7.18/s 142% 90% 41% --
perl-5.26.0
==========
Rate index regex split filehandle
index 3.00/s -- -25% -49% -53%
regex 3.98/s 33% -- -33% -37%
split 5.91/s 97% 49% -- -6%
filehandle 6.31/s 111% 59% 7% --
| [reply] [d/l] |
|
Re: Is foreach split Optimized?
by ikegami (Patriarch) on Jul 09, 2017 at 17:44 UTC
|
I also found several references to @x = split ... being optimized
That's just a normal call to `split`. It's actually the following that's optimized:
($x, $y, $z) = split ...;
It only finds and returns the first three items instead of all of them. Obviously, that optimization can't be used for
@a = split ...;
This post lists the for optimizations. | [reply] [d/l] [select] |
|
Thank you for the link! Taken together with the AM's post here, it does seem that I was just surprised by the speed of split, causing me to initially read too much into the commit messages regarding @x = split, like 692044df and 5012eebe.
| [reply] [d/l] |
Re: Is foreach split Optimized?
by Marshall (Canon) on Jul 09, 2017 at 11:32 UTC
|
I am curious about your benchmark program.
With my $str = "\nFoo\n\nBar Quz\nBaz\nx" x 500000;
I get:
5.020002
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
s/iter index regex split
index 26.5 -- -61% -66%
regex 10.3 158% -- -12%
split 9.05 193% 13% --
which basically means that the results don't matter.
Code as with
my $str = "\nFoo\n\nBar Quz\nBaz\nx" x 50000;
Your benchmark default.
5.020002
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
Rate index regex split
index 1.16/s -- -21% -34%
regex 1.47/s 27% -- -16%
split 1.74/s 51% 19% --
I am not sure what this means if anything. | [reply] [d/l] [select] |
|
(warning: too few iterations for a reliable count)
You can get rid of those warnings by increasing the runtime by saying e.g. cmpthese(-5, .... I did fiddle with it quite a bit, adjusting the string length, runtime, etc. and the results were always consistent that split was fastest by a significant margin.
| [reply] [d/l] |
|
5.020002
Rate index regex split split2
index 1.17/s -- -23% -35% -43%
regex 1.51/s 29% -- -16% -26%
split 1.79/s 53% 18% -- -12%
split2 2.03/s 74% 35% 14% --
split2 uses tr/// instead of regex s///...
| [reply] [d/l] [select] |
|
|
Re: Is foreach split Optimized?
by Anonymous Monk on Jul 09, 2017 at 20:05 UTC
|
The question seems to be, "does perl split the entire string before beginning the for loop?" We can answer that by looking at how much memory it is using.
use strict;
use warnings;
my $n = 10_000_000;
system "ps -orss $$";
my $x = 'foo' x $n;
system "ps -orss $$";
for my $y (split /f/, $x) {
system "ps -orss $$";
last;
}
The answer seems to be, "yes."
RSS
3712
RSS
33028
RSS
674900
| [reply] [d/l] [select] |
|
Thank you, this is probably the best way of looking at it! I took your idea and expanded on it, below. It seems that the answer to my original question is No, there isn't any magic going on with for (split ...), it's just that while it may be a memory hog, split is still pretty fast - that's what threw me off initially.
#!/usr/bin/env perl
use warnings;
use 5.014;
my $cnt = 5000000;
system($^X,'-sE',<<'_END_','--',"-cnt=$cnt");
say `ps -orss $$`=~s/\s+/ /gr; # RSS 4340
$x = "abc\n" x $cnt;
say `ps -orss $$`=~s/\s+/ /gr; # RSS 23936
@y = ("abc") x $cnt;
say `ps -orss $$`=~s/\s+/ /gr; # RSS 455604
_END_
say "---";
system($^X,'-sE',<<'_END_','--',"-cnt=$cnt");
say `ps -orss $$`=~s/\s+/ /gr; # RSS 4544
$x = "abc\n" x $cnt;
say `ps -orss $$`=~s/\s+/ /gr; # RSS 23972
@y = split /\n/, $x;
say `ps -orss $$`=~s/\s+/ /gr; # RSS 423544
_END_
say "---";
system($^X,'-sE',<<'_END_','--',"-cnt=$cnt");
say `ps -orss $$`=~s/\s+/ /gr; # RSS 4540
$x = "abc\n" x $cnt;
say `ps -orss $$`=~s/\s+/ /gr; # RSS 23964
for (split /\n/, $x) {
say `ps -orss $$`=~s/\s+/ /gr; # RSS 457512
last }
_END_
say "---";
system($^X,'-sE',<<'_END_','--',"-cnt=$cnt");
say `ps -orss $$`=~s/\s+/ /gr; # RSS 4336
$x = "abc\n" x $cnt;
say `ps -orss $$`=~s/\s+/ /gr; # RSS 23932
open my $fh, '<', \$x or die $!;
while (<$fh>) {
say `ps -orss $$`=~s/\s+/ /gr; # RSS 24000
last }
_END_
| [reply] [d/l] [select] |
A reply falls below the community's threshold of quality. You may see it by logging in. |
|
|