Mano_Man has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I have a question regarding the cost in performance of the functions say/print:
Is this alternative:
my $toPrint = "";
if (...) {
    $toPrint .= "Error message1\n";
}
if (...) {
    $toPrint .= "Error message2\n";
}
print $toPrint if $toPrint;
better than just printing each message out on the spot? This is of course only relevant for a much bigger $toPrint, with much longer text.
Secondly - is there a limit on how much a string like $toPrint can hold? Can I spam $toPrint with, let's say, 20,000 lines?
Is this even a good idea, or is there a better way ?
Appreciate any help O, great monks.
Mano.
Re: Performance In Perl
by Discipulus (Canon) on Mar 15, 2017 at 08:49 UTC
Hello Mano_Man and welcome to the Monastery!
You'll surely receive detailed answers, but my feeling is that the maximum length of a string is bounded by your RAM: see Maximum string length.
Then, in terms of performance, I suspect it is much more convenient to print such messages out as soon as possible, without accumulating them in a variable.
In fact, $toPrint will keep growing, consuming more RAM with every append you make to it.
In the other scenario you do not even need a variable: you just print out a string and it is gone.
With modern hardware, though, I suspect 20k lines are an affordable task, both to print and to read.
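The two styles under discussion can be sketched side by side (a minimal illustration with made-up constant conditions, not the OP's real code):

```perl
use strict;
use warnings;

# Style 1: accumulate into a variable, print once (the OP's proposal).
# The string grows in RAM until the single print at the end.
my $toPrint = '';
$toPrint .= "Error message1\n" if 1;    # hypothetical condition
$toPrint .= "Error message2\n" if 0;    # hypothetical condition
print $toPrint if length $toPrint;

# Style 2: print each message on the spot -- no growing string in RAM.
print "Error message1\n" if 1;          # same hypothetical conditions
print "Error message2\n" if 0;
```

Both styles produce the same output here; they differ only in when the bytes leave the program and how much memory sits in the accumulator meanwhile.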
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
There are really two questions to answer:
- Is it faster to call print/say once with several megabytes of data, or many more times with small amounts of data each time?
- How big a string can you build before you run out of memory?
To answer the second question first, Perl has no built-in limits for string size. Try this simple test program using the splendid Devel::Size module:
use strict;
use warnings;
use Devel::Size qw(size);

my $str = '';
for (1 .. 1_000_000) {
    $str .= "x" x 80;
}
print "ASCII string has @{[ length($str) ]} characters and consumes @{[ size($str) ]} bytes\n";

$str = '';
for (1 .. 1_000_000) {
    $str .= "\N{U+1234}" x 80;
}
print "Unicode string has @{[ length($str) ]} characters and consumes @{[ size($str) ]} bytes\n";
On my machine, the output is
ASCII string has 80000000 characters and consumes 89779352 bytes
Unicode string has 80000000 characters and consumes 273984424 bytes
So even with one million 80-character lines, you're only using a couple hundred megabytes of RAM.
To answer the I/O speed question, you can try benchmarking it like the program below:
use strict;
use warnings;
use feature "say";
use Benchmark;
use Devel::Size qw(size);

my $t0 = Benchmark->new;
my $total_bytes_chunked = 0;
for (1 .. 1_000_000) {
    my $str = 'x' x 80;
    #$total_bytes_chunked += size($str);
    say STDERR $str;
}

my $t1 = Benchmark->new;
my $str = '';
for (1 .. 1_000_000) {
    $str .= 'x' x 80 . "\n";
}
my $total_bytes_lump = 0;
#$total_bytes_lump = size($str);
print STDERR $str;
my $t2 = Benchmark->new;

say "Printing in small chunks ($total_bytes_chunked bytes): @{[ timestr(timediff($t1, $t0)) ]}";
say "Printing one big chunk ($total_bytes_lump bytes): @{[ timestr(timediff($t2, $t1)) ]}";
If you run it, redirect STDERR or the comparison is meaningless: perl test.pl 2>/dev/null.
Be warned that it may be a false comparison nonetheless. On my machine, printing one big lump is faster than printing one million small chunks. However, if you uncomment the size() calls to see how much the total string sizes differ, you'll find the first loop suddenly takes four times longer, because it's doing a lot more calculation at each loop iteration.
Probably the only right way to answer your question is to try both in your program and with your input and see which one performs faster. It really depends on how much you can afford to keep in memory and how much computation you need to do for each individual chunk to print.
Thank you for the speedy replies. I've just checked it: the difference is about 30% in performance, which is a lot. Of course, for small prints this is negligible.
Thank you !
Re: Performance In Perl
by Eily (Monsignor) on Mar 15, 2017 at 09:51 UTC
It depends on where your prints go. In any case, there's probably some buffering going on (unless your handle is hot). If your output does not go directly to a terminal, that buffering does pretty much the same thing you are trying to do, with the question of the chunk size already handled. If your output goes to a terminal, then it is flushed every time a \n is encountered, and you might benefit from constructing bigger messages. This is a candidate for Benchmarking, as shown by vrk.
NB: I tried the following: perl -E "for (1..10) { say 'Hello'; sleep(1); }" > test.txt with tail -f test.txt. It confirmed that redirecting STDOUT to a file removes line-buffered mode (even with $| = 0, the "Hello"s are displayed straight away when printing to the console). That is: redirected, I got all the lines at once.
More information on buffering here
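When the default buffering described above is the wrong fit, it can be switched off per handle. A minimal sketch (the log path is made up; both lines shown are equivalent ways to enable autoflush):

```perl
use strict;
use warnings;
use IO::Handle;   # provides the autoflush() method on file handles

# On a terminal STDOUT is line-buffered; redirected to a file it is
# block-buffered, so output may sit in the buffer until exit.
# Either form below makes the handle flush after every print:
$| = 1;                  # classic special variable; affects the
                         # currently selected handle (STDOUT here)
STDOUT->autoflush(1);    # method equivalent via IO::Handle

print "this line reaches the terminal or file immediately\n";
```

Note that turning autoflush on trades throughput for timeliness: every print becomes a write syscall, which is exactly the cost the default buffering exists to amortize.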
Re: Performance In Perl
by kcott (Archbishop) on Mar 16, 2017 at 00:24 UTC
G'day Mano,
Welcome to the Monastery.
In general, calling print 20,000 times with individual records
will be slower than calling it once with all records.
I ran the following Benchmark several times.
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use constant {
    LINES  => 20_000,
    RECORD => 'X' x 100 . "\n",
};

use Benchmark 'cmpthese';

open my $fh, '>>', '/dev/null';

cmpthese 0 => {
    singly => sub {
        print $fh RECORD for 1 .. LINES;
    },
    concat => sub {
        print $fh join '', (RECORD) x LINES;
    },
    list => sub {
        print $fh +(RECORD) x LINES;
    },
    string => sub {
        print $fh RECORD x LINES;
    },
};
Here's a representative result:
         Rate singly  list concat string
singly  437/s     --  -64%   -71%   -91%
list   1205/s   176%    --   -20%   -74%
concat 1497/s   243%   24%     --   -68%
string 4720/s   981%  292%   215%     --
You didn't give any indication of record size (error messages can vary wildly in length):
I just used 100 'X's (plus a newline).
If that's a reasonable guess, I don't imagine you'd have any problem with ~2MB of data
(either holding it in memory or passing it to print).
As you can see, printing every record singly was slower than the other methods.
A single print with concatenated records appears a little faster than using a list;
however, that wasn't the case in all runs: I'd consider these too close to call.
Also bear in mind that, because I've used constant values, Perl may have performed some
optimisations at compile time.
Consider what other code is involved as you capture records and add them to a string
or use them to populate an array.
There are some other factors to take into consideration.
Is this a one-off run? If not, how frequently is it run?
How long does the entire process take to run?
Is it being run by multiple processes at the same time?
Are there other users on the system? How might this affect them?
Although printing records individually may be slower in the benchmark scenario I present,
if done correctly, this method should have a substantially smaller memory footprint.
In addition, spreading the printing tasks over the life of the process,
may mean it plays more nicely with other, concurrent processes.
There's a fair amount to think about.
I'd recommend writing your own benchmark, using more representative data,
and running it in an environment that's closer to one in which the code will actually be run.
See also: "perlperf - Perl Performance and Optimization Techniques".
Thank you for your excellent answer.
Re: Performance In Perl
by Ratazong (Monsignor) on Mar 15, 2017 at 11:30 UTC
Hi Mano_Man
I assume it is mostly a design decision:
- do you need the info in $toPrint later?
- is it a benefit to use the say/print at only one place?
- do you want to give error-messages as soon as possible?
In my experience, the performance differences are negligible (but your use-case might be different).
However, I ran into an issue in the past when redirecting the output to a file using > in a Windows batch file:
IIRC it didn't work when printing a text that was too long; it seemed that print produced the data much faster than it
could be written to the file. Unfortunately, I can't find the old script where this happened... but it might be a thing you want to check/consider.
HTH, Rata
Re: Performance In Perl
by afoken (Chancellor) on Mar 17, 2017 at 07:25 UTC
Ignore the performance part of your code for now, you got plenty of good answers. But nobody has yet mentioned a much deeper problem:
my $errorlog = '';
if (...) {
    $errorlog .= "Oops\n";
}
if (...) {
    $errorlog .= "Oh noes\n";
}
foo(...);
if (...) {
    $errorlog .= "Outch\n";
}
if ($errorlog ne '') {
    print STDERR $errorlog;
} else {
    say 'All done, bye.';
}
This is your basic idea, right?
Now imagine that this (pseudo-)script crashes seemingly randomly. Look into your error log. You see NOTHING. The script MUST NOT crash before the very end, or the error log is never written. Unfortunately, crashes and bugs usually ignore such rules and occur anywhere in your code.
Now, let's do it right. Don't print to STDERR, use warn and die as intended. Yes, both finally write to STDERR, but you can catch both if you want (eval, $SIG{__DIE__}, $SIG{__WARN__}). But that's not the point. The point is that STDERR is unbuffered. Everything you write there ends in the log, ASAP. So:
# We don't need that. It's just wrong: my $errorlog='';
if (...) {
    warn 'Oops'; # Oh, by the way: unless you add "\n", you will see file and line in the log.
}
if (...) {
    warn 'Oh noes';
}
foo(...);
if (...) {
    warn 'Outch';
}
say "All done, bye.";
Now, the last thing you see in the log is "Oh noes at example.pl line 20". And, as it turns out in my example, the condition leading to this is a check to work around a known bug in an XS module that is triggered by a certain combination of input data to foo().
And that's why those three lines should read:
if (...) {
    die 'Input data is f*cked up beyond repair. Died to prevent a crash';
}
And now, buffered vs. unbuffered.
Perl file handles are usually buffered, if only because the libc below perl buffers. Even if you spoon-feed a file character by character, perl and/or libc will usually buffer that until either the buffer is full or perl/libc decides it's time to flush. As long as only the buffer is written, everything is quite fast: it all happens in memory, in userspace. When the buffer is flushed, libc issues a syscall to actually write the file. The syscall switches to kernel mode, which is expensive, and the kernel does a lot of work to really write the file. This takes significantly more time.
An unbuffered file still uses a buffer, but it is automatically flushed after each write command. The syscall happens for every write command. This will obviously be slower than a buffered file, especially if you write character by character. But because the buffer is flushed after every write, a following crash in user space does not affect the log file. It has already been written.
What you have written here is another buffering layer that is flushed only once, at the very end of the program. Does that improve performance? Maybe a tiny bit. It also blocks RAM that could be used for better purposes, which may become significant if you append lots of data to the buffer.
What actually happens? Is writing the error log really the bottleneck? You can find out: Devel::NYTProf is an excellent tool that shows you where your code really spends its time. That's where you want to start optimizing.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Yes, you do indeed have a point with the buffering. I usually separate STDERR and STDOUT precisely because of what you mentioned.
The idea was to look for a general rule of thumb, not a divine law.
Re: Performance In Perl
by dsheroh (Monsignor) on Mar 16, 2017 at 08:37 UTC
How often will you be running this code?
Once?
Daily?
Even if you run it once a second, 24/7, for a year, the performance difference is likely to be so small that the total time saved over that entire year will be less than the time it took you to type out your question.
Optimizing for programmer time is generally far more effective than trying to micro-optimize for CPU time.
I do not agree. A code habit is a code habit, and is installed once. If you are used to writing scripts that print out large bulks of data, let's say a 700MB file, and do so multiple times a day, a 500% increase in performance as found a few comments back might well be substantial. What if multiple users run those scripts? Let's say 100 people, as is the problem in my case. Would your answer change?
In this particular case, no, it would not change. "A 500% increase in performance" is a meaningless statistic without first identifying 500% of what and, in this case, that base number is too tiny to matter, especially in comparison to the time your program will need to spend reading the 700M file from disk and doing the actual processing of its contents before it's ready to print its output.
Also, while kcott's numbers above show a 981% difference (roughly 0.2ms vs 2ms for 20k lines of output, which is to say a fraction of a microsecond per line), I note that his test builds the long strings using the x operator instead of doing 20k individual concatenations. Let's see what happens if we actually build the output string line-by-line instead, as the code in your original post does it:
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use constant {
    LINES  => 20_000,
    RECORD => 'X' x 100 . "\n",
};

use Benchmark 'cmpthese';

open my $fh, '>>', '/dev/null';

my $out;
cmpthese 0 => {
    kcott_by_line => sub {
        print $fh RECORD for 1 .. LINES;
    },
    kcott_concat => sub {
        print $fh RECORD x LINES;
    },
    append_per_line => sub {
        $out = '';
        $out .= RECORD for 1 .. LINES;
        print $fh $out;
    },
};
And the results:
                  Rate kcott_by_line append_per_line kcott_concat
kcott_by_line    716/s            --            -20%         -93%
append_per_line  900/s           26%              --         -91%
kcott_concat    9690/s         1254%            976%           --
Only a 26% difference between printing line-by-line and appending line-by-line. It seems that the primary optimization behind a single print being so much faster in kcott's test was that it built the entire output string in one operation instead of handling each line of output separately. Which is not an optimization that you would be able to apply in the case your question describes.
And, again translating this back into real numbers, the difference is 14.3 million lines/second printing them individually vs. an even 18 million lines/sec if they're concatenated first. That's 0.07 microseconds/line vs. 0.056 microseconds/line: a savings of approximately one second per 70 million lines of output, or over four billion lines to get a one-minute difference.
Whoopty-freaking-do.
How many times would each of those 100 users have to process their 700M input files for the aggregate difference to add up to the time you spent reading this reply, never mind the time I spent writing it?
This kind of micro-optimization is just not worth it in 99% of cases - and, for the other 1%, you'll get bigger gains by using C or a similar high-performance language instead of Perl, and then micro-optimizing the C code if you still need more speed at that point.
Re: Performance In Perl
by Anonymous Monk on Mar 15, 2017 at 15:48 UTC
If you want performance in your code, use Assembly and C.
If you want performance in your development time, however ... use Perl.
Your answer has completely missed the mark. The problem at hand was refining Perl code performance. There could be quite a few reasons why you'd want to stick with a specific language and optimize it, instead of switching to a lower-level language.