perlmeditation
betterworld
<p>Dear monks,</p>
<p>
Recently, I learned something from [ikegami] in [id://711100].
</p>
<p>
Apparently, the memory that is (directly) occupied by my-variables is <strong>never
automatically freed</strong>. The term "directly" includes the case where a
my-variable holds a (very long) string.
</p>
<p>
As [ikegami] has [id://711103 |demonstrated], a string buffer will grow when
needed, but it will never (automatically) shrink or disappear. Apparently this is part of some kind of optimization to avoid constantly reallocating buffers for code that is used often.
</p>
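<p>
Here is a minimal sketch of how the effect can be observed from within Perl, using the core B module to read a scalar's allocated buffer size (SvLEN). The helper name <c>buffer_size</c> is made up for this illustration, and the string is built with <c>.=</c> so that the variable owns its buffer. Exact numbers depend on the perl version and its copy-on-write behaviour, so treat this as a demonstration under those assumptions rather than a guarantee.
</p>
<c>
#!/usr/bin/perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report the allocated buffer size (SvLEN) of the
# scalar behind $ref, in bytes. Returns 0 if the scalar has no string slot.
sub buffer_size {
    my ($ref) = @_;
    my $sv = B::svref_2object($ref);
    return $sv->can('LEN') ? $sv->LEN : 0;
}

my $var = '';
$var .= 'x' x 1_000_000;            # grow $var's own buffer

my $big = buffer_size(\$var);       # allocated size while holding 1 MB

$var = 'y';                         # now holds a single character
my $after_short = buffer_size(\$var);

undef $var;                         # explicitly release the buffer
my $after_undef = buffer_size(\$var);

print "holding 1 MB:       $big\n";
print "after short assign: $after_short\n";
print "after undef:        $after_undef\n";
</c>
<p>
The interesting comparison is between the second and third numbers: assigning a short string leaves the allocation alone, while <c>undef</c> gives it back.
</p>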
<p>I've [id://711118 |shown] that a buffer will only be <strong>reused <em>for the
same</em></strong> variable. This means that it will not be reused for a variable in
another subroutine. This is where I see a big problem. In large programs there are thousands of lexical variables, and not all of them are used more than a few times, but all of them retain their buffers. <strong>Even for a buffer that is reused often</strong>, this is not optimal: once it has held a large chunk of data, it stays at that size, regardless of how small the data is in subsequent calls.</p>
<p>
I've done my homework and run a Super Search. As a matter of fact, this topic has been discussed before ([id://108482], [id://104920]).
</p>
<p>
The commonly suggested workarounds are:
</p>
<ul>
<li>undef-ing variables after using them. (Actually this is not always
practicable, e.g. in the case where you want to <tt>return</tt> the scalar)</li>
<li>Designing the code so that it works on references or aliases.</li>
</ul>
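<p>
The two workarounds above might look roughly like this. The subroutine names and the checksum task are invented for the example; this is only a sketch of the idea, not code taken from the linked threads.
</p>
<c>
#!/usr/bin/perl
use strict;
use warnings;

# Workaround 1: release the lexical's buffer before leaving the sub.
# (Only possible because $data is not what we return.)
sub checksum_of_copy {
    my ($data) = @_;                    # may copy the caller's string
    my $sum = unpack '%32C*', $data;    # 32-bit sum of all bytes
    undef $data;                        # give the buffer back right away
    return $sum;
}

# Workaround 2: work on a reference, so the sub never holds
# its own copy of the string in a lexical.
sub checksum_by_ref {
    my ($data_ref) = @_;
    return unpack '%32C*', $$data_ref;
}

my $payload  = 'x' x 1_000;
my $sum_copy = checksum_of_copy($payload);
my $sum_ref  = checksum_by_ref(\$payload);

print "copy: $sum_copy, ref: $sum_ref\n";
</c>
<p>
Both variants compute the same value; they differ only in whether a large lexical buffer is left behind in the subroutine.
</p>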
<p>Alright, but the problem is that code is generally not designed like this. Of
course, you can design your code this way if you plan to handle large data.
However, almost all serious projects use external code that they haven't written
themselves. In my search for the most obvious example, I found [mod://Encode|Encode.pm]:</p>
<readmore>
Consider this code:
<c>
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
sub init {
encode('utf8', 'x' x 100_000_000);
return ();
}
print "starting\n";
sleep 5;
print "initializing\n";
init();
print "initialized\n";
sleep 5;
print "cleaning\n";
undef &Encode::encode;
sleep 5;
</c>
<p>
If you watch this program's memory consumption, you'll find that it uses
approximately 288MB after "initialized" has been printed. After "cleaning" has
been printed, the amount shrinks considerably, to 98MB. (Actually it will shrink
even more if you wipe out the "init" subroutine itself; I guess this is because
of the large string constant.)
</p>
<p>The code responsible is in [mod://Encode|Encode.pm]:</p>
<c>
sub encode($$;$)
{
my ($name, $string, $check) = @_;
# ...
my $octets = $enc->encode($string,$check);
$_[1] = $string if $check and !($check & LEAVE_SRC());
return $octets;
}
</c>
<p>Both <c>$string</c> and <c>$octets</c> hold our dear string, and (like me) the author evidently saw no need to free their memory.</p>
<p>
I've named the subroutine "init" to suggest that this is code that will only be
used at the very start of a long program lifetime, which means that the long
string buffer will linger around needlessly.
</p>
<p>So, what am I supposed to do? Avoid Encode and <strong>do my character
transcoding myself?</strong> Or should I actually <strong>use "undef" to clean all the subs</strong> that
I have used? Consider that my initialization code loads an XML configuration
file. I'd have to clean most of the namespaces of XML::Simple, XML::Parser and
whatever else. And if I actually plan to continue using these modules, I'd have to
wipe out "%INC" and then require them again, which is not very nice. (Just an example; I've
not really checked these modules, so please don't be offended if you are the
author and have considerately undef-ed every variable.)</p>
</readmore>
<p>I've not looked for this optimization in the perl source yet, but I'd really
like someone to explain why it is needed. I can agree that it would not be performant
to do a lot of malloc/mmap/munmap/brk for every string that is copied, however
IMHO there are situations where perl should find some way to realize that any of
the following cannot be performant either:</p>
<ul>
<li>Holding (several) SVs in memory that occupy lots and lots of memory
pages</li>
<li>Holding something like 1000 SVs in memory, none of which has been used more than
once</li>
</ul>
<p>
I'll conclude with an example snippet that demonstrates how you can eventually
get your computer to use excessive amounts of memory or even swap:
</p>
<c>
perl -lwe 'my $code = join "", map {
"sub foo$_ { my \$var = q(x) x 1_000_000; }" } 1..1000;
eval $code; die if $@;
for (1..1000) { sleep 1; "foo$_"->() }'
</c>
<p>
Because of the "sleep", you can run this snippet and watch it indulge itself by
eating one megabyte per second (on GNU/Linux, run "top" and press M to sort by memory usage).</p>
<readmore>
<p>The snippet uses string-eval to generate a lot of subroutines like this:</p>
<c>
sub foo1 { my $var = q(x) x 1_000_000; }
sub foo2 { my $var = q(x) x 1_000_000; }
# and so forth...
</c>
<p>It then calls them one after another.</p>
<p>Well, that was a large chunk of text. I've tried to ease your reading by using bold text. I hope that perlmonks' buffers will eventually be freed from this text, and that I haven't missed something obvious.</p>
</readmore>