No garbage collection for my-variables

Dear monks,

recently, I've learned something from ikegami in Re: out of memory problem.

Apparently, the memory that is (directly) occupied by my-variables is never automatically freed. The term "directly" includes the case where a my-variable holds a (very long) string.

As ikegami has demonstrated, a string buffer will grow when needed, but it will never (automatically) shrink or disappear. Apparently this is part of some kind of optimization to avoid constantly reallocating buffers for code that is used often.

I've shown that a buffer will only be reused for the same variable. This means that it will not be reused for a variable in another subroutine. This is where I see a big problem. In large programs there are thousands of lexical variables, and not all of them are used more than a few times, but all of them retain their buffers. Even for a buffer that is reused often, this is not optimal: once it has held a large chunk of data, it will stay at that size, regardless of how small the data is in later calls.
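For anyone who wants to see the retained buffer directly, here is a minimal sketch that inspects the allocated size of a my-variable on entry to its sub, using the core B module. The helper names buffer_len and fill are made up for this illustration, and the exact numbers depend on the perl build, so treat it as a probe rather than a specification.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use B ();

# Allocated size (SvLEN) of a scalar's string buffer, or 0 if the
# scalar has never been upgraded to hold a string.
# (buffer_len is a hypothetical helper, not part of any module.)
sub buffer_len {
    my ($ref) = @_;
    my $sv = B::svref_2object($ref);
    return $sv->can('LEN') ? $sv->LEN : 0;
}

# Reports how large the buffer of $buf already is when the sub is
# entered, i.e. before this call has stored anything in it.
sub fill {
    my ($n) = @_;
    my $buf;
    my $before = buffer_len(\$buf);
    $buf .= 'x' x $n;    # .= on an undefined value does not warn
    return $before;
}

my $first  = fill(1_000_000);   # nothing allocated yet on entry
my $second = fill(10);          # sees what the first call left behind
my $third  = fill(10);          # a small string did not shrink it

print "buffer on entry to call 1: $first bytes\n";
print "buffer on entry to call 2: $second bytes\n";
print "buffer on entry to call 3: $third bytes\n";
```

The variable is undefined at the start of every call, yet the allocation made by the first call is still attached to it in the later ones.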

And I've done my homework and done a super search. As a matter of fact, this topic has been discussed before (Garbage collection of 'my' variables, Re: Tracking Memory Leaks).

The commonly suggested workarounds are:

undef-ing variables after using them. (Actually this is not always practicable, e.g. in the case where you want to return the scalar)
Designing the code so that it works on references or aliases.
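Both workarounds can be sketched in a few lines. This is only an illustration: buffer_len and append_marker are hypothetical names, and the B module is used purely to peek at the allocated size.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use B ();

# Allocated size (SvLEN) of a scalar's string buffer, or 0 if none.
# (Hypothetical helper for this illustration.)
sub buffer_len {
    my ($ref) = @_;
    my $sv = B::svref_2object($ref);
    return $sv->can('LEN') ? $sv->LEN : 0;
}

# Workaround 1: undef the variable once the data is no longer needed.
my $big = '';
$big .= 'x' x 1_000_000;
my $held = buffer_len(\$big);          # at least a megabyte

undef $big;                            # releases the buffer itself
my $after_undef = buffer_len(\$big);

print "held: $held bytes, after undef: $after_undef bytes\n";

# Workaround 2: hand the callee a reference, so that no my-variable
# inside the callee ever holds its own copy of the string.
sub append_marker {
    my ($text_ref) = @_;               # copies the reference only
    $$text_ref .= "\n-- processed";
    return;
}

my $doc = 'y' x 1_000;
append_marker(\$doc);
print length($doc), " characters after in-place edit\n";
```

Note that it is the undef function that frees the buffer; simply letting the variable go out of scope is the case this whole node is about.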

Alright, but the problem is that code is generally not designed like this. Of course, you can design your code this way if you plan to handle large data. However, almost all serious projects use external code that they haven't written themselves. In my search for the most obvious example, I found Encode.pm:

Consider this code:

#!/usr/bin/perl
use strict;
use warnings;
use Encode;


sub init {
    encode('utf8', 'x' x 100_000_000);

    return ();
}

print "starting\n";
sleep 5;
print "initializing\n";
init();
print "initialized\n";
sleep 5;
print "cleaning\n";
undef &Encode::encode;
sleep 5;

If you watch this program's memory consumption, you'll find that it will use approximately 288MB after "initialized" has been printed. After "cleaning" has been printed, the amount will shrink considerably to 98MB. (Actually it will shrink even more if you wipe out the "init" subroutine itself, I guess this is because of the large string constant.)

The code responsible for this is in Encode.pm:

sub encode($$;$)
{
    my ($name, $string, $check) = @_;
# ...
    my $octets = $enc->encode($string,$check);
    $_[1] = $string if $check and !($check & LEAVE_SRC());
    return $octets;
}

Both $string and $octets hold our dear string, and (like me) the author obviously thought that there was no need to free its memory.

I've named the subroutine "init" to suggest that this is code that will only be used at the very start of a long program lifetime, which means that the long string buffer will linger around needlessly.

So, what am I supposed to do? Not use Encode and do my character transcoding myself? Or should I actually use "undef" to clean all the subs that I have used? Consider that my initialization code loads an XML configuration file. I'd have to clean most of the namespaces of XML::Simple, XML::Parser and whatever else. And if I actually plan to continue using these modules, I'd have to wipe out "%INC", then require them again, which is not very nice. (Just an example; I've not really checked these modules, so please don't be offended if you are the author and have considerately undef-ed every variable.)

I've not looked for this optimization in the perl source yet, but I'd really like someone to explain why it is needed. I can agree that it would not be performant to do a lot of malloc/mmap/munmap/brk for every string that is copied, however IMHO there are situations where perl should find some way to realize that any of the following cannot be performant either:

Holding (several) SVs in memory that occupy lots and lots of memory pages
Holding something like 1000 SVs in memory, none of which has been used more than once

I'll conclude with an example snippet that demonstrates how you can eventually get your computer to use excessive amounts of memory or even swap:

perl -lwe 'my $code = join "", map {
   "sub foo$_ { my \$var = q(x) x 1_000_000; }" } 1..1000;
   eval $code; die if $@;
   for (1..1000) { sleep 1; "foo$_"->() }'

Because of the "sleep", you can run this snippet and watch it indulge itself by eating one megabyte per second (in GNU, use "top" and press M).

The snippet uses string-eval to generate a lot of subroutines like this:

sub foo1 { my $var = q(x) x 1_000_000; }
sub foo2 { my $var = q(x) x 1_000_000; }
# and so forth...

then calls them one after another.

Well, that was a large chunk of text now, I've tried to ease your reading by using bold text, I hope that perlmonks' buffers will eventually be freed from this text, and I hope that I haven't missed something obvious.


Re: No garbage collection for my-variables by Joost (Canon) on Sep 15, 2008 at 20:13 UTC
A program has only a limited number of lexical variables, but may process an unlimited amount of data.

It's the case anyway that for large strings (which is the only case we need to consider) it's much more efficient to pass around references. And code that expects to deal with very long strings generally does that, or encapsulates the strings in an object or deals with file handles directly. Copying 500MB strings around would be stupid, and not just for memory reasons, even if all the memory gets reclaimed when the variables holding them go out of scope. You really do want to pay attention to what you're doing when dealing with large chunks of memory.

Perl is optimizing here for the cases where you want fast, repeated processing of strings no larger than say 10% of your memory. If you need to process larger strings, you'll have to pay attention anyway, and automatically clearing all scalars won't really help much (and it would dramatically slow down the general case).

I don't see the current behaviour changing until someone completes a perl with a garbage collector instead of the current refcounting scheme. That would be perl 6, so it may take a while.

update: I just wanted to mention that although all of this is interesting in a way, it's very unlikely that this behaviour has given you any actual problems. Just don't slurp in giant files, or Encode a whole dictionary in one call. What's wrong with reading and writing stuff line by line? That way, you can run thousands of those programs at once without any problem (or a couple at once, so as to actually use your CPU for something useful, instead of waiting for the drive to catch up).

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: No garbage collection for my-variables by kyle (Abbot) on Sep 15, 2008 at 20:47 UTC
"I don't see the current behaviour changing until someone completes a perl with a garbage collector instead of the current refcounting scheme."

The OP is saying that you can allocate a large string, let the variable go out of scope, and the memory is not freed and not reused. The memory allocated to the variable "sticks" to it even if you never use it again. (If I have this wrong, betterworld, please correct me.)

I don't see what garbage collection has to do with this. The strings in question don't have any references to them, so the reference counter shouldn't have any problem knowing that they're not in use.

I don't know what method perl uses to grow strings. The general method I recall from my CS classes was to double the size of a string when it grows out of its buffer and halve it when it shrinks to less than a quarter of the buffer size. Maybe someone more familiar with the internals can shed some light on why that wouldn't be a good design choice for Perl.
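The textbook policy described above (double on overflow, halve when less than a quarter full) can be sketched as a pure function. This only illustrates that policy, not what perl does internally; the name next_capacity and the minimum size of 16 are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Given the current buffer capacity and the length of the string that
# must fit, return the capacity the doubling/halving policy would choose.
# (next_capacity and the minimum of 16 are hypothetical.)
sub next_capacity {
    my ($capacity, $length) = @_;
    my $min = 16;

    $capacity = $min if $capacity < $min;

    # Grow: double until the string fits.
    $capacity *= 2 while $length > $capacity;

    # Shrink: halve while the string uses less than a quarter of the
    # buffer, but never go below the minimum.
    $capacity = int($capacity / 2)
        while $capacity > $min && $length < $capacity / 4;

    return $capacity;
}

for my $case ([16, 100], [128, 40], [128, 20], [1024, 0]) {
    my ($capacity, $length) = @$case;
    printf "capacity %4d, length %3d -> %d\n",
        $capacity, $length, next_capacity($capacity, $length);
}
```

The gap between the growth threshold (full) and the shrink threshold (a quarter) is what keeps a string hovering around one size from reallocating on every change.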
Re^3: No garbage collection for my-variables by Joost (Canon) on Sep 15, 2008 at 20:55 UTC
"I don't see what garbage collection has to do with this. The strings in question don't have any references to them, so the reference counter shouldn't have any problem knowing that they're not in use."

Reference counting has everything to do with it, since it means that the only time perl can free the memory is when the last reference to the scalar goes out of scope. All without knowing if that scalar is ever going to be reused. That means it either has to keep it there always, or free it always (or do some kind of heuristic, which should usually mean keep it, since allocating memory is expensive, and if you're using a large string now, chances are, you'll be using a large string again some time soon).

What perl currently cannot do is free "old, unused" scalars when it's running out of memory. It has to decide when the scalar is going out of scope. Allocating and freeing each scalar every time that happens would probably slow down the interpreter a lot.
Re^4: No garbage collection for my-variables by betterworld (Curate) on Sep 15, 2008 at 23:01 UTC
Re^4: No garbage collection for my-variables by kyle (Abbot) on Sep 16, 2008 at 16:07 UTC
Re^5: No garbage collection for my-variables (possibilities) by tye (Sage) on Sep 16, 2008 at 21:00 UTC
Re^3: No garbage collection for my-variables by ikegami (Patriarch) on Sep 16, 2008 at 08:03 UTC
"The strings in question don't have any references to them"

Not true. The pad that refers to them when the function is being executed still refers to them when the function isn't being executed. It could be changed to be true, so this nit pick is not relevant to the conversation.
Re: No garbage collection for my-variables by zentara (Archbishop) on Sep 15, 2008 at 20:38 UTC
You might be interested in OS memory reclamation with threads on linux. I'm no expert, but it seems that Perl uses some internal calculator to determine when, and how much, memory to free back to the system. It is clearly seen in the above node, where a memory-heavy thread is almost totally released back to the system, but with light-weight threads, it is held onto.

I was musing the other day that it would be a neat feature to have a "forced demalloc" on threads, where you could specify an option to free all memory used by a thread once it's done, damn the refcount. I would like that option, as it would then be easy to reclaim memory just by putting it in a thread, and specifying "free_all". Possibly warnings may be issued, but another "no warnings:free_all" could be used.

I'm not really a human, but I play one on earth Remember How Lucky You Are
Re: No garbage collection for my-variables by repellent (Priest) on Sep 16, 2008 at 02:31 UTC
Another workaround is to re-exec the program, as outlined in: How can I free an array or hash so my program shrinks?
Re^2: No garbage collection for my-variables (exec code) by tye (Sage) on Sep 16, 2008 at 04:16 UTC
Hmm, that FAQ answer could do with some code:

exec( $^X, $0, @ARGV ) or die "Can't execute self so killing self: $!\n";

- tye
Re^3: No garbage collection for my-variables (exec code) by repellent (Priest) on Sep 16, 2008 at 04:38 UTC
Nice! Also remember: don't `shift` your `@ARGV` ;-)

But seriously, wouldn't it be more involved since we need to consider saving the program "state" and resume it somehow?
Re^4: No garbage collection for my-variables (exec code) by tye (Sage) on Sep 16, 2008 at 04:42 UTC
Re: No garbage collection for my-variables by CountZero (Bishop) on Sep 16, 2008 at 16:25 UTC
++ for this very interesting post.

I guess it is all a matter of design choices. The `my` variables are not specially made for the purpose of releasing memory when they go out of scope. Maybe some of us erroneously thought/hoped/wished they were, but as you clearly showed, they are not. Rather, their design is to encapsulate them within their scope and not "pollute" the variable-namespace outside of it. And they do that very well. As they are mostly used in loops and sub-routines, there is a good argument to be made in favour of speed above memory-consumption. You really do not want your tight running loop to get slowed down by repeated (de-)allocation of memory!

Also, I have been using Perl for many years now and never got any of the memory-issues you mention. The examples you give are correct, but --IMO-- marginal or degenerate situations. Still, we should not be blind to these issues, and if you have a memory hungry program, you indeed may have to program very carefully so as not to exhaust your memory.

Thank you for reminding the Monastery of this!

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re: No garbage collection for my-variables by BrowserUk (Patriarch) on Sep 16, 2008 at 18:01 UTC
Maybe it's time for the fabled use less to allow this memory-for-speed optimisation to be disabled?

That said, most of the types of routines for which this could become a significant problem, things like your examples of encode and decode that take a string and return it modified in some way, ought to be written to use the pass-by-reference aliasing effects of `@_` anyway. It would make this 'problem' go away.

Of course, an orthodoxy has grown up around this place that pass-by-reference and side-effects are somehow bad karma and that directly accessing `@_` is premature optimisation. That modifying your arguments is bad because it is action at a distance that can surprise the caller.

But, as long as subroutines are documented as modifying their argument(s), it really does make the most sense in many cases. The caller knows what subsequent use it will make of the arguments it passes you, and if it needs them to be preserved, it can make copies as and when it needs to. Which makes more sense than every subroutine copying every parameter, every time, 'just in case'.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^2: No garbage collection for my-variables by kyle (Abbot) on Sep 16, 2008 at 19:11 UTC
In addition to moritz's excellent point that a function that modifies its arguments then could not be called with a literal, I'd also point out that a lot of Perl programmers probably don't know that `@_` is full of aliases. I'd been programming in Perl off and on for over ten years before I came to the Monastery and learned that `@_` is aliases. I've asked about this feature in interviews I've conducted, and the prospects out there have always been surprised at this feature.

Documentation helps, of course, but someone who doesn't know this is possible could spend an awful lot of time debugging before discovering this (as you say) action at a distance.

Thumbs up on the use less, however.
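For readers who have not met the aliasing yet, here is a minimal sketch contrasting a sub that copies its argument with one that works on `$_[0]` directly. The sub names trim_copy and trim_in_place are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Conventional style: copy the argument into a my-variable and
# return the modified copy. The caller's variable is untouched.
sub trim_copy {
    my ($text) = @_;
    $text =~ s/\A\s+//;
    $text =~ s/\s+\z//;
    return $text;
}

# Aliasing style: $_[0] IS the caller's variable, so no copy is made
# and the change is visible to the caller.
sub trim_in_place {
    $_[0] =~ s/\A\s+//;
    $_[0] =~ s/\s+\z//;
    return;
}

my $line = "  padded  ";
my $copy = trim_copy($line);
my $line_after_copy = $line;     # still has its spaces

trim_in_place($line);            # $line itself is now trimmed

print "copy: [$copy]\n";
print "after trim_copy: [$line_after_copy]\n";
print "after trim_in_place: [$line]\n";

# Calling the in-place version with a literal is the case moritz
# mentions. It is wrapped in eval because perl may refuse to modify
# a constant; the outcome is reported rather than assumed.
my $ok = eval { trim_in_place("  literal  "); 1 };
print "literal call ", ($ok ? "succeeded\n" : "failed: $@");
```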
Re^3: No garbage collection for my-variables by BrowserUk (Patriarch) on Sep 16, 2008 at 22:10 UTC
Done right, you can have both (see Re^3: No garbage collection for my-variables). That way, the unaware are not caught out, but when the facility is needed it is available.

It's the same mechanism that sort uses for in-place sorting in 5.10. I've thought about patching `List::Util::shuffle()` in the same way.
Re^4: No garbage collection for my-variables by ikegami (Patriarch) on Sep 17, 2008 at 03:10 UTC
Re^4: No garbage collection for my-variables by betterworld (Curate) on Sep 16, 2008 at 22:35 UTC
Re^5: No garbage collection for my-variables by BrowserUk (Patriarch) on Sep 17, 2008 at 00:08 UTC
Re^3: No garbage collection for my-variables by shmem (Chancellor) on Sep 16, 2008 at 22:48 UTC
"that a function that modifies its arguments then could not be called with a literal"

There are edge cases. See foreach funny business.
Re^2: No garbage collection for my-variables by moritz (Cardinal) on Sep 16, 2008 at 18:55 UTC
There's a much more perlish reason not to modify the arguments of a sub by default. If you don't, you can write stuff like this:

other_function(decode('latin-1', 'string_literal'));
# and if you want to change a variable
$var = decode('latin-1', $var);

On the other hand, if you do change the arguments of the sub, the first one requires another variable, which is a real kludge (visually, at least):

do {
    my $var = 'string_literal';
    decode('latin-1', $var);
    other_function($var);
};
# and the other one
decode('latin-1', $var);
Re^3: No garbage collection for my-variables by BrowserUk (Patriarch) on Sep 16, 2008 at 21:57 UTC
I think that you've overplayed the case. Using a do block instead of an anonymous block makes it look more complicated than it is.

Even wrapping a local var in a bare block is rarely necessary. Most code is nested at some level in an if or while or other loop block or subroutine body. On the rare occasions that it is at the top level of a program or module, if you really want it to be garbage collected, undef is better (in that it will actually achieve something) anyway.

Even the use of a constant is emphasising the rare case. Mostly data is read in from external sources and is in a variable already, so:

while( my $var = <$fh> ) {
    mutate( $var );
    use( $var );
}

is hardly onerous, but even that can be avoided. Thanks to perl's context sensitivity, you can have the best of both worlds. For the simple case, subroutines behave as passthru pass-by-value, but when the need arises to minimise memory allocation and copying, using it in a void context does the right thing:

#! perl -slw
use strict;

sub mutates {
    my $ref = defined wantarray ? \shift : \$_[ 0 ];
    $$ref =~ s[(?<=\b[^ ])([^ ]+)(?=[^ ]\b)][scalar reverse $1]ge;
    return $$ref if defined wantarray;
    return;
}

sub doSomething {
    print shift;
}

doSomething( mutates( 'antidisestablishmentarismania' ) );

my $var = 'The quick brown fox jumps over the lazy dog';
mutates( $var );
doSomething( $var );

__END__
c:\test>junk
ainamsiratnemhsilbatsesiditna
The qciuk bworn fox jpmus oevr the lzay dog
Re^4: No garbage collection for my-variables by moritz (Cardinal) on Sep 16, 2008 at 22:36 UTC
Re^5: No garbage collection for my-variables by BrowserUk (Patriarch) on Sep 17, 2008 at 00:06 UTC
Some notes below your chosen depth have not been shown here
Re^3: No garbage collection for my-variables by Porculus (Hermit) on Sep 16, 2008 at 21:57 UTC
Agreed a thousand times over. If I had a penny for every time I'd been forced to write tedious and ugly code because `chomp` modifies its argument instead of returning the chomped version, I'd have several pennies.
Re^4: No garbage collection for my-variables by ikegami (Patriarch) on Sep 17, 2008 at 03:02 UTC
Re^5: No garbage collection for my-variables by vrk (Chaplain) on Sep 17, 2008 at 08:20 UTC
Re^2: No garbage collection for my-variables by betterworld (Curate) on Sep 16, 2008 at 18:51 UTC
"Maybe it's time for the fabled use less"

Good point. Maybe there just isn't a way for perl to detect how a particular variable could be optimized, but it would be possible if the user could decide.

"things like your examples of encode and decode that take a string and return it modified in some way, ought to be written to use the pass-by-reference aliasing effects of @_ anyway."

Unfortunately I don't think it's realistic to demand that all modules be written this way. In the case of Encode, I'd rather use the module than my own memory-conserving code; and it's not convenient to change the module's source code. (I would probably even have to change it if "use less" worked, because it's lexically scoped afaik.) (However, I could encode the text line by line as Joost suggested.)
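A minimal sketch of the line-by-line approach, assuming the input can be read from a filehandle. In-memory filehandles stand in for real files here so that the example is self-contained, and 'UTF-8' is used as the encoding name; the sample text is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Sample input: five short lines, each containing one Latin-1 character.
my $text = join '', map { "line $_: caf\x{e9}\n" } 1 .. 5;

# In-memory handles stand in for the real input and output files.
open(my $in,  '<', \$text)       or die "cannot open input: $!";
open(my $out, '>', \my $encoded) or die "cannot open output: $!";

# Encode one line at a time. Whatever buffers the my-variables inside
# encode() keep afterwards are only as large as the longest line seen,
# not as large as the whole document.
while (my $line = <$in>) {
    print {$out} encode('UTF-8', $line);
}

close $in;
close $out or die "cannot close output: $!";

printf "%d characters in, %d bytes out\n", length($text), length($encoded);
```

Encoding in pieces gives the same bytes as encoding the whole string at once, as long as each piece is split on a character boundary, which reading by line guarantees here.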

