No garbage collection for my-variables

by betterworld (Curate)
on Sep 15, 2008 at 19:48 UTC

Dear monks,

recently, I've learned something from ikegami in Re: out of memory problem.

Apparently, the memory that is (directly) occupied by my-variables is never automatically freed. The term "directly" includes the case where a my-variable holds a (very long) string.

As ikegami has demonstrated, a string buffer will grow when needed, but it will never (automatically) shrink or disappear. Apparently this is part of some kind of optimization to avoid constantly reallocating buffers for code that is used often.

I've shown that a buffer will only be reused for the same variable. This means that it will not be reused for a variable in another subroutine. This is where I see a big problem. In large programs there are thousands of lexical variables, and not all of them are used more than a few times, but all of them retain their buffers. Even for a buffer that is reused often, this is not optimal: once it has held a large chunk of data, it will stay at that size, regardless of how small the data is in subsequent calls.
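A rough way to watch this from inside a program is to ask the core B module how many bytes are allocated for a lexical's string buffer on entry to a sub. This is only a sketch: the names buffer_len and fill are made up for illustration, and the numbers depend on the perl version. On the perls discussed in this thread the second call would be expected to find roughly 100,000 bytes still attached to $buf; newer perls may behave differently, so no output is shown.

```perl
use strict;
use warnings;
use B ();    # core module exposing perl's internal SV structures

# Hypothetical helper: bytes allocated for a scalar's string buffer,
# or 0 if the scalar has never held a string.
sub buffer_len {
    my ($ref) = @_;
    my $sv = B::svref_2object($ref);
    return $sv->can('LEN') ? $sv->LEN : 0;
}

# Records what $buf's buffer looks like on entry (before this call has
# assigned anything to it), then fills $buf with $size bytes.
sub fill {
    my ($size) = @_;
    my $buf;
    my $on_entry = buffer_len(\$buf);
    $buf = 'x' x $size;
    return $on_entry;
}

my $first  = fill(100_000);    # fresh variable: nothing allocated yet
my $second = fill(10);         # whatever the first call left behind, if anything

print "buffer on entry to call 1: $first bytes\n";
print "buffer on entry to call 2: $second bytes\n";
```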

And I've done my homework and done a super search. As a matter of fact, this topic has been discussed before (Garbage collection of 'my' variables, Re: Tracking Memory Leaks).

The commonly suggested workarounds are:

  • undef-ing variables after using them. (Actually this is not always practicable, e.g. in the case where you want to return the scalar)
  • Designing the code so that it works on references or aliases.
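The first workaround can be sketched like this. The sub name summarize and the helper buffer_len are assumptions for illustration; the point is only that undef $scratch (as opposed to letting the variable fall out of scope) releases the string buffer before the sub returns.

```perl
use strict;
use warnings;
use B ();    # core module, used here only to inspect the buffer size

# Hypothetical helper: bytes allocated for a scalar's string buffer.
sub buffer_len {
    my ($ref) = @_;
    my $sv = B::svref_2object($ref);
    return $sv->can('LEN') ? $sv->LEN : 0;
}

# Builds a large scratch string, extracts the small result it needs,
# then explicitly releases the scratch buffer before returning.
sub summarize {
    my ($n) = @_;
    my $scratch = 'x' x $n;
    my $length  = length $scratch;
    my $held    = buffer_len(\$scratch);    # buffer while the string is live
    undef $scratch;                         # frees the buffer itself
    my $left    = buffer_len(\$scratch);    # buffer after undef
    return ($length, $held, $left);
}

my ($length, $held, $left) = summarize(100_000);
print "length=$length held=$held left=$left\n";
```

As noted above, this does not help when the large scalar is the very thing you want to return.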

Alright, but the problem is that code is generally not designed like this. Of course, you can design your code this way if you plan to handle large data. However, almost all serious projects use external code that they haven't written themselves. In my search for the most obvious example, I found Encode.pm:

Consider this code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;

    sub init {
        encode('utf8', 'x' x 100_000_000);
        return ();
    }

    print "starting\n";
    sleep 5;
    print "initializing\n";
    init();
    print "initialized\n";
    sleep 5;
    print "cleaning\n";
    undef &Encode::encode;
    sleep 5;

If you watch this program's memory consumption, you'll find that it will use approximately 288MB after "initialized" has been printed. After "cleaning" has been printed, the amount will shrink considerably to 98MB. (Actually it will shrink even more if you wipe out the "init" subroutine itself, I guess this is because of the large string constant.)

Responsible is this code in Encode.pm:

    sub encode($$;$) {
        my ($name, $string, $check) = @_;
        # ...
        my $octets = $enc->encode($string,$check);
        $_[1] = $string if $check and !($check & LEAVE_SRC());
        return $octets;
    }

Both $string and $octets hold our dear string, and the author (like me) obviously thought there was no need to free their memory.

I've named the subroutine "init" to suggest that this is code that will only be used at the very start of a long program lifetime, which means that the long string buffer will linger around needlessly.

So, what am I supposed to do? Not use Encode and do my character transcoding myself? Or should I actually use "undef" to clean all the subs that I have used? Consider that my initialization code loads an XML configuration file. I'd have to clean most of the namespaces of XML::Simple, XML::Parser and whatnot. And if I actually plan to continue using these modules, I'd have to wipe out "%INC", then require them again, which is not very nice. (Just an example; I've not really checked these modules, so please don't be offended if you are the author and have considerately undef-ed every variable.)

I've not looked for this optimization in the perl source yet, but I'd really like someone to explain why it is needed. I can agree that it would not be performant to do a lot of malloc/mmap/munmap/brk for every string that is copied, however IMHO there are situations where perl should find some way to realize that any of the following cannot be performant either:

  • Holding (several) SVs in memory that occupy lots and lots of memory pages
  • Holding something like 1000 SVs in memory, none of which has been used more than once

I'll conclude with an example snippet that demonstrates how you can eventually get your computer to use excessive amounts of memory or even swap:

perl -lwe 'my $code = join "", map { "sub foo$_ { my \$var = q(x) x 1_000_000; }" } 1..1000; eval $code; die if $@; for (1..1000) { sleep 1; "foo$_"->() }'

Because of the "sleep", you can run this snippet and watch it indulge itself by eating one megabyte per second (in GNU, use "top" and press M).

The snippet uses string-eval to generate a lot of subroutines like this:

    sub foo1 { my $var = q(x) x 1_000_000; }
    sub foo2 { my $var = q(x) x 1_000_000; }
    # and so forth...

then calls them one after another.

Well, that was a large chunk of text now, I've tried to ease your reading by using bold text, I hope that perlmonks' buffers will eventually be freed from this text, and I hope that I haven't missed something obvious.

Replies are listed 'Best First'.
Re: No garbage collection for my-variables
by Joost (Canon) on Sep 15, 2008 at 20:13 UTC
    A program has only a limited number of lexical variables, but may process an unlimited amount of data.

    It's the case anyway that for large strings (which is the only case we need to consider) it's much more efficient to pass around references. And code that expects to deal with very long strings generally does that, or encapsulates the strings in an object or deals with file handles directly.
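A minimal sketch of what passing a reference looks like; count_lines is a made-up name, and tr/// is used in its counting form so the text is neither copied nor modified:

```perl
use strict;
use warnings;

# Hypothetical helper: counts newlines in a (potentially huge) string.
# The caller passes a reference, so the sub never copies the text into
# a lexical of its own.
sub count_lines {
    my ($text_ref) = @_;
    return ($$text_ref =~ tr/\n//);    # tr with an empty replacement only counts
}

my $text  = "alpha\nbeta\ngamma\n";
my $lines = count_lines(\$text);
print "lines: $lines\n";
```

The only lexical inside the sub is the reference itself, which is a few bytes no matter how large the string is.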

    Copying 500MB strings around would be stupid, and not just for memory reasons, even if all the memory were reclaimed when the variables holding them go out of scope. You really do want to pay attention to what you're doing when dealing with large chunks of memory.

    Perl is optimizing here for the cases where you want fast, repeated processing of strings no larger than say 10% of your memory. If you need to process larger strings, you'll have to pay attention anyway, and automatically clearing all scalars won't really help much (and it would dramatically slow down the general case).

    I don't see the current behaviour changing until someone completes a perl with a garbage collector instead of the current refcounting scheme. That would be perl 6, so it may take a while.

    update: I just wanted to mention that although all of this is interesting in a way, it's very unlikely that this behaviour has given you any actual problems. Just don't slurp in giant files, or Encode a whole dictionary in one call. What's wrong with reading and writing stuff line by line? That way, you can run thousands of those programs at once without any problem (or a couple at once, so as to actually use your CPU for something useful, instead of waiting for the drive to catch up).
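For the Encode case, the line-by-line approach might look like the sketch below. The sub name encode_lines is an assumption, and in-memory filehandles stand in for real files; with real files you would open them the same way with a path instead of a scalar reference.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: reads lines of characters from $in and writes
# UTF-8 octets to $out, so Encode only ever sees one line at a time.
sub encode_lines {
    my ($in, $out) = @_;
    my $count = 0;
    while (defined(my $line = <$in>)) {
        print {$out} encode('UTF-8', $line);
        $count++;
    }
    return $count;
}

# Usage with in-memory handles standing in for real files.
my $source = "caf\x{e9}\nna\x{ef}ve\n";    # Latin-1 range characters
my $octets = '';
open my $in,  '<', \$source or die "cannot open input: $!";
open my $out, '>', \$octets or die "cannot open output: $!";
my $count = encode_lines($in, $out);
close $out;
close $in;

print "encoded $count lines into ", length($octets), " octets\n";
```

Whatever buffers Encode's lexicals retain are then bounded by the longest line rather than by the whole input.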

      I don't see the current behaviour changing until someone completes a perl with a garbage collector instead of the current refcounting scheme.

      The OP is saying that you can allocate a large string, let the variable go out of scope, and the memory is not freed and not reused. The memory allocated to the variable "sticks" to it even if you never use it again. (If I have this wrong, betterworld, please correct me.)

      I don't see what garbage collection has to do with this. The strings in question don't have any references to them, so the reference counter shouldn't have any problem knowing that they're not in use.

      I don't know what method perl uses to grow strings. The general method I recall from my CS classes was to double the size of a string when it grows out of its buffer and halve it when it shrinks to less than a quarter of the buffer size. Maybe someone more familiar with the internals can shed some light on why that wouldn't be a good design choice for Perl.
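The textbook policy described above can be written down as a small function. To be clear, this is the classroom scheme, not a description of what perl does internally, and next_capacity is a made-up name:

```perl
use strict;
use warnings;

# Textbook resizing policy (NOT perl's actual behaviour): double the
# capacity while the data does not fit, halve it while the data fills
# less than a quarter of the buffer.
sub next_capacity {
    my ($capacity, $length) = @_;
    $capacity = 1 if $capacity < 1;

    # Grow: double until the data fits.
    $capacity *= 2 while $length > $capacity;

    # Shrink: halve while less than a quarter is in use.
    $capacity = int($capacity / 2)
        while $capacity > 1 && $length < $capacity / 4;

    return $capacity;
}

printf "%d -> %d\n", 16,   next_capacity(16, 17);
printf "%d -> %d\n", 16,   next_capacity(16, 3);
printf "%d -> %d\n", 1024, next_capacity(1024, 10);
```

The gap between the grow threshold (full) and the shrink threshold (a quarter full) is what keeps a string hovering around one size from being reallocated on every small change.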

        I don't see what garbage collection has to do with this. The strings in question don't have any references to them, so the reference counter shouldn't have any problem knowing that they're not in use.
        Reference counting has everything to do with it, since it means that the only time perl can free the memory is when the last reference to the scalar goes out of scope. All without knowing if that scalar is ever going to be reused.

        That means it either has to keep it there always, or free it always (or do some kind of heuristic, which should usually mean keep it, since allocating memory is expensive, and if you're using a large string now, chances are, you'll be using a large string again some time soon).

        What perl currently cannot do is free "old, unused" scalars when it's running out of memory. It has to decide when the scalar is going out of scope. Allocating and freeing each scalar every time that happens would probably slow down the interpreter a lot.

        The strings in question don't have any references to them

        Not true. The pad that refers to them when the function is being executed still refers to them when the function isn't being executed.

        It could be changed to be true, so this nit pick is not relevant to the conversation.

Re: No garbage collection for my-variables
by zentara (Archbishop) on Sep 15, 2008 at 20:38 UTC
    You might be interested in OS memory reclamation with threads on linux. I'm no expert, but it seems that Perl uses some internal calculation to determine when, and how much, memory to free back to the system. This is clearly seen in the above node, where a memory-heavy thread is almost totally released back to the system, but with light-weight threads, the memory is held onto.

    I was musing the other day, that it would be a neat feature to have a "forced demalloc" on threads, where you could specify an option to free all memory used by a thread once it's done, damn the refcount. I would like that option, as it would then be easy to reclaim memory just by putting it in a thread, and specifying "free_all". Possibly warnings may be issued, but another "no warnings:free_all" could be used.


    I'm not really a human, but I play one on earth Remember How Lucky You Are
Re: No garbage collection for my-variables
by repellent (Priest) on Sep 16, 2008 at 02:31 UTC

      Hmm, that FAQ answer could do with some code:

      exec( $^X, $0, @ARGV ) or die "Can't execute self so killing self: $!\n";

      - tye        

        Nice!

        Also remember: don't shift your @ARGV ;-)

        But seriously, wouldn't it be more involved since we need to consider saving the program "state" and resume it somehow?
Re: No garbage collection for my-variables
by CountZero (Bishop) on Sep 16, 2008 at 16:25 UTC
    ++ for this very interesting post.

    I guess it is all a matter of design choices. The my variables are not specially designed to release memory when they go out of scope. Some of us may erroneously have thought/hoped/wished they were, but as you clearly showed, they are not.

    Rather their design is to encapsulate them within their scope and not "pollute" the variable-namespace outside of it. And they do that very well.

    As they are mostly used in loops and sub-routines, there is a good argument to be made in favour of speed above memory-consumption. You really do not want your tight running loop to get slowed down by repeated (de-)allocation of memory!

    Also, I have been using Perl for many years now and never got any of the memory-issues you mention. The examples you give are correct, but --IMO-- marginal or degenerate situations. Still we should not be blind to these issues and if you have a memory hungry program, you indeed may have to program very carefully so as not to exhaust your memory. Thank you for reminding the Monastery of this!

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: No garbage collection for my-variables
by BrowserUk (Patriarch) on Sep 16, 2008 at 18:01 UTC

    Maybe it's time for the fabled use less to allow this memory-for-speed optimisation to be disabled?

    That said, most of the types of routines for which this could become a significant problem, things like your examples of encode and decode that take a string and return it modified in some way, ought to be written to use the pass-by-reference aliasing effects of @_ anyway. It would make this 'problem' go away.

    Of course, an orthodoxy has grown up around this place that pass-by-reference and side-effects are somehow bad karma and that directly accessing @_ is premature optimisation. That modifying your arguments is bad because it is action at a distance that can surprise the caller.

    But, as long as subroutines are documented as modifying their argument(s), it really does make the most sense in many cases. The caller knows what subsequent use it will make of the arguments it passes you, and if it needs for them to be preserved, it can make copies as and when it needs to. Which makes more sense than every subroutine, copying every parameter, every time, 'just in case'.
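A small sketch of that contract, with a made-up sub trim_in_place: the sub works directly on $_[0], which is an alias for the caller's variable, and a caller that still needs the original makes its own copy first.

```perl
use strict;
use warnings;

# Hypothetical example. Documented contract: trims leading and trailing
# whitespace from its first argument IN PLACE and returns nothing.
sub trim_in_place {
    $_[0] =~ s/\A\s+//;    # $_[0] is an alias for the caller's variable
    $_[0] =~ s/\s+\z//;
    return;
}

# Caller that does not need the original: no copy is made anywhere.
my $input = "  some text \n";
trim_in_place($input);

# Caller that does need the original: it copies, and only it pays for that.
my $original = "  keep me  ";
my $copy     = $original;
trim_in_place($copy);

print "[$input] [$original] [$copy]\n";
```

The sub has no string-valued lexicals of its own, so there is no buffer for it to retain between calls.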


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      In addition to moritz's excellent point that a function that modifies its arguments then could not be called with a literal, I'd also point out that a lot of Perl programmers probably don't know that @_ is full of aliases. I'd been programming in Perl off and on for over ten years before I came to the Monastery and learned that @_ is aliases. I've asked about this feature in interviews I've conducted, and the prospects out there have always been surprised at this feature. Documentation helps, of course, but someone who doesn't know this is possible could spend an awful lot of time debugging before discovering this (as you say) action at a distance.

      Thumbs up on the use less, however.

        that a function that modifies its arguments then could not be called with a literal

        There are edge cases. See foreach funny business.

      There's a much more perlish reason not to modify the arguments of a sub by default. If you don't, you can write stuff like this:

          other_function(decode('latin-1', 'string_literal'));
          # and if you want to change a variable
          $var = decode('latin-1', $var);

      On the other hand if you do change the arguments of the sub, the first one requires another variable, which is a real kludge (visually, at least)

          do {
              my $var = 'string_literal';
              decode('latin-1', $var);
              other_function($var);
          }
          # and the other one
          decode('latin-1', $var)

        I think that you've overplayed the case. Using a do block instead of an anonymous block makes it look more complicated than it is.

        Even wrapping a local var in a bare block is rarely necessary. Most code is nested at some level in an if or while or other loop block or subroutine body.

        On the rare occasions that it is at the top level of a program or module, if you really want it to be garbage collected, undef is better (in that it will actually achieve something) anyway.

        Even the use of a constant is emphasising the rare case. Mostly data is read in from external sources and is in a variable already, so:

            while( my $var = <$fh> ) {
                mutate( $var );
                use( $var );
            }

        is hardly onerous, but even that can be avoided. Thanks to perl's context sensitivity, you can have the best of both worlds. For the simple case, subroutines behave as passthru pass-by-value, but when the need arises to minimise memory allocation and copying, using them in a void context does the right thing:

            #! perl -slw
            use strict;

            sub mutates {
                my $ref = defined wantarray ? \shift : \$_[ 0 ];
                $$ref =~ s[(?<=\b[^ ])([^ ]+)(?=[^ ]\b)][scalar reverse $1]ge;
                return $$ref if defined wantarray;
                return;
            }

            sub doSomething {
                print shift;
            }

            doSomething( mutates( 'antidisestablishmentarismania' ) );

            my $var = 'The quick brown fox jumps over the lazy dog';
            mutates( $var );
            doSomething( $var );

            __END__
            c:\test>junk
            ainamsiratnemhsilbatsesiditna
            The qciuk bworn fox jpmus oevr the lzay dog

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Agreed a thousand times over. If I had a penny for every time I'd been forced to write tedious and ugly code because chomp modifies its argument instead of returning the chomped version, I'd have several pennies.
      Maybe it's time for the fabled use less

      Good point. Maybe there just isn't a way for perl to detect how a particular variable could be optimized, but it would be possible if the user could decide.

      things like your examples of encode and decode that take a string and return it modified in some way, ought to be written to use the pass-by-reference aliasing effects of @_ anyway.

      Unfortunately I don't think it's realistic to demand that all modules be written this way. In the case of Encode, I'd rather use the module than my own memory-conserving code; and it's not convenient to change the module's source code. (I would probably even have to change it if "use less" worked, because it's lexically scoped afaik.)

      (However I could encode the text line by line as Joost suggested.)

Node Type: perlmeditation [id://711531]
Approved by Joost
Front-paged by grinder