Tracking down memory leaks

by scain (Curate)
on Apr 13, 2005 at 12:18 UTC ( [id://447341] )

scain has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have an application that parses largish tab-delimited files (typically a few million rows) and writes out files suitable for bulk loading into a postgres database. Unfortunately, it seems to leak memory: after processing a few hundred thousand lines, the process starts to crawl as swap gets used up, and not long after that the OS starts killing things.

When I realized that I had a memory leak, I figured I knew exactly where the problem was: I was using a few hashes to keep track of IDs that I was generating for a few of the tables. Simple enough--I tied those hashes to DB_File and the problem should be solved, right? Well, if it were, I wouldn't be writing this, would I?
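For reference, the tie itself looks something like this (a minimal sketch; the file name and hash name are illustrative, not my actual code):

    use DB_File;
    use Fcntl;

    # Back the ID cache with an on-disk DB_File hash so it no longer
    # grows in process memory.
    my %id_cache;
    tie %id_cache, 'DB_File', 'id_cache.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Cannot tie id_cache.db: $!";

    $id_cache{'some_term'} = 42;   # stored in the file, not on the heap

    untie %id_cache;               # flush and close the file when done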

Heading off to CPAN, I found Devel::Leak, Devel::LeakTrace, Devel::Leak::Object and Devel::ObjectTracker. Unfortunately again, none of these seems to help much. Devel::Leak is quite low level, so I don't really know how to use it. Ditto for Devel::LeakTrace. If anyone has suggestions for how to use either of them to find the leak, I'm all ears/eyes.
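From its documentation, the basic Devel::Leak pattern appears to be something like the following (a sketch; run_one_pass() is a hypothetical stand-in for one chunk of my parsing loop):

    use Devel::Leak;

    # Count the SVs alive before the suspect code runs...
    my $handle;
    my $before = Devel::Leak::NoteSV($handle);

    run_one_pass();    # hypothetical: one iteration of the parse loop

    # ...then CheckSV reports the SVs that appeared and were never freed.
    my $after = Devel::Leak::CheckSV($handle);
    print "SVs gained: ", $after - $before, "\n";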

Devel::ObjectTracker seemed promising. It is well documented and has several configurable options, including allowing it to track hashes and arrays in addition to objects. Unfortunately (I am getting tired of using that word), it seems to ignore all the option settings and doesn't print anything other than table column headings in its output.

Devel::Leak::Object also seemed promising. It overrides the bless function to allow tracking of the creation and destruction of objects. While this worked, it only showed two objects remaining at the completion of the main loop: one is a small config-fetching object that I could easily destroy before the loop, but there is no way that is leaking; the other is the object that handles IO, and the author of that object insists that it can't be leaking either (there are no package variables in the object).
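For the record, I invoked it the way its docs describe, hooking every bless() in the program:

    # The GLOBAL_bless import makes Devel::Leak::Object track every
    # object created anywhere, and it prints a census of objects still
    # alive when the program exits.
    use Devel::Leak::Object qw( GLOBAL_bless );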

So, if anyone has suggestions for where to go from here, I'd really appreciate seeing them.

thanks much,

Scott
Project coordinator of the Generic Model Organism Database Project

Re: Tracking down memory leaks
by perrin (Chancellor) on Apr 13, 2005 at 12:34 UTC
    Growing and leaking are not the same thing. A perl program can use more memory after running for a while even if nothing is wrong. For example, if you load a 10MB file into a scalar, that scalar will hang onto that memory, even if it goes out of scope. You would have to explicitly undef it to get the memory back.
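
    A minimal illustration of the point (file name hypothetical):

        my $data;
        {
            open my $fh, '<', 'big_file.txt' or die $!;
            local $/;                 # slurp mode
            $data = <$fh>;            # $data now owns a ~10MB buffer
        }
        $data = '';     # the buffer is still reserved for reuse
        undef $data;    # releases the buffer back to perl's allocator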

    So, your real question is "How can I make my program use less memory?" There are some answers to that in general terms in the Perl documentation and other places. For specific advice, try to narrow down a small section that grows a lot over time, and post it here for help.

      A perl program can use more memory after running for a while even if nothing is wrong. For example, if you load a 10MB file into a scalar, that scalar will hang onto that memory, even if it goes out of scope.
      Shouldn't these situations be handled by garbage collection? If a scalar (or array/hash) goes out of scope (e.g. a subroutine's internal variables), shouldn't it be freed when memory is needed, before asking the system for more?

      AFAIK, memory obtained from the system won't be given back; the process footprint doesn't shrink.
        Shouldn't these situations be handled by garbage collection?
        Maybe, but mostly they aren't, for performance reasons:
        sub bla {
            my $arg        = shift;
            my $big_string = $arg x 1000;
        }

        Perl will in general keep the memory for $big_string allocated and reserved, because then it doesn't need to allocate the memory again next time the sub is called.

        Explicitly undef()ing or resizing variables before they go out of scope sometimes helps, though - on some systems, it might even free the memory back to the system.

        Usually you don't need to do this, exactly because the memory gets reused anyway. If your program grows a lot, it's more likely that you're using an inefficient algorithm or creating circular references somewhere.
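
        To illustrate the circular reference case (a minimal sketch):

            use Scalar::Util qw( weaken );

            # A classic cycle: each hash holds a reference to the other,
            # so neither refcount can drop to zero and the pair is never
            # freed, even after both variables go out of scope.
            my $parent = {};
            my $child  = { parent => $parent };
            $parent->{child} = $child;

            # Weakening one link breaks the cycle so normal refcounting works.
            weaken( $child->{parent} );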

        That's what I was thinking too. The fact that memory usage slowly grows until the OS kills the process is not good. The files I am processing have fairly short lines, and the script processes them line by line in a loop. All of the variables inside that loop should be reusing the same space, right? Otherwise, what would be the point of scoping at all?

        Scott
        Project coordinator of the Generic Model Organism Database Project

        No, the memory is not freed. Perl keeps it as an optimization since you would need to allocate it again the next time this chunk of code runs. Of course that doesn't help you any if that code doesn't run again in your program... We have discussed this at length on the mod_perl list and it is covered in the mod_perl docs.
Re: Tracking down memory leaks
by dragonchild (Archbishop) on Apr 13, 2005 at 12:57 UTC
    Are you doing something like:
    my @outfile;
    while (<INFILE>) {
        # Process stuff here.
        push @outfile, $newline;
    }
    foreach my $line (@outfile) {
        print OUTFILE $line;
    }

    Why not just do something like:

    while (<INFILE>) {
        # Process stuff here.
        print OUTFILE $newline;
    }
    The second template will grow to a fixed size and stay at that size, regardless of how many lines INFILE or OUTFILE contains.
      I can't imagine code like that working in this application since the files can be quite large. Here is an outline of how the script works:
      • Prepare several DB SELECT statement handles that will be used inside the loop to get useful information.
      • Create tied hashes for caching that information, so that I don't have to hit the database every time I need the id of some frequently used term (see the sketch after this list).
      • Create an IO object that will parse the file line by line and hand back information about the line in an OO way.
      • Loop using the IO object's next_feature method. Do lots of bookkeeping using the tied hashes. Write output to several (about 10) files that will later be loaded into postgres using COPY FROM STDIN.
      • Close open files, destroy DB statement handles, and load data into database.
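
      The cache lookup mentioned above works roughly like this (a sketch with hypothetical names; $dbh is an ordinary DBI handle and %id_cache is one of the tied hashes):

          my $sth = $dbh->prepare('SELECT id FROM term WHERE name = ?');

          sub term_id {
              my ($name) = @_;
              # Hit the on-disk cache first; fall back to the database
              # on a miss and remember the answer.
              return $id_cache{$name} if exists $id_cache{$name};
              $sth->execute($name);
              my ($id) = $sth->fetchrow_array;
              $id_cache{$name} = $id;
              return $id;
          }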

      Scott
      Project coordinator of the Generic Model Organism Database Project

        Create tied hashes for caching that information so that I don't have to hit the database every time I need the id of some frequently used term.
        Did you benchmark this? Repeatedly asking the database for the same thing might not be so bad if your database is good at caching, while tied hashes in Perl are slow. There are many factors involved, and what's best will vary from setup to setup, but don't be too quick to trade database lookups for tied hashes if it's performance you care about.

        Of course, this has nothing to do with your memory problem.

        Create tied hashes for caching that information so that I don't have to hit the database every time I need the id of some frequently used term.

        And you're wondering why your memory usage is increasing? Why don't you try a run with caching disabled and see if that fixes your problem.

Re: Tracking down memory leaks
by Anonymous Monk on Apr 13, 2005 at 12:55 UTC
    Finding a memory leak can be more an art than a set procedure. It usually involves careful analysis of your program, often aided by lots of debugging output - either from modules, or from carefully placed sections in your code.

    Of course, the leak may also be in perl, or in a library perl is using. Or in a module your program is using.

Re: Tracking down memory leaks
by scain (Curate) on Apr 19, 2005 at 02:46 UTC
    I found the source of the leak. It turned out that Devel::Leak::Object was useful after all; I just needed to take one more step. With a small set of sample data, I ran the script and then used Data::Dumper to dump out the IO object that the author assured me didn't leak. As it turns out, it was caching every line of the file that it saw, plus a fair amount of derived data that it calculates for each line. The somewhat annoying thing is that it doesn't actually use 99% of that cached data for anything (only one item per line needs to be cached). That IO object has been fixed. While it will still use some memory, the fix significantly raises the amount of data that can be processed before problems are encountered.
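
    The inspection step was nothing fancier than this (a sketch; $io_object stands in for the object Devel::Leak::Object reported as still live):

        use Data::Dumper;

        # Limit the dump depth so a huge internal cache shows up as a
        # large hash or array without flooding the terminal.
        local $Data::Dumper::Maxdepth = 2;
        print Dumper($io_object);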

    Thanks to all of you for your suggestions,

    Scott
    Project coordinator of the Generic Model Organism Database Project
