soliplaya has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have come seeking wisdom on what is happening to my quite large and quite memory-hungry perl program.

This program works only under Windows, because it uses some external programs that work only there. To describe the program in general terms: it runs as a daemon (Service), and spends its time scanning a series of directories in search of new documents. Each time it finds one, it grabs it, processes it in a series of steps, and delivers the results to a destination directory. Then it loops and does the same again. It is supposed to do that hundreds or thousands of times a day, each loop being one document.

Processing a document involves many steps, calling many CPAN modules and some external programs.

I am watching the program run with the Windows Task Manager, and I see it gradually inflate the quantity of memory it is using. It starts at about 20 MB, then in irregular little chunks of a few K, it grows over time to 120 MB (it grows a bit almost each time it processes a document - but not always, not systematically, and not by the same amount each time). I will see it gain 5, 10, 15, 30 K, then go back down 20 K, then grow again 5 K, etc., but eventually settle for somewhat more than it had at the start of the loop.

In the end it grows so much that, when I stop the program (nicely), I see that right before its exit statement, it spends a while longer (5 minutes, e.g.) growing even more, in chunks of several K at a time, before starting to deflate in bigger chunks and finally exiting after some 5 minutes of cleanup. I suppose the last bit is Perl's final DESTROYs and cleanup before exiting, and there must be a lot of destroying going on.

I have tried to trace where the gradual inflation is happening, by means of carefully inserted "prompt" instructions, which let me peek at the Task Manager's indication of memory used at each significant step, before letting the program continue. I found that way that at one point, where the program is parsing for instance a large OpenOffice document, the memory used can increase quite a lot, only to decrease quite a lot right after the parsing. But that does not explain everything, and some memory increases (or decreases) seem largely random to me (for example, without any connection to the size of the document being processed). It even happens that the program is using, say, 100 MB of memory and suddenly drops to 60 MB, after which it starts growing again in small increments. There is nothing in the logic at that point that would explain why it suddenly reclaims 40 MB then, rather than at some other moment. I can also see that it is not the calling of the external programs that causes the memory increases.

I have also tried to be somewhat "clean" about the number of dangling or circular references I'm leaving around. But it is a big and complex program, and I am not such a hotshot programmer, so it's quite possible there are some of these things lurking around. I have of course little control over what some CPAN modules might do internally.

I will add that I have carefully read all the scriptures I could find about Perl's memory management, GC, reference counting, circular references, etc., and of course when I talk above about the memory used by my program, what I see in the Task Manager is the total memory used by the process, perl interpreter included. So I generally understand what I am seeing. What I would expect is that my program gradually increases the memory it is using, more or less in relation to the largest document it has processed so far, but then at some point stabilises. But that is not what I am seeing: it seems to grow slowly, but persistently and seemingly without real limit.

The point of this long rambling is: how would one go about finding out exactly what is happening, and possibly correcting it?

Thanks in advance for any light

Re: memory (leaks and management)
by dk (Chaplain) on Mar 28, 2007 at 08:48 UTC
    If you're sure that your program never stops growing, then you can modify it so it runs the document processing in a tight loop, over the same or different documents, and see whether memory usage grows more aggressively. If it does, you can use Devel::Size if you suspect particular structures, or, more generally, comment out parts of your program until it stops leaking. If we're talking about "finding out exactly", this is the only sure method I know of that narrows down most memory leaks (in finite time :)
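
    For illustration, a minimal sketch of that tight-loop approach, assuming a hypothetical process_document() routine and a %cache hash as the structure under suspicion (substitute your own code and structures):

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    our %cache;                               # the structure you suspect of growing
    my @test_docs = ('a.doc', 'b.odt');       # a couple of fixed test documents

    sub process_document {                    # placeholder: call your real steps here
        my ($doc) = @_;
        # ... convert, parse, extract, deliver ...
    }

    for my $i (1 .. 1_000) {
        process_document( $test_docs[ $i % @test_docs ] );
        printf "iteration %4d: cache is %d bytes\n", $i, total_size(\%cache)
            if $i % 50 == 0;                  # a steadily climbing number = a leak
    }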

    I don't think the leaks can be attributed to Perl, unless your program indeed creates circular references, but then it is the program logic that has to be fixed. More probably there's a 3rd-party module that is leaking, in which case you'll locate it by reducing your program more and more until all you have is a few lines of code, a reproducible showcase. You might then want to send it to the developers (and find out that the bug was fixed several versions back; that can happen too :) Therefore, before you even begin the hunt, try installing all the latest modules and see if the problem persists.

    Good luck!

      Thanks. I will have a look at Devel::Size.
      Other than that, not necessarily in order of relevance :
      - Most of the modules are up-to-date, or else modules I have used before without noticing leaks of this magnitude
      - The program is difficult to whittle down, as one step generally depends on the results of the steps before. For instance, the program takes a MS-Office document and converts it to OpenOffice in one step, then in the next step uses the OpenOffice version to extract the text. So I cannot really strip the first part and run the second only.

      I have noticed the following funny thing: at some point, I run an external utility using system(). It can take a while to run, during which my program waits for the return of the system() call. Well, sometimes during that wait, the memory used starts decreasing, as if Perl were using the time to do some cleanup. Is this possible, or am I having hallucinations?

      A possibly related question: when exactly does Perl decide to embark on a cleanup round, and is it possible to provoke it?

        I don't think it is Perl (well, yes, perl does call free() once in a while), but rather the Win32 memory manager that keeps memory assigned to the process for a while.

        As for whittling down, I'd suggest being a bit more creative :) For example, yes indeed, you cannot take the MS-to-OO transition out, but you can pre-create the OO documents that correspond to the MS ones and change the program so that, instead of running the actual conversion, it loads the corresponding OO doc from disk. The same is valid for OO-to-text: comment out the OO part, but save the text separately and use it in the reduced program.
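
        Something along these lines, say; convert_ms_to_oo(), get_oo_document() and the file-name convention are hypothetical stand-ins for whatever your real conversion step looks like:

        # Reduced-program trick: skip the real MS-to-OO conversion and read a
        # copy that was converted once, by hand, beforehand.
        my $use_canned = 1;                       # set to 0 to run the real conversion

        sub get_oo_document {
            my ($ms_file) = @_;
            if ($use_canned) {
                (my $oo_file = $ms_file) =~ s/\.(doc|xls|ppt)x?$/.odt/i;
                return $oo_file;                  # pre-created next to the original
            }
            return convert_ms_to_oo($ms_file);    # the step under suspicion
        }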

        Hope this helps, /dk

Re: memory (leaks and management)
by bart (Canon) on Mar 28, 2007 at 10:36 UTC
    Move the processing of the documents out of the daemon. Make it a separate script that you call from the daemon. That should probably fix most problems... And if it doesn't, you've got a much smaller problem space to dig into.
      Yes, and remember: garbage collection only comes into play when an object goes out of scope or has no references left. So if you have, say, a global array that you keep pushing data onto, it will never be released.
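
      A contrived illustration of the two patterns (next_document() and process() are hypothetical):

      our @results;                          # package global: lives as long as the daemon
      while ( my $doc = next_document() ) {
          push @results, process($doc);      # @results only ever grows
      }

      # versus keeping the data in a lexical that dies with each iteration:
      while ( my $doc = next_document() ) {
          my @batch = process($doc);         # can be freed and reused on the next pass
          # ... deliver @batch ...
      }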
Re: memory (leaks and management)
by BrowserUk (Patriarch) on Mar 28, 2007 at 10:54 UTC

    I've done very little with Win32::OLE, and it was several years ago, but I do recall that it leaked quite substantially when called repetitively. From memory, the leaks were deep inside the OLE code and beyond the control of the user programs. Each new OLE object created never quite seemed to clean up all its memory when it was undef'd.

    My suggestion to you would be to split your code into two parts. Leave the directory-monitoring code in the service and move the OLE code into a separate script. When the service detects a file for processing, it invokes the other script with the name/path of the file as an argument. When the other script terminates, it will clean up its own memory and bypass the problem.

    This also opens up the possibility of using system 1, qq[yourOleScript.pl $theFile2BProcessed]; which will 'detach' (run asynchronously) the OLE script and allow your service to respond more quickly to the next file by processing multiple files concurrently.
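
    A rough sketch of that split, with hypothetical helper names; on Win32, system 1, LIST spawns the command without waiting for it to finish:

    use strict;
    use warnings;

    while ( my $file = wait_for_next_file() ) {        # your existing directory scan
        # synchronous: the service blocks until the worker has finished
        # system( qq[perl yourOleScript.pl "$file"] ) == 0
        #     or warn "worker failed for $file: $?";

        # asynchronous ('detached'), Win32 only: returns immediately
        system 1, qq[perl yourOleScript.pl "$file"];
    }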


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: memory (leaks and management)
by perrin (Chancellor) on Mar 28, 2007 at 13:29 UTC
    There's some advice from the book "Practical mod_perl" which applies to any situation with a long-running perl interpreter. It may help.

    In general, growth in a perl program does not indicate a leak. Perl uses a lot of memory and will take more when it needs it and rarely give any back.

    Your attempts to find the line where the memory increased may have been foiled by the fact that Perl allocates memory in buckets. It grabs a fistful of RAM at a time, so you only see the increases at the times when it runs out and needs to go back for more, not at the times when it actually uses the memory.

      I do not want to leave the impression that I am ungrateful. I am just busy putting all the above good advice to use.
      While not necessarily everything suggested is directly (or easily) applicable in this case, be assured that even the bits that are not so right now are not lost and will be remembered.

      At the moment, the program is still leaking, but I haven't lost hope. I have just tried the easiest bits first. I can see the wisdom of some of the more drastic recommendations, but some pills are bitter to swallow.

      Many thanks to all anyway.

      While I am digging, can anyone make an educated guess at the phenomenon whereby, when I stop the program (properly, by a caught signal) and keep watching the memory, said memory continues to increase (significantly) for a while, then starts decreasing until the program finally stops? I mean, for instance, does Perl have to copy stuff in order to be able to destroy it? The answer might give me a clue as to what is going on.

        Perl is running your END blocks and calling the DESTROY methods on objects that it's cleaning up.
        I mean, if I can find out what it is that gets destroyed at the end, maybe I can figure out where it comes from.
Re: memory (leaks and management)
by jmcnamara (Monsignor) on Mar 28, 2007 at 08:31 UTC

    I'll sidestep the more general question of how to track down memory leaks for a moment, and first ask which modules you are using and what types of files you are processing.

    That may give some hints to specific or known problems before addressing the more general.

    --
    John.

      Well, that was one of the reasons why I tried to keep things at a general level, so as not to scare anyone off. The program processes pretty much anything: Word, Excel, Powerpoint, OpenOffice, html, scanned document images, pdf, text, zipfiles, photos, audio, video, using a lot of modules (at the first level, never mind the ones those are calling in turn) like, but not limited to:
      # for starting OO2 as a subprocess
      use Win32;
      use Win32::Process;
      # i/f to FineReader
      use Win32::OLE qw(in valof HRESULT);
      # Win32::OLE->Option(Warn=>2, _Unique => 1); # for debug only
      use Win32::OLE::Variant qw(:DEFAULT nothing);
      use Win32::OLE::Enum;
      use Win32::OLE::Const;
      # Modules needed to parse various document types
      use HTML::TreeBuilder;
      use HTML::Entities ();
      use OpenOffice::OODoc;
      use Image::ExifTool;
      use MP3::Info qw(:all);
      
      I also use system() calls to run external programs.

      Now, before anyone gets excited about this or that module: I have used most of the above-listed modules before in another (quite different) incarnation of the same program, with no apparent leaks like the ones I'm having now.


        The main reason I asked was that I thought you might have been using Spreadsheet::ParseExcel to process your Excel files; that module has known issues of this kind.

        However, that is not the case so back to your main question. :-)

        --
        John.

Re: memory (leaks and management)
by zentara (Cardinal) on Mar 28, 2007 at 13:13 UTC
    The first thing that comes to mind with memory gains is to reuse all objects. When you go into your loops, don't create new objects; reuse the ones that you have already created. Of course this is just general advice, and you may not be able to clear out some objects. In that case, you can set up some global scalars to hold the objects, and when you create them in the loops, assign them to those same global names. With that scheme, your memory should be reused, rising to the peak level of the biggest object but not growing continually.
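
    In outline, with a made-up My::Converter class standing in for whichever object your loop keeps rebuilding (next_document() and deliver() are hypothetical too):

    my $converter = My::Converter->new;          # built once, before the loop

    while ( my $doc = next_document() ) {
        $converter->reset;                       # drop state left over from the previous doc
        deliver( $converter->extract_text($doc) );
    }
    # Memory should plateau around the largest document seen,
    # instead of climbing with every freshly constructed object.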

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum