vlakkies has asked for the wisdom of the Perl Monks concerning the following question:

I have a HUGE array $data[$i][$j], where $i typically runs over 0..99999 and $j over 0..999 or so. The problem is that garbage collection takes several minutes, so the program is very slow to terminate. Doing a
foreach $i (0..whatever) { $data[$i] = (); }
helps the garbage collector work faster by about a factor of two or three. Switching the order of $i and $j seems to make things slower. It seems there ought to be a better way. What can I do to help the garbage collector do this faster?

Re: Helping the garbage collector
by maverick (Curate) on Mar 14, 2002 at 23:24 UTC
    That's a LOT of stuff to keep around in memory...and I've never needed that many array elements. Could you redesign the program somehow so that you don't have to have all 100,000,000 entries in memory?

    Could you change your processing to a line-at-a-time filter?

    Could you maybe use a database of some sort to store this data for you? (they're designed to deal with this sort of stuff)

    What's the goal of the program? maybe we could give you some better help?

    /\/\averick
    perl -l -e "eval pack('h*','072796e6470272f2c5f2c5166756279636b672');"

Re: Helping the garbage collector
by perrin (Chancellor) on Mar 14, 2002 at 23:28 UTC
    If you don't need destructors or END blocks called, you can skip the cleanup altogether. Use POSIX::_exit($status). See perldoc -f exit for more.
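A minimal sketch of that idea, assuming nothing in the program relies on END blocks or destructors. The helper name fast_exit and the array sizes are made up for illustration, and the call is demonstrated in a forked child only so the exit status can be inspected. Note that POSIX::_exit does not flush buffered I/O, so the sketch flushes by hand first.

```perl
use strict;
use warnings;
use POSIX ();
use IO::Handle;

# Hypothetical helper: flush output ourselves, then leave without any teardown.
sub fast_exit {
    my ($status) = @_;
    STDOUT->flush;           # _exit does not flush buffered I/O
    STDERR->flush;
    POSIX::_exit($status);   # no END blocks, no destructors, no freeing of @data
}

# Demonstrated in a child process so the exit status is visible to the parent.
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    # Small stand-in for the real array of arrays.
    my @data = map { [ (0) x 100 ] } 0 .. 999;
    fast_exit(3);
}
waitpid($pid, 0);
my $child_status = $? >> 8;
print "child exited with status $child_status\n";
```

In a real script the call would simply be the last statement, after all output has been written and any database handles have been committed and closed.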
Re: Helping the garbage collector
by shotgunefx (Parson) on Mar 14, 2002 at 23:23 UTC
    Have you tried undef on @data?
    undef @data;

    As an aside, minutes still seems like a looong time.

    -Lee

    "To be civilized is to deny one's nature."
Re: Helping the garbage collector
by vlakkies (Initiate) on Mar 15, 2002 at 01:30 UTC
    Thanks for the suggestions I got so far. I'm doing this in memory to clean up the data (nice and fast) before dumping it to a database. Here are some timing results for a real example with $i = 0..21448 and $j = 0..91 on an Athlon 1.2GHz with 1GB RAM. Each timing includes creating the array (about 5 sec); the rest of the time goes into garbage collection. Total size of this process is about 112 MB, so about 10% of machine RAM. CPU usage is about 99+ percent and there is no swapping. (This is a small example that I can test in finite time :-( )
    undef @data;
    This takes 2 min 56 sec. Now do
    for (my $i=0;$i<@data;$i++) { undef @{$data[$i]}; }
    This takes 1 min 39 sec. Finally do
    for (my $i=0;$i<@data;$i++) { $data[$i] = (); }
    This takes 43 seconds. Adding @data = () to the end of the last example really makes no difference performance-wise.
Re: Helping the garbage collector
by Ryszard (Priest) on Mar 15, 2002 at 02:59 UTC
    Have you considered using the Benchmark module to time your results?
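As a rough sketch of what that could look like, the following times three teardown strategies from this thread with Benchmark. The build helper and the array dimensions are illustrative assumptions, not the original poster's code, and this measures explicit teardown inside the running program rather than the cost at process exit.

```perl
use strict;
use warnings;
use Benchmark qw(:hireswallclock timediff timestr);

# Build a small array of arrays (sizes are illustrative only).
sub build {
    my ($rows, $cols) = @_;
    return map { [ (1) x $cols ] } 1 .. $rows;
}

# Each strategy receives a reference to the outer array.
my %teardown = (
    'undef @data'    => sub { my $d = shift; undef @$d; },
    'undef each row' => sub { my $d = shift; undef @$_ for @$d; },
    'empty each row' => sub { my $d = shift; @$_ = () for @$d; },
);

my %rows_left;
for my $name (sort keys %teardown) {
    my @data = build(2000, 100);
    my $t0 = Benchmark->new;
    $teardown{$name}->(\@data);
    my $t1 = Benchmark->new;
    $rows_left{$name} = scalar @data;   # outer elements remaining afterwards
    printf "%-16s %s\n", $name, timestr(timediff($t1, $t0));
}
```

The numbers depend heavily on the Perl version, the allocator, and the data size, so they are only meaningful when run on the real data set.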

    I reckon there is a better technique of parsing your elements than loading them all into memory, processing, then collecting the garbage.

    Where is your data coming from?

  • If it's DBI, you could do some funky chicken with selects and loops.
  • If it's a flat file, you could parse one line at a time, process it, and stick it on a stack to be whisked off to your db. (Not sure if the overhead of many I/O operations would take more time than processing 100,000,000 rows in memory.)
  • If your data is sparse, perhaps you could do some "preprocessing" on the source before the "real" processing begins.
  • What about fork, or more scripts, if you have multiple data sources?
  • If you are collecting your garbage just before your script ends, you may not need to, as perl will free the memory when it ends!
  • Are you collecting the garbage to increase the performance of your DBI calls (i.e. to avoid paging to disk)? Is there a trade-off to be considered here?
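The flat-file suggestion above might be sketched like this. The clean_row helper, the comma-separated layout, and the in-memory input are all assumptions for illustration; a real program would read the actual file and hand each cleaned row to the database instead of printing it.

```perl
use strict;
use warnings;

# Hypothetical cleanup rule: skip blank lines and trim whitespace in each field.
sub clean_row {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^\s*$/;
    my @fields = split /,/, $line, -1;
    s/^\s+|\s+$//g for @fields;
    return \@fields;
}

# An in-memory file handle stands in for the real flat file.
my $input = " 1, 2 ,3\n\n4,5, 6 \n";
open my $in, '<', \$input or die "cannot open input: $!";

my $rows = 0;
while (my $line = <$in>) {
    my $row = clean_row($line) or next;
    # A real program would pass @$row to the database here.
    print join('|', @$row), "\n";
    $rows++;
}
close $in;
print "$rows rows processed\n";
```

Only one row is alive at any time, so there is no large structure left to tear down when the program ends.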

    Don't know if this will help, but as they say (and I am more than aware of): "You can't think of everything all the time."

    HTH

Re: Helping the garbage collector
by Juerd (Abbot) on Mar 15, 2002 at 06:31 UTC
        $data[$i] = ();

    In that line, you assign an empty list to $data[$i], which is a scalar. The empty list has to be "converted" to undef, so I guess that takes a little time. With this solution, all the values are deleted, but the elements themselves still exist (as undef). If that's what you want, for example because you want to refill the array, use:

        $_ = undef for @data;

    If you want to clear the entire array, use:

        @data = ();

    What you probably meant to use is:

    for (@data) { @$_ = (); }
    hth
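A small sketch of the three cases described above, on a toy three-row array. The variable names other than @data are made up for illustration.

```perl
use strict;
use warnings;

my @data = ([1, 2], [3, 4], [5, 6]);

# 1. Scalar assignment of an empty list: the slot becomes undef,
#    but the outer element is still there.
$data[0] = ();
my $slot_defined  = defined($data[0]) ? 1 : 0;
my $count_after_1 = scalar @data;

# 2. Empty the remaining rows in place: the row arrays survive, with no elements.
for my $row (@data[1, 2]) { @$row = (); }
my $row_len       = scalar @{ $data[1] };
my $count_after_2 = scalar @data;

# 3. Clear the outer array itself.
@data = ();
my $count_after_3 = scalar @data;

print "after 1: slot defined=$slot_defined, outer elements=$count_after_1\n";
print "after 2: row length=$row_len, outer elements=$count_after_2\n";
print "after 3: outer elements=$count_after_3\n";
```

Only the last form shrinks the outer array; the first two leave 3 slots behind, holding undef or empty rows.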

    U28geW91IGNhbiBhbGwgcm90MTMgY
    W5kIHBhY2soKS4gQnV0IGRvIHlvdS
    ByZWNvZ25pc2UgQmFzZTY0IHdoZW4
    geW91IHNlZSBpdD8gIC0tIEp1ZXJk
    

    Edit by dws for tag cleanup