in reply to Force perl to release memory back to the operating system

On the assumption that an RDBMS isn't in the cards, let's look at where you're chewing through memory, and see if we can't figure out a way to get by with less.
$balances{$cust} += $bal * $fx_rates->{$ccy};
is central to what you're doing. Unless there's some heuristic you can apply to exclude some customers (i.e., "forget these, since they'll never be anywhere near the top N"), you may be stuck here. How many customers do you have?

Moving on,

my @sorted_cust_list = reverse map {$_->[0]} sort {$a->[1]<=>$b->[1]} map{[$_,$balances{$_}]}keys %balances;
is the second hit, and it's a big one. First, keys %balances builds a big list, then the inner map builds another (plus one small anonymous array per customer), and sort, the outer map, and reverse each build one more. (Not all of these lists are live all the way through the pipeline, but still...) There's a way around this, though. Instead of using a Schwartzian transform, make a low-footprint pass through the hash, using
while ( my($cust,$balance) = each %balances ) { ... }
and keep a smaller data structure that records the top N customers by balance. (I'll leave the choice of algorithm as an exercise for the motivated reader.) This is a bit more work, but it might run faster given the reduction in memory demands and attendant reduction in swapping/thrashing.
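For readers who would rather not do the exercise, here is one minimal sketch of that idea. It is only one of several reasonable choices (a heap would also work); the helper name top_n_by_balance and the sample data are made up for illustration. It keeps a short list sorted largest-first and does a linear insertion, which is fine when N is small (say, 30).

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n largest [customer, balance] pairs
# from a hash reference, largest first, without sorting the whole hash.
sub top_n_by_balance {
    my ($balances, $n) = @_;
    my @top;    # kept sorted, largest balance first; never longer than $n

    return @top if $n < 1;

    keys %$balances;    # void-context keys resets the each() iterator
    while ( my ($cust, $balance) = each %$balances ) {
        # Once the list is full, skip anything that cannot displace the smallest entry.
        next if @top == $n && $balance <= $top[-1][1];

        # Linear scan for the insertion point; cheap because $n is small.
        my $i = 0;
        $i++ while $i < @top && $top[$i][1] >= $balance;
        splice @top, $i, 0, [ $cust, $balance ];

        pop @top if @top > $n;    # drop the entry that fell off the end
    }
    return @top;
}

# Illustrative data only.
my %balances = ( alice => 120.50, bob => 75.00, carol => 990.10, dave => 310.00 );
my @top = top_n_by_balance( \%balances, 2 );
printf "%-6s %10.2f\n", @$_ for @top;    # carol first, then dave
```

The extra memory here is bounded by N entries, regardless of how many customers are in %balances.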

At this point, the big memory sink, assuming you have lots of customers, is %balances. You can undef this at the end of the subroutine, but, as far as I know, this won't result in Perl releasing memory back to the OS (and it goes away at the end of the routine anyway). The last time I looked into it, which was a few releases of Perl ago, Perl indirectly used a "raise the watermark if necessary" memory allocation strategy, with no provision for ever lowering the watermark.

Re: Re: Force perl to release memory back to the operating system
by Roger (Parson) on Sep 25, 2003 at 12:35 UTC
    The first part of the collection process -
    $balances{$cust} += $bal * $fx_rates->{$ccy};
    calculates the total balance for each customer, where each customer can have transactions in multiple currencies. I have thought about applying a heuristic, but unfortunately I can't make assumptions about customer behaviour (not to mention the consequences of producing an approximate report).

    I am stuck with storing balances in memory. I don't want to export these balances into temporary files, because writing 10+ million individual accounts to disk and reading them back in again would carry a significant penalty. I will get rid of the memory-hungry Schwartzian transform and use a sorting technique that balances memory usage against speed.

    At the end of the selection process, I will have a list of the top 30 customers. Come to think of it, I will probably join the info for those 30 customers into a string of comma-separated values, and restart the perl script with the exec $^X, $0, @ARGV technique.

    I will go to sleep tonight with these ideas in my mind, I will probably get enlightened in my dreams. :-)

    And thanks again for your suggestions.
      Have you thought about using a tied hash? DB_File would work perfectly since you are using a large hash that is bigger than RAM. The Berkeley DB library handles all the nasty details of saving the data that won't fit in memory. In addition, DB_File supports in-memory databases that back to disk so you wouldn't even have to worry about creating and deleting a temp file.

      Since you are selecting a limited number of the highest values, keeping a list of the highest values seen so far while scanning through the entire hash will be much more efficient than operating on everything at once.
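
As a rough sketch of the tied-hash suggestion above: DB_File documents that passing undef as the filename creates an in-memory database, so no temp file has to be named or cleaned up by the script. The helper name new_db_backed_hash and the sample customers and rates are made up for illustration, and this assumes DB_File (and the Berkeley DB library underneath it) is installed.

```perl
use strict;
use warnings;
use Fcntl;      # O_RDWR, O_CREAT
use DB_File;    # needs the Berkeley DB library

# Hypothetical helper: return a reference to a hash tied to an
# in-memory Berkeley DB (undef filename means no named file on disk).
sub new_db_backed_hash {
    my %h;
    tie %h, 'DB_File', undef, O_RDWR | O_CREAT, 0600, $DB_HASH
        or die "Cannot tie DB_File hash: $!";
    return \%h;
}

my $balances = new_db_backed_hash();

# Same accumulation as in the original code, with illustrative numbers.
$balances->{cust_a} += 100 * 1.5;
$balances->{cust_a} += 50 * 0.5;
$balances->{cust_b} += 20 * 2;

# each() works on the tied hash too, so a single low-footprint pass is still possible.
while ( my ($cust, $bal) = each %$balances ) {
    print "$cust $bal\n";
}
```

Every read and write now goes through the DB layer, so expect each += to be slower than on a plain hash; whether that trade is worth it depends on how far past physical RAM the plain hash would go.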