the_slycer has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I've written a relatively complex program that uses the excellent ARS module to query a Remedy server for ALL tickets modified on the previous day. Once it has the entry numbers for the tickets, it runs another query for each ticket to get at the actual data in the ticket. This is relatively easy to accomplish with the above module. The problem comes in the way the data is stored.

Basically each entry is returned as a hash of many different fields from the schema. Some of these fields are simple single-string values, others are returned as array references, and yet others as hash references. The fields that come back as arrays are often arrays of hash references. Again, it's relatively easy to run through this and grab the data that I need. The problem I am running into is the amount of memory that this chews up.

A typical day will have around 9000 records. Some of the fields contain about 3k of data; I would guess that the total record for each entry runs about 5k on average, of which I'm using about 2k. However, the script at times uses up to 200 meg of memory. The server admins have come back and said that this is too much memory and that they need to find some other solution. At this point, I turn to you my brethren and ask for assistance. How can I get this down while maintaining a balance between CPU usage (which is currently very low) and memory usage (which is currently very high)?

The only thing that I can think of right off the top of my head is to spawn a new process for each ticket and run the query for that ticket's data in the child process, but I think that would have severe performance issues; a better balance is desired. If you would like to see the code, please let me know.

Re: Memory usage & hashes of lists of hashes of lists
by dragonchild (Archbishop) on Sep 28, 2001 at 01:24 UTC
    Instead of storing all the data for all the tickets before you process them, why not get the ticket numbers, then loop over them and actually process each ticket within the loop? Store the ticket data in a lexical; each ticket will then reuse the same memory, keeping your overhead low.
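
    A rough sketch of what I mean (ars_GetEntry and its handles are from ARSperl; write_ticket and $out_fh are just stand-ins for whatever per-ticket processing and output you do):

        foreach my $ticket (keys %tickets) {
            my %tick_inf;                                   # lexical -- storage is reused every pass
            (%tick_inf = ars_GetEntry($rem, $schema, $ticket))
                || warn "Could not retrieve $ticket: $ars_errstr";
            write_ticket(\%tick_inf, $out_fh);              # do all the per-ticket work here,
        }                                                   # then let %tick_inf go away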

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      I am processing them one by one; it appears that Perl keeps chewing up space anyway. Here's the relevant section of code:
      foreach (sort keys %tickets) {
          my $ticket = (split " ", $_)[0];
          my %tick_inf;                         # hash to hold ticket info
          (%tick_inf = ars_GetEntry($rem, $schema, $ticket))
              || warn "Could not retrieve $ticket: $ars_errstr";   # populate the hash with the ticket info
          my $field_data = $ticket;
          foreach (@$afield_names) {            # these are the fields that we have been asked to grab
              $field_data .= " " . $sep;        # join the previous data with the separator
              unless (exists $hfield_names->{$_}) {
                  print "Bad field name -- $_\n";
                  next;
              }
              unless (exists $tick_inf{ $hfield_names->{$_} }) {
                  print $hfield_names->{$_} . "\n";
                  print "Bad ticket info identifier -- $_ does not relate to: "
                      . $hfield_names->{$_} . "\n";
                  next;
              }
              unless (defined $tick_inf{ $hfield_names->{$_} }) {
                  $field_data .= " ";
                  next;
              }
              if (ref($tick_inf{ $hfield_names->{$_} }) eq "ARRAY") {   # for array references
                  my $data;
                  foreach (@{ $tick_inf{ $hfield_names->{$_} } }) {     # for each value of the array ref
                      $data .= expand_hash($_); # it's always a hash, so expand it out (sub returns a string)
                  }
                  $field_data .= $data;
                  next;                         # and hit the next value
              }
              if ($d_fields->{$_}) {            # for dates
                  $field_data .= format_date( $tick_inf{ $hfield_names->{$_} } );   # format the date (subroutine)
                  next;                         # and hit the next value
              }
              if ($l_fields->{$_}) {            # if the field is a selection value
                  $field_data .= $l_fields->{$_}->{       # change the number into the name:
                      $tick_inf{ $hfield_names->{$_} }    # fill the data from the $list field hash
                  };
                  next;
              }
              $field_data .= $tick_inf{ $hfield_names->{$_} };   # value is in the ticket
          } # close foreach (@$afield_names)
          # more stuff which gets rid of $field_data
      }
      UPDATE: I added a } to properly denote the end of the foreach (@$afield_names) loop. It is redefined at the beginning of each ticket (the foreach (sort keys %tickets) loop). Sorry about the bad formatting; this is what happens when you cut and paste code. The problem is not in $field_data (IMHO).

      Note that before this foreach loop runs, the memory usage is relatively low (~40 meg). As this portion of the script runs, the memory usage continues to increase (a steady climb, not an increase, then drop, then increase again). After this foreach loop we dump $field_data to a file and "undef" it.

      So, is there a bug that I am not seeing in there? Is there some reason that this code is chewing up so much space?
        You are storing all the information in some data structure. It just doesn't happen to be the HoLoHoL that you mentioned earlier; it's that string $field_data. Processing each ticket individually means that when you're done with the loop, there is nothing in memory but what you had before you entered it. :-)

        ------
        We are the carpenters and bricklayers of the Information Age.

        Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

        Why don't you try writing to the file as you go and clearing $field_data, rather than storing it all up until the end? That could be the problem.
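
        Something along these lines, for instance (build_field_data is a stand-in for the per-field work above, and OUT for a filehandle you've already opened):

            foreach (sort keys %tickets) {
                my $ticket     = (split " ", $_)[0];
                my $field_data = build_field_data($ticket);   # all the per-field work for one ticket
                print OUT "$field_data\n";                    # write it out right away
            }                                                 # $field_data is released after each pass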
        Do you get the same/similar memory usage with just a call to ars_GetEntry in the loop and no other processing? If not, then you're losing memory somewhere else in the loop.
Re: Memory usage & hashes of lists of hashes of lists
by toma (Vicar) on Sep 28, 2001 at 07:29 UTC
    You are getting about the efficiency that I would expect for a HoLoHoL. As a rule of thumb, an array is about 66% memory efficient. This efficiency comes from the doubling algorithm for allocating array memory.

    I don't know but I am guessing that a hash has about the same efficiency.

    The problem is that a deep memory structure multiplies these inefficiencies together. So you have
    9000 records * 5k / .66 / .66 / .66 / .66 = 237.2 meg
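
    If you want to measure the actual overhead of one of your entries, Devel::Size from CPAN (assuming you can install it there) will report it, nested references included:

        use Devel::Size qw(total_size);

        my %tick_inf = ars_GetEntry($rem, $schema, $ticket);
        print "This entry occupies ", total_size(\%tick_inf), " bytes in memory\n";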

    Technical solution 1
    One solution is to use a flatter data structure. If you use a single-level hash with a key of Rem|Schema|Ticket, it will use much less memory. Of course it will also require more code and more CPU.
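
    For example (the Status field name is made up, just to show the shape of the key):

        # One flat hash; the key carries the path that the nested structure used to.
        my %flat;
        $flat{"$rem|$schema|$ticket|Status"} = $status;

        # Get things back by splitting the key apart again.
        for my $key (keys %flat) {
            my ($rem, $schema, $ticket, $field) = split /\|/, $key;
            # ... use $flat{$key} ...
        }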

    Technical solution 2
    Presize the arrays so that they don't allocate so much memory. This is easy for simple arrays but I have not seen it done for deeper data structures.
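
    For the flat cases it looks like this (9000 being the day's record count from the original post):

        my @records;
        $#records = 8_999;           # pre-extend the array to 9000 elements

        my %by_ticket;
        keys(%by_ticket) = 9_000;    # pre-allocate hash buckets (keys used as an lvalue)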

    Political solution 1
    It sounds to me like your server admins need to find more important work than making you save 200 meg of memory. Perhaps that issue is best saved for when you get asked for comments on their employee evaluation :-). More likely is that you have created a really fast application that is embarrassing the server people, and they want you to slow it down so their solutions don't look so bad in comparison. So they complain about a trivial amount of memory. If you offer to teach the server admins perl, they may stop complaining. It worked for me once!

    It should work perfectly the first time! - toma

Re: Memory usage & hashes of lists of hashes of lists
by perrin (Chancellor) on Sep 28, 2001 at 01:28 UTC
    Are you loading the tickets one at a time? If not, you probably should be. If you are, it sounds like you aren't cleaning things up properly and should post some code for people to look at and find the bug.

    If it's difficult to get the data one record at a time, you should consider which is more expensive: a little more RAM, or a couple days of your programming time. Sometimes resource efficiency isn't the most cost-effective solution.

Re: Memory usage & hashes of lists of hashes of lists
by the_slycer (Chaplain) on Sep 28, 2001 at 19:07 UTC
    Ok, another day of messing around with this has produced some rather bizarre results.
    I've stripped the code down to a "bare minimum", and am still running into issues:
    foreach (sort keys %tickets) {
        my $ticket = (split " ", $_)[0];
        my %tick_inf;
        (%tick_inf = ars_GetEntry($rem, $schema, $ticket))
            || warn "Could not retrieve $ticket: $ars_errstr";
        ++$count;
        print "$count\n";
        next;
        # all the other stuff from yesterday
    }
    Note that I added the $count lines in there.

    The results show that we are still chewing up huge amounts of memory, but at two different times. I ran this with a small subset of entries (about 700). At the start of the foreach loop, the memory usage steadily climbed to about 225 meg by the time about 100 tickets had been retrieved; then it started to drop, and the script ran the rest of the time at about 50 meg.

    What REALLY astonished me was at the end of the script. The last lines in the script (immediately following the call to the subroutine that this loop is in) read:
    print "Script ended on " . localtime(time); print "\n";
    But for some reason, once I saw the print-out saying that it ended, the memory usage jumped BACK up to 225 meg, and the script took another 2 or 3 minutes to exit. WTF is that? Clearing the buffer or something along those lines?? There is honest to god NOTHING else after those lines above, yet I clearly saw the output before the memory usage jumped back up.

    The first portion above is pretty consistent with what I was seeing previously; the second thing, to me, is really bizarre. -- HELP -- :-)