Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a long-running script that needs to cache some data for speed reasons, but after a while it starts to consume huge amounts of memory. Looking through its memory usage, I have narrowed down the overhead on small arrays as a major culprit. The data in question is stored as a number of 6-element arrays. Using Devel::Size::Report, I get the following output:
  Array ref 440 bytes (overhead: 232 bytes, 52.73%)
    Scalar 56 bytes
    Scalar 56 bytes
    Scalar 24 bytes
    Scalar 24 bytes
    Scalar 24 bytes
    Scalar 24 bytes
Half of my application's memory usage comes from overhead on these Arrays! Is there any way to either reduce the overhead of an array, or find a more compact storage method?

Replies are listed 'Best First'.
Re: array overhead
by BrowserUk (Patriarch) on Jan 05, 2011 at 21:39 UTC

    If you join your six values into a string rather than an array, you'll save a substantial amount of memory whilst leaving the individual values quickly retrievable:

    $x = [ 123.45, 123.456, 123,456,789,012 ];;
    print total_size( $x );;
    384

    $y = "123.45,123.456,123,456,789,012";;
    print total_size( $y );;
    80
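    For example, a quick sketch of storing and retrieving a record this way (the sample record is made up, and it assumes the separator never appears in the data itself):

        my @rec = ( '2011-01-05', 123.45, 123.46, 122.50, 123.00, 456789 );   # made-up sample record
        my $row = join ',', @rec;                  # one scalar instead of an array ref
        my @fields      = split /,/, $row;         # all six values back
        my $close_price = (split /,/, $row)[4];    # or just one of them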

    You might save a little more by packing them, but not much.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      If you use this, make sure to choose a separator that won't appear in your data domain. If you use "," and then start getting European-style formatted numbers that use it as the decimal indicator rather than "." . . . ow. Perl will handle (say) a NUL ("\0") quite happily, and you can be pretty sure it's not going to crop up in your inputs.

      (Not that that may be an issue for the OP but it was something that came to mind from BrowserUk's choice of sample data :)
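      A rough, untested sketch of that idea (the sample values are made up):

          my @fields = ( '2011-01-05', 123.45, 123.46, 122.50, 123.00, 456789 );
          my $row    = join "\0", @fields;    # NUL is vanishingly unlikely in numeric or date fields
          my @back   = split /\0/, $row;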

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

        but it was something that came to mind from BrowserUk's choice of sample data :)

        My "choice of sample data" was simply a guess at approximating the sizes of the elements shown in the OP.

        In the unlikely event that the OP's data actually includes floating point values, and his systems are configured to use ',' as the decimal separator, and he is dumb enough to take the commas in my string as a recommendation of a separator, completely missing the conflict that arises from doing so, then he will discover his error the first time he runs his program.

        And unless he is silly enough to make that first run a production run on a system in a hospital, or a nuclear power plant, or an aircraft--does anyone use Perl in critical systems?--then there will almost certainly be no harm done.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: array overhead
by roboticus (Chancellor) on Jan 05, 2011 at 21:42 UTC

    You didn't provide much information, so it's hard to offer specific tips. You can look at pack and unpack to compress your data into a single scalar, removing the array overhead and possibly a good deal of the scalar overhead for the individual entries as well. For example, if your six values are integers, your data would compress to a single scalar holding a 48-byte packed string. You'd be trading a bit of speed (the pack/unpack time during access) for space.
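    Something along these lines (an untested sketch, assuming a Perl with 64-bit integers; the sample values are made up):

        use Devel::Size qw(total_size);

        my @vals   = ( 20110105, 12345, 12346, 12250, 12300, 456789 );
        my $packed = pack 'q6', @vals;        # six signed 64-bit ints -> one 48-byte string
        my @back   = unpack 'q6', $packed;    # pay a little CPU on each access

        print total_size(\@vals), " vs ", total_size($packed), "\n";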

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: array overhead
by Fletch (Bishop) on Jan 05, 2011 at 21:43 UTC

    Throwing out some ideas at random:

    • Make sure that your data is properly scoped.
    • You say it's caching but do you clear unneeded data out after however long (so long as something has a reference to it things won't get garbage collected and will sit around)?
    • Maybe rather than keeping things in RAM you might use MLDBM or Berkeley_DB and cache some or all of your data to disk instead. If you're caching to avoid recomputing something computationally expensive, the hit for IO may be small enough that the tradeoff is still worth it.
    • Rewrite your program in C.
    • Less drastically use memcache or redis or the like (a distributed key/value store) and its (hopefully) more memory efficient representation.

    Update: Hrmm, you mention Tokyo Cabinet in another reply above, so that probably negates at least two of my suggestions. Probably the first join suggestion or using pack/unpack is going to be the way to go. You might also see if you can get an unpack incantation that will work with your TC arrays.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: array overhead
by Marshall (Canon) on Jan 06, 2011 at 00:43 UTC
    cache some data for speed reasons, but after a while it starts to consume huge amounts of memory.
    Sounds like your caching algorithm needs improvement. From watching long-lived Perl programs, a typical pattern is that memory usage increases until some maximum level is reached and then stays there. Perl does not ever release memory back to the OS. Once Perl has it, it keeps it. However, Perl can recycle memory for its own use when you tell it that some memory is no longer in use.

    Perl uses reference counting to decide whether some piece of memory is still in use. If you have a reference to some array and you undefine that reference to tell Perl you aren't using it anymore, Perl will re-use that memory, provided there isn't some other reference to it.

    If your cache remembers everything for all time, then compressing the data just delays the inevitable. Eventually you will run out of memory if you keep generating stuff that is cached without removing anything from the cache!

    If this data is "stable", meaning usable for a very long time, then some DB solution might be appropriate. Another thought: I guess a simple-minded approach would be to wait until the cache reaches a certain number of entries and then "clear" it by undef'ing the values in the cache. This won't release the memory back to the OS, but Perl will re-use it. Your program would "hiccup" and get slower for a while when that happened, as each value would have to be re-computed. But along with zapping everything, data that hadn't been used for a very long time would get "cleaned out" as well.
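    A crude sketch of that "zap everything" approach (the cache structure and limit are made up):

        my %cache;
        my $MAX_ENTRIES = 100_000;    # made-up limit

        sub cache_set {
            my ($key, $value) = @_;
            %cache = () if keys(%cache) >= $MAX_ENTRIES;   # Perl re-uses this memory; the OS doesn't get it back
            $cache{$key} = $value;
        }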

    More advanced schemes would have to keep track of when some data was last used; I'm not sure how difficult that would be to implement in your app. Another thought: I don't know how well the various "Tie" modules work, but the idea is that an array is mapped to some disk file and only a subset of the data is kept in memory. If things work well, only the "recent" stuff in the file winds up being memory resident. I haven't experimented with this, and I don't know how well it works, in particular how to regulate the maximum memory that the "tied" file uses. But in this type of scheme, the file would continue to get larger while the memory footprint would not.
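    For instance, Tie::File maps an array to the lines of a file and lets you cap the size of its in-memory read cache with its memory option (an untested sketch; the file name, limit and record are made up):

        use Tie::File;

        tie my @rows, 'Tie::File', 'cache.dat', memory => 20_000_000
            or die "Cannot tie cache.dat: $!";

        push @rows, join ',', '2011-01-05', 123.45, 123.46, 122.50, 123.00, 456789;
        my @fields = split /,/, $rows[0];    # rows live on disk; only recently touched ones stay in RAM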

    It would be helpful if you could explain more about your application. If we can compress the data by 1/2, then that only "puts off the day of reckoning" by a factor of 2, so I suspect that is not a great solution.

    This statement: "after a while it starts to consume huge amounts of memory" appears to be key. What do you know about why that could be happening? It seems to mean that the program works "ok" for a while and then something "new" happens. What do you, as the app developer, think that is? What kind of problem symptom occurs? Consuming a "huge amount" of virtual memory may or may not be a problem. The OS may be "saving your butt" by swapping parts of memory out to disk.

    If your problem is: "I have a process that continually grows memory requirements without bound", no amount of "compressing the data" is going to solve that problem, just delay it.

      I do not have a process whose memory requirements continually grow without bound. It is looking at stock market data for a given window of time. Data in the cache that is newer than that window hasn't been read yet, and data older than that window is removed automatically. TIEing the arrays is not an option because the whole point of the cache is to avoid disk access. A DB solution is *already* in play - that's where the data initially comes from, and the cache is there to prevent repeatedly loading the same data from the database.

        It's hard to give any specific suggestions without some more idea of your code, but I did want to point out one option that hasn't been mentioned: SQLite offers in-memory databases. It might be worthwhile as a cache, in some cases.
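        A bare-bones sketch of that (assuming DBD::SQLite is installed; the table and sample values are made up):

            use DBI;

            my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                                    { RaiseError => 1, AutoCommit => 1 } );

            $dbh->do('CREATE TABLE bars (date TEXT, open REAL, high REAL, low REAL, close REAL, volume INTEGER)');

            my $ins = $dbh->prepare('INSERT INTO bars VALUES (?,?,?,?,?,?)');
            $ins->execute( '2011-01-05', 123.45, 123.46, 122.50, 123.00, 456789 );

            my $row = $dbh->selectrow_arrayref('SELECT * FROM bars WHERE date = ?', undef, '2011-01-05');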

        You have a high performance system. You are blowing data in from the stock market and sucking data out with some analysis program.

        In between those two processes is a buffer. I would certainly consider making that buffer a fixed size. How big it needs to be is determined by the burst rates of the input and output.

        You are completely lost if the consuming process cannot, on average over a short time window, consume data faster than the producing process generates it - the system just cannot work in the high-performance mode that you envision - it will "get behind" and you will have buffer overrun problems.

        If the buffer starts to panic and ask for even more system resources as it gets too full, this typically will auger into the ground very quickly and there is a system crash. How to lose data as gracefully as possible is one of the considerations. Sorry that I can't help you more. High-performance system design is tough - real-time system design is even tougher.

Re: array overhead
by SuicideJunkie (Vicar) on Jan 05, 2011 at 21:42 UTC

    When determining the optimum data structure, it is important to know how you will be accessing the data inside, but you have not given many hints as to that.

    What are you building them from? How are you using them after? Do you search them? Use them like objects?

    The content of arrays is important too. Maybe you can pack them, or concatenate the values into a single string, and split them back out on demand. If you've got way more CPU than memory available, maybe even compress the data. If you're just going to run out of memory regardless, then you have to start saving it to disk.

      The data in question is "built" by fetching from an external source - either a database, or a TokyoCabinet file - and is financial data - a string for the date, four floats, and an integer. After reading them in they will be accessed (mostly) sequentially in their raw form - as arrays. Basically, I need the arrays to be arrays, but not arrays with so much overhead. I'd prefer to avoid too much CPU overhead if at all possible (speed is a secondary concern), but pack or unpack might be an option.
Re: array overhead
by ikegami (Patriarch) on Jan 05, 2011 at 22:35 UTC

    $a=[@$a] can be used to shrink the array. (Well, to create a new one of the minimal size.)

    $ perl -MDevel::Peek -e'push @$a, $_ for qw( a b c d e ); Dump($a,1); $a=[@$a]; Dump($a,1);'
    SV = IV(0x817bc44) at 0x817bc48
      REFCNT = 1
      FLAGS = (ROK)
      RV = 0x816c158
      SV = PVAV(0x816d1c8) at 0x816c158
        REFCNT = 1
        FLAGS = ()
        ARRAY = 0x8177928
        FILL = 4          <----- 5 visible elements
        MAX = 11          <----- 12 elements allocated
        ARYLEN = 0x0
        FLAGS = (REAL)
    SV = IV(0x817bc44) at 0x817bc48
      REFCNT = 1
      FLAGS = (ROK)
      RV = 0x816cea8
      SV = PVAV(0x816d1dc) at 0x816cea8
        REFCNT = 1
        FLAGS = ()
        ARRAY = 0x8183cb8
        FILL = 4          <----- 5 visible elements
        MAX = 4           <----- 5 elements allocated
        ARYLEN = 0x0
        FLAGS = (REAL)

    But you might only end up fragmenting your memory, leaving less usable free memory behind.

      Yeah, 232 bytes is a lot of overhead for an array. My back-of-the-envelope calculations show a clean, minimal array should have an overhead of about 60 bytes on a 32-bit system, so that shrinking should make a difference. I doubt that fragmentation will be an issue. The OP probably wants to shrink the SV holding the integer value too, since it appears to be 56 bytes and so probably still holds a string as well ($a[5] = int $a[5] should do it).

      Dave.

        Assigning a number to a scalar doesn't free the string buffer, it just marks it as unused. Your "solution" can actually increase the memory used.

        $ perl -MDevel::Size=total_size -E'
            $a[5]="5678";
            say total_size($a[5]);
            $x=int($a[5]);
            say total_size($a[5]);
        '
        40
        44

        undef $s (but not $s = undef;) will free the string buffer.

        $ perl -MDevel::Size=total_size -E'
            $a[5]="5678";
            say total_size($a[5]);
            $t=int($a[5]);
            undef $a[5];
            $a[5]=$t;
            say total_size($a[5]);
        '
        40
        36

        It doesn't downgrade the scalar from PVIV to IV, though. You'd need to start with a new scalar to do that.

        $ perl -MDevel::Peek -MDevel::Size=total_size -E'
            $a[5]="5678";
            say total_size($a[5]);
            $t=int($a[5]);
            delete $a[5];
            $a[5]=$t;
            say total_size($a[5]);
        '
        40
        16
Re: array overhead
by Anonyrnous Monk (Hermit) on Jan 05, 2011 at 21:46 UTC
    Is there any way to either reduce the overhead of an array

    No. The overhead is a result of the way data structures are represented Perl-internally (see PerlGuts Illustrated).

    or find a more compact storage method

    Most likely yes. The details depend on what exactly you are storing in the arrays, and how easily and fast you want to access it.

      One string, four floats, and an integer. I'd like to access it as quickly as possible - that is to say with as little CPU overhead as I can manage.

        In this case I'd pack each set (with the fixed-sized elements (floats/int) first) into a single (binary) string.
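        Something like this (a sketch; the template assumes four doubles, a 32-bit integer and a trailing date string, and the sample values are made up):

            # fixed-size fields first, variable-length date string last
            my $packed = pack 'd4 l a*', 123.45, 123.46, 122.50, 123.00, 456789, '2011-01-05';

            # 'a*' soaks up whatever is left, so the date comes back intact
            my ($open, $high, $low, $close, $volume, $date) = unpack 'd4 l a*', $packed;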

Re: array overhead
by talexb (Chancellor) on Jan 06, 2011 at 18:21 UTC

    This may sound facile, but have you considered installing more memory?

    Another alternative is to do a little research into how long your script runs, and at what point it ".. starts to consume huge amounts of memory." Is this after 20 minutes? Five hours? Three days? Two weeks? If your script looks at the most recent four hours, and starts to bog down after three days, then presumably you could restart it every two days as a temporary solution.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds