Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This is yet another variation on the old chestnut of trying to trim down the memory usage of a long-running Perl script. I have a long-running (but not permanently running) process - a batch job, we'll call it - that analyzes financial data from the stock market. Specifically, it looks at rows of stock data that consist of a date (a string), four floats, and an integer. It pulls this data from a database or a TokyoCabinet file, and a major source of overhead is this I/O, so it caches rows to reduce access to the disk. Because of the underlying processing being done, I can make very strong assumptions about how data in the cache will be used, and I aggressively delete data that falls outside of a given time window. However, even with this deletion, the cache can end up consuming huge amounts of memory. Using Devel::Size and Devel::Size::Report, I have identified the overhead of small arrays as a major culprit:
Array ref 440 bytes (overhead: 232 bytes, 52.73%)
  Scalar 56 bytes
  Scalar 56 bytes
  Scalar 24 bytes
  Scalar 24 bytes
  Scalar 24 bytes
  Scalar 24 bytes
Each stored row has an overhead of 50% - half the memory being used by this application is array overhead! I have tried packing and unpacking the data to store it as scalars, but this causes performance to be even slower than without the cache. TIEing the arrays to disk would defeat the purpose of the cache. So - is there any way at all to reduce the overhead of these arrays that doesn't involve pack and unpack or deleting from the cache (which I'm already doing)?
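For reference, a simplified sketch of the caching scheme (not the actual code; the key layout, field order, and eviction rule here are only approximations):

use strict;
use warnings;

# One cached row per date key: [ date, four floats, integer ].
my %cache;

sub cache_row {
    my ( $date, @fields ) = @_;
    $cache{$date} = [ $date, @fields ];
}

# Aggressively drop anything that falls outside the active time window.
sub trim_cache {
    my ($oldest_wanted) = @_;    # e.g. '20101207'; YYYYMMDD compares correctly as a string
    delete @cache{ grep { $_ lt $oldest_wanted } keys %cache };
}

cache_row( '20110106', 4663.09, 274.66, 53915.41, 11352.54, 145 );
trim_cache( '20101207' );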

Replies are listed 'Best First'.
Re: bit by overhead
by BrowserUk (Patriarch) on Jan 06, 2011 at 18:12 UTC
    So - is there any way at all to reduce the overhead of these arrays that doesn't involve pack and unpack

    Why exclude solutions that involve pack and unpack?

    C:\test>p1
    $x = ['20110106', map( rand( 1e5 ), 1..4 ), int( rand 1000 ) ];;
    print total_size $x;;
    440
    print for @$x;;
    20110106
    4663.0859375
    274.658203125
    53915.4052734375
    11352.5390625
    145

    Using join & split (72% saving):

    $y = join $;, @$x;;
    print total_size $y;;
    120
    print for split $;, $y;;
    20110106
    4663.0859375
    274.658203125
    53915.4052734375
    11352.5390625
    145
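    (If $; is unfamiliar: it is Perl's subscript separator, which defaults to "\034", a control character that will not appear in date strings or stringified numbers; that is what makes the join/split round trip safe for this data.)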

    Using pack & unpack (78% saving):

    $z = pack 'c/A* d4 n', @$x;;
    print total_size $z;;
    96
    print for unpack 'C/A* d4 n', $z;;
    20110106
    4663.0859375
    274.658203125
    53915.4052734375
    11352.5390625
    145
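    (In that template, the date is stored as a length-prefixed string ('c/A*' on pack, 'C/A*' on unpack; either length type works for strings this short), 'd4' packs the four floats as native doubles, and 'n' stores the integer as an unsigned 16-bit value, which is fine while it stays below 65536; use 'l' or 'i' if it can grow larger.)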

    Of course, there is some performance penalty for splitting or unpacking the arrays when they are used, but it's not onerous:

    #! perl -slw
    use strict;
    use Devel::Size qw[ total_size ];
    use Benchmark qw[ cmpthese ];

    my @AoA = map[ '20110106', map( rand( 1e5 ), 1..4 ), int( rand 1000 ) ], 1 .. 1e5;
    my @AoS = map{ join $;, @$_; } @AoA;
    my @AoP = map{ pack 'C/A* d4 i', @$_; } @AoA;

    print 'AoA: ', total_size \@AoA;
    print 'AoS: ', total_size \@AoS;
    print 'AoP: ', total_size \@AoP;

    cmpthese 10, {
        AoA => sub {
            my $sum;
            for my $i ( 0 .. $#AoA ) {
                $sum += $AoA[ $i ][ $_ ] for 1 .. 5;
            }
            # print $sum;
        },
        AoS => sub {
            my $sum;
            for my $i ( 0 .. $#AoS ) {
                my @a = split $;, $AoS[ $i ];
                $sum += $a[ $_ ] for 1 .. 5;
            }
            # print $sum;
        },
        AoP => sub {
            my $sum;
            for my $i ( 0 .. $#AoP ) {
                my @a = unpack 'C/A* d4 i', $AoP[ $i ];
                $sum += $a[ $_ ] for 1 .. 5;
            }
            # print $sum;
        },
    };
    __END__
    C:\test>880868
    AoA: 70400176
    AoS: 13549296    80% saving
    AoP: 10400176    85% saving
          Rate  AoS  AoP  AoA
    AoS 0.977/s   -- -52% -78%
    AoP  2.05/s 110%   -- -53%
    AoA  4.39/s 349% 114%   --
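    (Reading the cmpthese output: plain array access (AoA) runs at roughly 4.4 iterations/s against about 2.1/s for the packed form (AoP) and about 1/s for join/split (AoS), so unpacking on every access costs about half the raw access speed in exchange for the ~85% memory saving.)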

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      "I have tried packing and unpacking the data to store it as scalars, but this causes performance to be even slower than without the cache." It turns out that the overhead is onerous - or at least onerous enough that I may as well just not even use the cache.
        "I have tried packing and unpacking the data to store it as scalars, but this causes performance to be even slower than without the cache." It turns out that the overhead is onerous - or at least onerous enough that I may as well just not even use the cache.

        Then you are doing it wrong.

        There is no way that unpacking six values from a packed string should be slower than reading those same six values from disk. And querying them from a DB will be far slower still.

        I guess it is time you started posting some of your code and let us see what you are doing wrong.
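        For what it's worth, here is one rough way to sanity-check that claim on your own machine (an illustrative sketch, not code from this thread; the fixed-width record layout is an assumption): keep packed rows both in memory and in a file, then compare unpacking a cached scalar against seeking and re-reading the same record from disk.

        use strict;
        use warnings;
        use Benchmark qw[ cmpthese ];

        my $tmpl   = 'A8 d4 l';    # fixed width: 8-byte date, four doubles, 32-bit int
        my $reclen = length pack $tmpl, '20110106', ( 0 ) x 4, 0;

        # Build a small file of packed rows, keeping a copy in memory as the "cache".
        my $file = 'rows.bin';
        open my $out, '>:raw', $file or die $!;
        my @cache;
        for ( 1 .. 10_000 ) {
            my $rec = pack $tmpl, '20110106', map( rand( 1e5 ), 1 .. 4 ), int rand 1000;
            print {$out} $rec;
            push @cache, $rec;
        }
        close $out;

        open my $in, '<:raw', $file or die $!;

        cmpthese -3, {
            cached_unpack => sub {
                my @row = unpack $tmpl, $cache[ int rand @cache ];
            },
            disk_reread => sub {
                my $i = int rand @cache;
                seek $in, $i * $reclen, 0 or die $!;
                read $in, my $rec, $reclen;
                my @row = unpack $tmpl, $rec;
            },
        };

        close $in;
        unlink $file;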


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: bit by overhead
by afoken (Chancellor) on Jan 06, 2011 at 16:57 UTC
    This is yet another variation ...

    ... of array overhead. So why did you start another thread?

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      Because without exception, every single answer in that thread pointed out that I didn't give enough detail for them to answer the question. :) Also, I've tried the most popular suggestion and found it doesn't work very well in this case.
        Because without exception, every single answer in that thread pointed out that I didn't give enough detail for them to answer the question.

        And why didn't you post the missing details right there, where your problem was (and still is) already discussed?

        Also, I've tried the most popular suggestion and found it doesn't work very well in this case.

        Another little detail that belongs in the other thread.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: bit by overhead
by ikegami (Patriarch) on Jan 06, 2011 at 17:30 UTC
    What caused those arrays to have that overhead in the first place? Are you populating them using push? Populate them using an assignment instead.

      To OP, you might find this helpful: Difference between this array assignment and push

      To ikegami - I'm a bit confused because it seems like you are saying the exact opposite in your reply at the bottom of that thread. There you argue that the difference between push and assignment should be negligible, but if there is a difference, assignment is faster, and push is more memory efficient.

      That was in 2007. Have your views changed? Has Perl changed? Is there further background information you could give that would help reconcile the post above with your reply on that thread?

        I'm also baffled by this:

        $x = ['20110106', map( rand( 1e5 ), 1..4 ), int( rand 1000 ) ];;
        print total_size $x;;
        440
        @x = @{ $x };;
        print total_size \@x;;
        440
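        (That is, a fresh copy of the same six elements made by a single list assignment reports exactly the same 440 bytes, so how the array was populated does not seem to account for the overhead.)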

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        My post in that thread discusses the efficiency of building an array. It doesn't discuss the size of the resulting array.
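        For anyone who wants to check this on their own perl, a minimal comparison (illustrative only, not from either thread) is to build the same six-element row both ways and see what Devel::Size reports:

        use strict;
        use warnings;
        use Devel::Size qw[ total_size ];

        my @row = ( '20110106', map( rand( 1e5 ), 1 .. 4 ), int rand 1000 );

        my @assigned = @row;            # built in one list assignment

        my @pushed;
        push @pushed, $_ for @row;      # built one element at a time

        print 'assigned: ', total_size( \@assigned ), "\n";
        print 'pushed:   ', total_size( \@pushed ), "\n";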