gone2015 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use Devel::Size to get the size of a large hash -- 5 million or so entries, where the keys are random 32-character strings plus a trailing "\n" and the values are all undef.

If I run the code below, which builds a hash of just 1 million entries, I get:

                   :  Start   : 1st Read : undef    : 2nd Read :
    Hash size      :   0.0 MB :  71.9 MB :   0.0 MB :  71.9 MB :
    Total hash     :   0.0 MB :  94.8 MB :   0.0 MB :  94.8 MB :
                   :          :          :          :          :
    Virtual Memory : 100.9 MB : 391.1 MB : 383.1 MB : 399.1 MB :
    Resident       :   2.8 MB : 292.9 MB : 284.9 MB : 300.9 MB :
So... after the 1st read of the 31.5 MB file, the hash is apparently 94.8 MB (71.9 MB given by Devel::Size::size(\%hash), plus 24 * 1E6 for the values, each undef entry being 24 bytes). This is odd -- 94.8 MB of hash appears to require ~ 290 MB of memory ? I tried commenting out the assignment $hash{$_} = undef, and the loop consumed no memory at all -- so whatever is going on, it's to do with the hash !
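
(For reference, here is the arithmetic behind that 94.8 MB figure -- a rough sketch only, on the assumption that size(\%hash) counts the hash structure and keys but not the undef value SVs:)

    # Rough sketch of the 94.8 MB arithmetic, using the sizes reported above.
    # Assumption: size(\%hash) counts the hash structure and keys, but not the
    # undef value SVs, which are therefore added by hand at 24 bytes apiece.
    my $hash_struct = 71.9 * 2**20 ;        # as reported by Devel::Size::size(\%hash)
    my $values      = 24 * 1_000_000 ;      # one 24-byte undef SV per entry
    printf "total ~ %.1f MB\n", ($hash_struct + $values) / 2**20 ;   # prints ~94.8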

When I undef the hash (undef %hash), the memory footprint reduces by just 8 MB (391.1 MB -> 383.1 MB). I cannot tell how much free space Perl is holding internally, so this may or may not be reasonable.

When the file is read the 2nd time, the memory footprint grows to a bit more than it was after the 1st read. I'm not sure what to make of that.

Finally, if I comment out the call to Devel::Size::size() the System Monitor tells me:

                   :  Start   : 1st Read : undef    : 2nd Read :
    Virtual Memory : 100.9 MB : 270.0 MB : 262.0 MB : 277.9 MB :
    Resident       :   2.8 MB : 171.8 MB : 163.8 MB : 179.7 MB :
which is similar, except that the footprint is some 120 MB less !!! I don't know what that tells me about the usefulness of Devel::Size ??

I'm enquiring, because once I get to 5 million such strings in the hash, the footprint has grown to 1.0GB or so, and my machine is thrashing when I try to use the hash :-( (and Devel::Size::size() takes longer and longer...)

Help !! (OK, I could go and get some more memory, 1G is not a lot in today's market...)

For completeness: perl, v5.10.0 built for x86_64-linux-thread-multi, and Linux 2.6.25.14-108.fc9.x86_64 #1 SMP Mon Aug 4 13:46:35 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

use strict ;
use warnings ;

use Devel::Size () ;

my %hash = () ;
$hash{'dummy'} = undef ;
printf STDERR "Single entry requires: %d Bytes\n", bytes($hash{dummy}) ;

read_it() ;

undef %hash ;
show(0, \%hash) ;
wait_for_it('Just "undef"ed the hash') ;

read_it() ;

sub read_it {
  open my $FH, "<", "hash.txt" or die "hash.txt: $!" ;
  wait_for_it('About to read') ;
  my $e = 0 ;
  my $c = show($e, \%hash) ;
  while (<$FH>) {
    $hash{$_} = undef ;                   # key is the whole line, "\n" included
    $e++ ;
    $c = ($c - 1) || show($e, \%hash) ;   # report every 50,000 entries
  } ;
  wait_for_it('Finished Reading') ;
} ;

sub show {
  my ($e, $rh) = @_ ;
  printf STDERR "%8d entries: %3.1fM Bytes\n", $e, mbytes($rh) ;
  return 50_000 ;
} ;

sub mbytes {
  my ($r) = @_ ;
  return bytes($r) / (1024 * 1024) ;
} ;

sub bytes {
  my ($r) = @_ ;
  return Devel::Size::size($r) ;
} ;

sub wait_for_it {
  print STDERR "$_[0]..." ;               # pause so memory can be checked in the System Monitor
  my $ch = '' ;
  while ($ch !~ m/\./) { sysread STDIN, $ch, 1 ; } ;
} ;

Re: Completeness of Devel::Size answers
by BrowserUk (Patriarch) on Sep 22, 2008 at 23:48 UTC

    The problem is that Devel::Size itself uses a substantial amount of memory. Internally, it builds a hash of all the SVs (scalars) it encounters whilst traversing the structure being sized. It has to do this in order to detect circular references that would otherwise cause it never to return.

    For example, without circular reference detection, the following code would never return:

    use Devel::Size qw[ total_size ];

    my $scalar;
    $scalar = \$scalar;            # a circular reference

    print total_size( $scalar );

    Without Devel::Size, your 5e6-key hash with 33-char keys and undef values will consume a little less than 600MB.
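
    As a rough back-of-envelope check of that figure (the 120 bytes per entry, key included, is an assumed round number for a 64-bit perl rather than a measured value):

    my $entries   = 5_000_000;
    my $per_entry = 120;        # assumed: 33-byte key plus hash-entry, SV and bucket overhead
    printf "~%d MB\n", $entries * $per_entry / 2**20;   # ~572 MB -- a little under 600MB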



      OK... If I run the code with Devel::Size::size it says the 1 million entry hash requires 94.8 M. If I run with the calls to Devel::Size::size commented out, the VM still grows from ~100M to ~270M -- nearly twice what's expected :-(

      For the full 5 million, VM grows to 1.0G, so the hash appears to need 900M -- which is a bit more than 188 bytes per entry... where Devel::Size::size suggested 100 ??

      Mind you, I am assuming that all the increase in VM size is down to the hash -- which I think it is, because if I comment out the assignment to the hash, the VM doesn't change size.

      Thanks.

        Mind you, I am assuming that all the increase in VM size is down to the hash

        If you could inspect the memory, you would probably find that a substantial portion of the total memory allocated to the process is actually free for re-use once the hash has been built.

        This is because, as the hash grows and fills towards the capacity of its current allocation, it reaches a point where Perl decides it is time to double the number of buckets. In effect, a new hash is allocated with twice the capacity of the old one, and the contents of the old one are copied into the new before any new keys are added. Once the copy is complete, the memory for the old copy is freed back to the process pool (but not to the system) for reuse. So whilst it may take (say) 600MB to hold the completely built hash, there is a brief point in time where two copies of the hash-so-far are required.

        If you have the ability to monitor the process memory usage as the hash is built, you'll see the graph of the memory allocated versus time looks something like this:

                           _________________
                          |
                          |
                          |
                          |
                          |
                          |
                          |
                  ________|
                 |
                 |
                 |
             ____|
            |
          __|
         |

        Where both the vertical jumps (memory allocated) and the horizontal runs (time between reallocs) double. It is an effective strategy for performance, but it can mean that a substantial portion of the memory allocated to the last-but-one incarnation of the hash doesn't get re-used (by the hash). However, that memory is available for use by the rest of your program.
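
        If you want to watch the doubling happen, a minimal sketch along these lines will show it (on a perl of that vintage, a non-empty hash in scalar context reports used/allocated buckets; newer perls just report the key count; the key format here only mimics your 33-byte lines):

        use strict;
        use warnings;

        my %h;
        for my $i ( 1 .. 1_000_000 ) {
            $h{ sprintf "%032d\n", $i } = undef;          # 33-byte key (same length as yours)
            printf "%8d keys -> buckets: %s\n", $i, scalar(%h)
                if ( $i & ($i - 1) ) == 0;                # report at powers of two
        }

        # Pre-extending the hash with  keys(%h) = 5_000_000;  before the load
        # would avoid the intermediate grow-and-copy cycles altogether.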

        One thing that may allow you to reduce the overall memory consumption of your program is to ensure that the big hash gets built first. For example, you suggest that you've already loaded/generated 100MB of data in memory before you build this hash. If you delayed the creation/loading of that data until after the hash is constructed, you may find that there is enough memory (allocated during the construction of the hash, but freed back to the pool before it is complete) to accommodate most or all of that 100MB without the need to request more from the OS.

        If you're banging your head against the limits of your machine, that may just be enough to stop you going into swapping. Worth a try at least if your program logic can accommodate that change.


      You can detect circles without all the bookkeeping overhead, but you wouldn't be able to calculate the total size of a structure containing circular references (you may end up counting some elements multiple times, and you can't be sure you traversed the whole structure before you detected a circle).
      The problem is that Devel::Size itself uses a substantial amount of memory. Internally, it builds a hash of all the SVs (scalars) it encounters whilst traversing the structure being sized. It has to do this in order to detect circular references that would otherwise cause it never to return.

      This strikes me as pessimistic. With two passes, the first could collect all references and avoid following repeats -- avoiding loops and avoiding counting things twice. The second pass could generate the size count.

      Actually, AFAICS, the only reason you need two passes is because with ref:SCALAR you cannot otherwise tell whether the SCALAR is internal or external, and, if internal, whether it's already been counted. My guess is that ref:SCALAR is relatively rare...

      ... perhaps the trick is to proceed optimistically, constructing the size during pass 1. At the end, if there are any ref:SCALAR entries, one would need to run a half-pass to rescan the structure, looking for any referred-to internal SCALARs, up to where they are first referred to.

      ... I can see that for some structures one and a half passes would be more expensive than the current approach. However, at least the penalty is paid where things are difficult. The current stuff falls over if the data is large, however simple.

      Rats. Something in the back of my mind is telling me that string value SVs may implicitly refer to the string... But then there would be a ref-count > 1... which could be picked up in Pass 1...

Re: Completeness of Devel::Size answers
by roboticus (Chancellor) on Sep 23, 2008 at 11:18 UTC
    oshalla:

    Others have covered your question. I just wanted to offer a comment:

    The hash is a great general-purpose data structure, but it's not perfect for everything. In cases where you start pushing the limits, you need to consider alternatives. Specifically, in this case, think about your data structure: work out the actual requirements and see how else you might meet them. With a tweak to your requirements or your algorithm, you might find a way to make an array do the job, which could save you a good bit of memory. Or perhaps you might want to tie the hash to a DBM file or some such. And since your application is only interested in the keys (the values are all undef), it may be satisfied by a sorted array, or even by one large delimited string that you search through.
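
    For example, here is a minimal sketch of the "large delimited string" idea: with fixed-length 33-byte keys (32 characters plus the newline) the whole 5-million-key set packs into one sorted string of roughly 160MB, and membership can be tested with a binary search. (The file name comes from the original post; the has_key() helper is just an illustration.)

    use strict;
    use warnings;

    my $RECLEN = 33;                               # 32-char key plus its "\n"

    open my $FH, '<', 'hash.txt' or die "hash.txt: $!";
    my $keys = join '', sort <$FH>;                # one big sorted string, ~160MB
    close $FH;

    sub has_key {
        my ($want) = @_;                           # $want is a full 33-byte record
        my ($lo, $hi) = (0, length($keys) / $RECLEN);
        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            my $cmp = substr($keys, $mid * $RECLEN, $RECLEN) cmp $want;
            return 1 if $cmp == 0;
            if ($cmp < 0) { $lo = $mid + 1 }
            else          { $hi = $mid }
        }
        return 0;
    }

    Sorting 5 million lines in memory has its own transient cost, of course; pre-sorting the file with sort(1) and just slurping it would avoid that.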

    ...roboticus