bernanke01 has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

This is a follow up to an earlier query, where I ran out of memory trying to output a very large hash. After somebody pointed me at autovivification as a potential culprit, I started narrowing down to a minimal test-case. I believe that this test-case rules out autovivification, but I'm not sure what it does suggest.

Here is a script that works correctly on my machine (AIX 5.2, Perl 5.8.7). Note that the code prints to screen, so call it as perl -w test.pl > test.txt.

use strict;

my %hash1;
my %hash2;
my @data = (0 .. 5000);

for ( my $i = 0; $i < scalar(@data); $i++ ) {
    for ( my $j = $i; $j < scalar(@data); $j++ ) {
        $hash1{ $data[$i] }{ $data[$j] } = 0;
        $hash2{ $data[$i] }{ $data[$j] } = 0;
    }
}

while ( my ($key1, $val1) = each %hash1 ) {
    while ( my ($key2, $val2) = each %$val1 ) {
        print join( "\t", $key1, $key2,
            $hash1{$key1}{$key2}, $hash2{$key1}{$key2} ), "\n";
    }
}

Now here is a script that crashes with Out of memory! on my machine. The only difference I can see between the two is that the first uses only two hashes and the second three.

use strict;

my %hash1;
my %hash2;
my %hash3;
my @data = (0 .. 5000);

for ( my $i = 0; $i < scalar(@data); $i++ ) {
    for ( my $j = $i; $j < scalar(@data); $j++ ) {
        $hash1{ $data[$i] }{ $data[$j] } = 0;
        $hash2{ $data[$i] }{ $data[$j] } = 0;
        $hash3{ $data[$i] }{ $data[$j] } = 0;
    }
}

print "Finished building hashes\n";

while ( my ($key1, $val1) = each %hash1 ) {
    while ( my ($key2, $val2) = each %$val1 ) {
        print join( "\t", $key1, $key2,
            $hash1{$key1}{$key2}, $hash2{$key1}{$key2}, $hash3{$key1}{$key2} ), "\n";
    }
}

Note that I've verified that the crash occurs *after* all three hashes are populated. Further, if I delete hash keys after I've processed them the script runs to completion:

use strict;

my %hash1;
my %hash2;
my %hash3;
my @data = (0 .. 5000);

for ( my $i = 0; $i < scalar(@data); $i++ ) {
    for ( my $j = $i; $j < scalar(@data); $j++ ) {
        $hash1{ $data[$i] }{ $data[$j] } = 0;
        $hash2{ $data[$i] }{ $data[$j] } = 0;
        $hash3{ $data[$i] }{ $data[$j] } = 0;
    }
}

print "Finished building hashes\n";

while ( my ($key1, $val1) = each %hash1 ) {
    while ( my ($key2, $val2) = each %$val1 ) {
        print join( "\t", $key1, $key2,
            $hash1{$key1}{$key2}, $hash2{$key1}{$key2}, $hash3{$key1}{$key2} ), "\n";
    }
    delete $hash1{$key1};
    delete $hash2{$key1};
    delete $hash3{$key1};
}

So as far as I can tell, this shows that the error occurs when trying to output multiple large HoHs with identical keys simultaneously. I can't quite see why memory would be getting used in this process, though!

Replies are listed 'Best First'.
Re: Memory Error Printing Multiple Hashes Simultaneously
by BrowserUk (Patriarch) on Jan 31, 2006 at 01:33 UTC

    I concur. There is sustained and heavy memory allocation going on in the second loop for which I can see no apparent reason. There is no autovivification of hash elements occurring. You have discovered another memory leak, and the culprit is join(*).

    If you replace your second loop with

    while ( my( $key1, $val1 ) = each %hash1 ) {
        while ( my( $key2, $val2 ) = each %$val1 ) {
            print(
##              join( "\t",
                $key1, $key2,
                $hash1{ $key1 }{ $key2 },
                $hash2{ $key1 }{ $key2 },
                $hash3{ $key1 }{ $key2 }
            );
        }
    }

    The code runs to completion and shows zero memory growth during the second loop.

    The problem also exists in 5.6.2. Raise a perlbug.

    *Update: It's more complicated than just join. Seems it needs both join and a reference to a compound hash plus some unknown factor to cause the leak.

    If you remove the join, or the hash reference, or make either of the keys a constant, the leak does not occur!

    while ( my( $key1, $val1 ) = each %hash1 ) {
        while ( my( $key2, $val2 ) = each %$val1 ) {
            die 'Autovivify'
                unless exists $hash1{ $key1 }
                   and exists $hash1{ $key1 }{ $key2 };
            print( join "\t",
                $key1, $key2,
                $hash1{ $key1 }{ $key2 },
            );
        }
    }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thanks BrowserUk, I thought I was going crazy. I wonder if the extra factor might be the tabs themselves, as I can still get an out-of-memory error by replacing the join with manual tabs:

      use strict;

      my %hash1;
      my %hash2;
      my %hash3;
      my @data = (0 .. 5000);

      for ( my $i = 0; $i < scalar(@data); $i++ ) {
          for ( my $j = $i; $j < scalar(@data); $j++ ) {
              $hash1{ $data[$i] }{ $data[$j] } = 0;
              $hash2{ $data[$i] }{ $data[$j] } = 0;
              $hash3{ $data[$i] }{ $data[$j] } = 0;
          }
      }

      print "Finished building hashes\n";

      while ( my( $key1, $val1 ) = each %hash1 ) {
          while ( my( $key2, $val2 ) = each %$val1 ) {
              die 'Autovivify'
                  unless exists $hash1{ $key1 }
                     and exists $hash1{ $key1 }{ $key2 };
              print "$key1\t$key2\t$hash1{ $key1 }{ $key2 }\t$hash2{ $key1 }{ $key2 }\t$hash3{ $key1 }{ $key2 }\n";
          }
      }

        Well, I'm almost ashamed to tell you that I tried this :), but it doesn't matter what character you use in the concatenation, whether with join or interpolation into a string; it still causes the memory growth. I even tried calling a function to do the output, passing the values as a list and joining them inside the function, and still it occurs.

        I do have a work around for you.

        print map{ $_, "\t" }
            $key1, $key2,
            $hash1{ $key1 }{ $key2 },
            $hash2{ $key1 }{ $key2 },
            $hash3{ $key1 }{ $key2 };
        print "\n";

        Replace your print line with the above, and the memory growth will completely disappear. It isn't an exact replacement for the join, as you will get an extra tab at the end of the line. If that is a problem then you can go for manually interspersing the tabs:

        print $key1, "\t", $key2, "\t",
            $hash1{ $key1 }{ $key2 }, "\t",
            $hash2{ $key1 }{ $key2 }, "\t",
            $hash3{ $key1 }{ $key2 }, "\n";

        That both of these avoid the memory growth indicates that the problem comes from building the single output string, which is something, I guess; but this is bread-and-butter Perl code, and it would surely have been noticed between 5.6.2 and now if it were that simple.

        I have completely failed to reproduce the problem outside of two nested while loops accessing compound hashes using variables as keys. Every simplification I apply to reduce your code to a test case makes the problem disappear.

        Anyway, I hope the above insights will allow you to get on with solving your problem; but do please raise a perlbug and let the guys that understand this stuff have a go at a proper solution/explanation.


Re: Memory Error Printing Multiple Hashes Simultaneously
by bernanke01 (Beadle) on Jan 31, 2006 at 20:47 UTC

    And here is a speedy reply to the perlbug by Gisle Aas:

    This is not a memory leak. The reason perl allocates memory during printing is that it converts all the 0 values in the hashes to strings and stores these. This upgrades your SvIV structs containing 0 into the much larger SvPV structs containing both 0 and "0" (see [1]).

    --Gisle

    [1] http://search.cpan.org/src/GAAS/illguts-0.09/index.html#sviv

    I won't claim to fully understand, but I guess it means that Perl is changing the hash from storing simply integers to storing both integers & strings during printing. And this change occurs on the hash itself, increasing memory usage. Printing without string interpolation (or join) prevents this "upgrading" and thus prevents all the memory usage. I guess it makes sense, but I'm just really surprised no one had come across this before!
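    Gisle's explanation can be checked from within Perl itself using the core B module: the SVf_POK flag tells you whether a scalar carries a cached string buffer. A minimal sketch (my own illustration, not from the thread):

```perl
use strict;
use warnings;
use B qw( svref_2object SVf_POK );

my $n = 0;                                        # stored as a plain integer (SvIV)
my $before = svref_2object( \$n )->FLAGS & SVf_POK;

my $s = "$n";                                     # interpolation stringifies $n...
my $after  = svref_2object( \$n )->FLAGS & SVf_POK;

# ...and the string form is cached inside $n itself,
# upgrading it to the larger PVIV struct.
print $before ? "PV cached before\n" : "no PV before\n";
print $after  ? "PV cached after\n"  : "no PV after\n";
```

    Run across millions of hash values, that per-scalar upgrade is exactly the memory growth seen during the print loop.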

      One way to avoid this upgrade to strings is to use printf and ask for the numeric values of the hash values:
      printf("%s\t%s\t%d\t%d\n", $key1, $key2, $hash1{$key1}{$key2}, $hash2{$key1}{$key2});
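      To confirm that %d sidesteps the upgrade, the same flag check with the core B module can be applied after formatting (a sketch of mine, using sprintf since it shares printf's conversions):

```perl
use strict;
use warnings;
use B qw( svref_2object SVf_POK );

my $n = 0;                               # plain SvIV, as in the hashes above
my $line = sprintf "%d\n", $n;           # %d reads the integer slot directly

# No string buffer should have been cached in $n.
my $pok = svref_2object( \$n )->FLAGS & SVf_POK;
print $pok ? "upgraded to PVIV\n" : "still a plain IV\n";
```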

      Glad you got a speedy explanation. It seems so obvious once it is pointed out, but then no one else here spotted the cause either. :(

      It's very non-intuitive that the rates of memory growth you were seeing could be attributed to the simple act of stringifying previously numeric values. Once you know what to look for, it is easy to see how quickly the memory usage grows when this happens.

      A minimum of 70% growth for integers:

      c:\Perl\test>p1
      [0] Perl> use Devel::Size qw[ total_size ];;
      [0] Perl> @a = (0) x 1e6;;
      [0] Perl> print total_size( \@a );;
      20000056
      [0] Perl> $_.='' for @a;;
      [0] Perl> print total_size( \@a );;
      34000056

      And a whopping 300% for floats.

      c:\Perl\test>p1
      [0] Perl> use Devel::Size qw[ total_size ];;
      [0] Perl> @a = ( 1.0 ) x 1e6;;
      [0] Perl> print total_size( \@a );;
      24000056
      [0] Perl> $_.='' for @a;;
      [0] Perl> print total_size( \@a );;
      75000056

      Definitely one to remember if you are running close to the memory limits of your machine.


        Yup, and the actual use-case was floats (simplified down to integers for debugging). Thanks for all your help with this BrowserUk.

      Just to explain in more detail:

      N:\>perl -MDevel::Peek -e"my @x=(1..2); Dump($x[0]); $y=join '-',@x; Dump($x[0]);"
      SV = IV(0x182b3e4) at 0x2257e8
        REFCNT = 1
        FLAGS = (IOK,pIOK)
        IV = 1
      SV = PVIV(0x225ef4) at 0x2257e8
        REFCNT = 1
        FLAGS = (IOK,POK,pIOK,pPOK)
        IV = 1
        PV = 0x22d3fc "1"\0
        CUR = 1
        LEN = 2

      The first dump shows $x[0] when it's an SvIV. The second shows the same var after it has been stringified (which means it's been "upgraded" to an SvPVIV). Note that there are now additional slots in the SV that are populated, containing such things as the stringified version of the integer (PV), the current string length (CUR), and the size of the allocated string buffer (LEN).

      What this means is that if you want to avoid the problem simply use the form

      print join "\t", map { "$_" } @list;

      which will make the SV that gets upgraded be a temporary, leaving the original unchanged.

      My guess as to why it hasn't come up before is that it's probably unusual for such a large structure to contain only SvIVs. So under normal circumstances I'd guess that if an OOM error occurred, it occurred while building the structure itself and not when it was printed out.

      ---
      $world=~s/war/peace/g

        Ahh, cool. Devel::Peek is much less scary when it's applied to a problem you already understand a bit. Thank you!
Re: Memory Error Printing Multiple Hashes Simultaneously
by Fletch (Bishop) on Jan 30, 2006 at 23:56 UTC

    You're allocating 3 hashes with 25,000,000 elements each and don't see where memory would be getting used?

    /boggle

    I just ran this (the three hash version) and the process size got up over 1.5G of memory used (OS X 10.4.4, stock perl); I don't think it's out of the question that that much usage would hit up against either a soft or hard process size limit on an arbitrary AIX box. Check with the output of ulimit -a and/or your sysadmin.

      Oh, I realize that, but the thing is the hashes seem to get created okay; it's just during printing that the machine hits a memory problem. And deleting keys as I print them solves the issue: that's the puzzling part. Where in my nested while loops would memory be getting used? The printing gets about half-way through before crashing, so at that stage I can't see why additional RAM would get used (it's 32-bit perl, so while the box has more, the maximum addressable would be 4GB, I guess).

      Update: Also, it occurred to me that there should only be (5001 * 5002 / 2) = ~12.5 million elements in each hash, so not quite so bad as having three hashes of 25 million!
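      That count is easy to verify: the inner loop runs $j from $i up to the last index, so each hash holds a triangular number of elements. A quick check:

```perl
use strict;
use warnings;

my @data  = (0 .. 5000);
my $n     = scalar @data;          # 5001 distinct keys
my $pairs = $n * ($n + 1) / 2;     # pairs ($i, $j) with $j >= $i

print "$pairs\n";                  # 12507501, i.e. roughly 12.5 million per hash
```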