jbisbee has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

Attention: the corrupted storable file I have was created under perl 5.6.0 / Storable 1.013 and reproduced under perl 5.6.2 / Storable 2.13. If you don't have a 5.6 install handy you won't be able to reproduce the issue, because the storable file isn't binary compatible with 5.8.

I have an issue that shows up every couple of days. A storable file (it happens to be a user profile) gets corrupted somehow, and when I attempt to retrieve it I get an "Out of memory!" error (and I have _plenty_ of memory to spare for the small frozen structure).

This really sucks because it's under mod_perl, and it actually kills an httpd process each time the user with the "bad" profile attempts to visit our site.

I've filed a bug against Storable and figured I'd see if anyone has a minute to take it further than my limited C/XS skills could take me.

Here is the text from my bug report:

You can get the corrupt storable file, out-of-memory.storable, attached to the bug.

I've verified this issue with both perl 5.6.0 / Storable 1.013 and perl 5.6.2 / Storable 2.13. I looked at the Storable.xs code in 2.13, turned on the DEBUGME flag, and reran my test retrieve code (test.pl):

    #!/usr/bin/perl
    use Storable qw(retrieve);
    $Storable::DEBUGME = 1;
    retrieve(shift);

Then I ran it:

    ./test.pl out-of-memory.storable

and got this output (truncated to show the error):

    ** extending kbuf to -184127888 bytes (had 128)
    Out of memory!

I tracked it down to retrieve_hash() and this code:

    KBUFCHK((STRLEN)size);  /* Grow hash key read pool if needed */

Storables that trigger this bug get created every couple of days, and because these are profiles and we have no way of trapping the exception, we have users that kill httpd processes each time they attempt to load our site, until we remove the offending storable file. I've attached the 'out-of-memory.storable' file so you can easily reproduce this problem. From my basic knowledge, I'm wondering if we could just ASSERT, when attempting to grow the hash key read pool, that the requested size is a positive integer. (I stress basic) :)
I'm just looking for the "proper" way to throw an exception in this case so I can trap it in an eval, rather than having a click-happy user take down httpd process after httpd process until we remove the offending profile.
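
Ideally I'd like to be able to write something like the following and have a corrupt profile turn into a trappable die rather than a dead child. (This is just a sketch of the pattern I'm after; $profile_file is a stand-in for wherever the profile lives.)

    use Storable qw(retrieve);

    # What I'd like to work: a corrupt profile becomes a trappable
    # exception instead of an untrappable "Out of memory!" abort.
    my $profile = eval { retrieve($profile_file) };
    unless ($profile) {
        warn "corrupt profile, starting fresh: $@";
        unlink $profile_file;   # or move it aside for later analysis
        $profile = {};          # fall back to an empty profile
    }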

-biz-


Re: Small Corrupted Storable Retrieve Dies With an Invalid "Out of memory!"
by tachyon (Chancellor) on Oct 27, 2004 at 06:16 UTC

    Well, I don't have the full answer, but I can fill in the picture. As it turns out it is not a bug in the READING of the file per se (although that is when it crashes); rather, it is almost certainly a bug in the WRITING of the storable file. The problem actually occurs in the RLEN macro, immediately above the KBUFCHK. RLEN reads an integer from the Storable file that specifies the size of the next data chunk to be read. Instead of some reasonable figure like N bytes, corruption of the file leads to a pseudorandom 4-byte int being read. This value could be anything from 0 to 4 billion. If it exceeds the default 128-byte buffer, KBUFCHK tries to malloc the required space. Although you have __plenty__ of memory, you don't have 4GB, which is what it is trying to malloc. Thus the problem is not really on the reading side per se; the data was actually written corruptly. At least that is how it seems. The main debugging problem is that the precise original data has been lost.

    I applied the following debugging patch:

        94,95d93
        < #define DEBUGME
        <
        447d444
        <     TRACEME(("** extending kbuf to %u bytes (had %u)", (unsigned int)x+1, ksiz)); \
        896d892
        <     TRACEME(("1 RLEN x=%u", (unsigned int) x)); \
        898,899c894
        <     TRACEME(("2 RLEN x=%u", (unsigned int) x)); \
        <     if (!cxt->fio) { \
        ---
        >     if (!cxt->fio) \
        901,902d895
        <     TRACEME(("3 RLEN x=%u", (unsigned int) x)); \
        <     } \
        905d897
        <     TRACEME(("4 RLEN x=%u", (unsigned int) x)); \
        908d899
        <     TRACEME(("5 RLEN x=%u", (unsigned int) x)); \
        4815d4805
        <     if ( (unsigned)size > 1000000 ) Perl_croak(aTHX_ "Bugger!\n");

    Note that you can catch that Perl_croak, BTW, but it will leak memory as I am not freeing anything. The number is not negative either (it is a large unsigned int). More on croaking in a second.

    With the patch applied, the end of the error trace shows RLEN returning a huge value just before the croak fires. All that basically demonstrates is that RLEN is supposed to return the length of the next bit of data, and that at the point where it chokes it returns a rather large integer. It dies out of memory because it tries to malloc 4GB. Anyway, the integer 4110839407 is 0xF5066E6F.

    Looking at the datafile with a hex editor we see:

    00000FE0: 05 00 00 00 63 6F 75 6E 74 05 05 00 00 00 66 69  ....count.....fi
    00000FF0: 72 73 74 07 00 00 00 73 65 73 73 69 6F 6E 06 F5  rst....session..
                                   s  e  s  s  i  o  n  ^  ^
    00001000: 73 73 69 6F 6E 06 F5 00 00 00 0D 00 00 00 73 65  ssion.........se
              s  s  i  o  n  ^  ^
    00001010: 73 73 69 6F 6E 5F 63 6F 75 6E 74 0A 01 4C 04 00  ssion_count..L..

    NOTE >>>>> Offset of the error is 4096 bytes.

    The data 6F 6E 06 F5 we see is no great surprise, as the RLEN macro is essentially just doing an fread (#define PerlIO_read(x,y,z) fread(y,1,z,x)) due to the author's insane love of macros. Anyway, these 4 bytes of data are where our 4 billion comes from.
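
    You can check the arithmetic from Perl itself; unpack's "V" template reads a little-endian unsigned 32-bit integer, which is exactly how RLEN interprets those bytes:

        #!/usr/bin/perl
        # Decode the four corrupt bytes, in file order, the way RLEN does.
        my $len = unpack "V", "\x6F\x6E\x06\xF5";
        printf "%u (0x%X)\n", $len, $len;
        # prints: 4110839407 (0xF5066E6F)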

    If you look at the previous keys you see we should be reading a small integer, i.e. 0X 00 00 00 (we are little-endian on x86, so you read it right to left). You will also note that the problem occurs at an offset of 4096 bytes. You can also note the ..ssion, which looks suspiciously like an offset error, even more so when you note that following ..ssion we have 06 F5 00 00, which is a far more typical pattern. Also, we have the same two bytes 06 F5 after both session and ..ssion. In fact, if you delete the 7 ..ssion bytes you don't error out there anymore (you could equally delete the 7 bytes of session). It crashes again another K or so into the file, but the data it is extracting all looks valid.

    So that is my hint. The problem is that it will probably be a very specific data set that causes what I expect is a buffer issue.

    If you apply the suggested single-line Perl_croak patch (between the RLEN and KBUFCHK calls, around line 4810 in Storable.xs):

          RLEN(size);             /* Get key size */
    +     if ( (unsigned)size > 1000000 ) Perl_croak(aTHX_ "Bugger!\n");
          KBUFCHK((STRLEN)size);  /* Grow hash key read pool if needed */

    then in your application you could (after you have stored the data) try a retrieve. You can catch that Perl_croak with an eval, so it won't kill your mod_perl child. If it does croak, you can capture the raw data structure that causes the problem. This immediately identifies a problem profile that needs fixing, but more importantly it gives you a test-case data structure. It may be possible to debug this without a test case, but it would be a lot easier with at least one problem data structure to validate the patch against.

        store \%profile, $file;
        eval { retrieve($file) };
        if ( $@ ) {
            require Data::Dumper;
            warn "\n\n\nStorable Choked\n\n\n";
            warn Data::Dumper::Dumper(\%profile);
            print "Content-Type: text/html\n\nBugger!\n"; # ;-)
            exit 1;
        }

    cheers

    tachyon

      In your "patch" you used the following code
      if ( (unsigned)size > 1000000 ) Perl_croak(aTHX_ "Bugger!\n")
      I noticed that the author had a CROAK macro and wondered why you opted for Perl_croak, and what the difference was.
          /** Storable.xs 1.0.13 **/
          #define CROAK(x)  do { cxt->s_dirty = 1; croak x; } while (0)

          /** Storable.xs 2.13 **/
          #define CROAK(x)  STMT_START { cxt->s_dirty = 1; croak x; } STMT_END
      I understand that cxt->s_dirty is used to let Storable reuse the memory next time around; I'm just wondering what the difference between croak and Perl_croak is.

      -biz-

        I was too lazy to look at the syntax of yet another macro. Storable is just one big macro! It is probably better to do:

        if ( (unsigned)size > 1000000 ) CROAK(("Bugger!\n"));

        Note that you need double parentheses, as the call to croak inside the macro does not supply any: CROAK takes a single macro argument, so the inner parens carry the whole argument list through to croak. croak is a Perl alias for Perl_croak_nocontext, so the call is nearly identical, but with CROAK you are also setting the context to dirty; looking at the source, that is caught and cleaned up, so it should fix the leak.

        cheers

        tachyon

Re: Small Corrupted Storable Retrieve Dies With an Invalid "Out of memory!"
by tachyon (Chancellor) on Oct 27, 2004 at 13:58 UTC

    I have identified the errors in the file. There are 8 bytes that appear wrong: 7 added (duplicated) bytes and one missing byte, by my calculation.

        [root@devel3 Storable.patch]# cat /usr/bin/bdiff.pl
        #!/usr/bin/perl
        use Algorithm::Diff qw(traverse_sequences);
        my ( $f1, $f2 ) = @ARGV;
        open $fh1, $f1 or die $!;
        open $fh2, $f2 or die $!;
        local $/;
        my $seq1 = [split //, <$fh1>];
        my $seq2 = [split //, <$fh2>];
        my (@match, @d1, @d2) = ((),(),());
        traverse_sequences( $seq1, $seq2,
            {
                MATCH     => sub { push @match, $seq1->[$_[0]] },
                DISCARD_A => sub { push @d1, printable2($_[0])." : - ".printable($seq1->[$_[0]]) },
                DISCARD_B => sub { push @d2, printable2($_[1])." : + ".printable($seq2->[$_[1]]) },
            });
        sub printable  { sprintf "%s 0x%02X", ord($_[0])>32 ? $_[0] : '.', ord $_[0] }
        sub printable2 { sprintf "%08d 0x%08X", $_[0], $_[0] }
        print join "\n", @d1, @d2, "\n";
        [root@devel3 Storable.patch]# bdiff.pl store.original store.patch
        00004096 0x00001000 : - s 0x73
        00004097 0x00001001 : - s 0x73
        00004098 0x00001002 : - i 0x69
        00004099 0x00001003 : - o 0x6F
        00004100 0x00001004 : - n 0x6E
        00004101 0x00001005 : - . 0x06
        00004102 0x00001006 : - õ 0xF5
        00004944 0x00001350 : + . 0x05
        [root@devel3 Storable.patch]#

    So you need to remove the 7 bytes I originally suggested and then add 0x05 at offset 4944 in the *patched* file, i.e. that is the right offset only if you have already removed the 7 bytes previously noted. The hexedit should look like:

    00000FD8: 72 73 74 07 00 00 00 73 65 73 73 69 6F 6E 06 F5 00 00  latest......count.....fi
    00000FF0: 65 73 73 69 6F 6E 5F 63 6F 75 6E 74 0A 01 4C 04 00 00  rst....session.........s
    00001008: 03 0E 00 00 00 04 02 00 00 00 00 14 00 00 00 61 63 74  ession_count..L....type.
    00001338: 05 05 00 00 00 6C 61 62 65 6C 0A 0B 78 78 78 78 78 78  XX<BR>XXXXXXXX XXXXXXXXX
    00001350: 00 00 00 69 64 0A 03 4E 46 4C 0B 00 00 00 6C 65 61 67  .....label..xxxxxxxxxxx.
    00001368: 72 0A 1B 66 6F 6F 74 62 61 6C 6C 3A 78 78 78 78 78 78  ...id..NFL....league_abb

    With those edits in place we can resurrect the original data (I think ;-). However, it is fairly obvious that you have edited the binary to preserve the privacy of the user. I would just mention that you may well have corrupted the corruption :-)
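
    If you would rather script the surgery than drive a hex editor, it is just a couple of substr calls. A minimal sketch (the file names are illustrative; the offsets are the ones derived above):

        #!/usr/bin/perl
        # Byte surgery: drop the 7 duplicated bytes at offset 4096, then
        # insert the missing 0x05 at offset 4944 of the *patched* data.
        use strict;
        local $/;                            # slurp the whole file
        open my $in, '<', 'out-of-memory.storable' or die $!;
        binmode $in;
        my $data = <$in>;
        substr($data, 4096, 7, '');          # remove the stray "ssion\x06\xF5"
        substr($data, 4944, 0, "\x05");      # add back the missing byte
        open my $out, '>', 'resurrected.storable' or die $!;
        binmode $out;
        print $out $data;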

      I've applied your binary patch and resurrected the original storable file, so I know my privacy-string edits (all the 'xxx..' and '000..') didn't have any effect on the edited storable that I posted.

      I've pulled Storable 1.0.13 off of backpan and am currently preparing to apply your "Bugger" patch. Thank you very much for your in-depth analysis, and I'll be sure to follow up if I can record a dump of the data structure that caused it in the first place (that is, if I can reproduce the corrupt storable over multiple attempts with Storable 1.0.13 and 2.13).

      I'll follow up on this thread when/if I can determine the actual cause. :P

      Thanks again!

      -biz-

        The interesting thing is that with the edits in place the restored data structure stores/retrieves just fine! Forgot to mention that. There must be an extra byte or two in the original that is triggering the behaviour, but where... and what...

        Without a test-case data structure it is like looking for a very small haystack in an enormous pile of needles! Did I mention yet that I do not much like the author's macro mania ;-) As soon as you have a test case (and the croak will get you one, provided you test and save it), it should be relatively easy to track down the issue(s).

        Storable is a module I use all the time, as do many others. We can all do without it randomly crashing. Good luck.

        cheers

        tachyon

        It was very thoughtful of you to edit the data. ++ for that. I presume you worked your way through it with a hex editor, as I don't see how you could have done it with a regex. If you did manage to do it with REs, please explain ;-)

      Wonderful analysis, tachyon. You are an Example and Inspiration to Us All!
        Hey, I saw your nick and figured it was a football reference. I actually grew up in Omaha and attended school at the University of Nebraska at Omaha.

        Go Big Red! :)

        -biz-

Re: Small Corrupted Storable Retrieve Dies With an Invalid "Out of memory!"
by mattriff (Chaplain) on Oct 27, 2004 at 16:53 UTC
    I feel silly following up tachyon's great analysis with something so mundane, but...

    On the subject of why the file gets corrupted, I recommend looking for file-locking bugs in the code that's calling Storable::store.

    As it happens, I ran into this exact same problem last week, and it hasn't recurred since I fixed my file-locking mechanism to actually, you know, prevent concurrent writes. ;)

    (My problems occurred using Perl 5.8.3 and Storable 2.13 on FreeBSD 4.10-STABLE, btw.)
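
    For what it's worth, the shape of the fix was roughly this; a sketch only, not my production code (the lockfile name and the atomic-rename step are illustrative):

        use Fcntl qw(:flock);
        use Storable qw(store);

        # Serialize writers with an exclusive lock, and never let a
        # reader see a half-written file: write to a temp file, then
        # rename it into place (atomic on the same filesystem).
        sub store_locked {
            my ($ref, $file) = @_;
            open my $lock, '>', "$file.lock" or die "lock: $!";
            flock $lock, LOCK_EX or die "flock: $!";
            store $ref, "$file.tmp";
            rename "$file.tmp", $file or die "rename: $!";
            close $lock;                     # releases the lock
        }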

      Hi Matt,

      Yup, you are quite possibly right about the cause. As noted, it looks like a buffer and offset issue. What I forgot to mention is that the restored data structure, after the 8 bytes of edits, stores and retrieves just fine. It could be that there are no missing bytes (as I speculated above) and that it is simply a locking issue, as you suggest. Catching the croak and the data structure will prove it either way. It will be interesting to find out.

      cheers

      tachyon