tj_thompson has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I've run into a bug that I'm having a very difficult time locating, and I'm hoping you can provide some guidance.

I have a test that is producing the following error:

*** glibc detected *** perl: corrupted double-linked list

I suspect an XS module I'm using is doing something bad. I've done a fair bit of research into this and tried the following approaches to isolating the bug:

1) set the MALLOC_CHECK_ environment variable. There was some success here as the error changed to this:

*** glibc detected *** perl: realloc(): invalid pointer

2) I tried to run this under gdb to hopefully get a stack trace so I can find the offending function. I have done this in the past to debug seg faults successfully:

> gdb perl
(gdb) set args <args>
(gdb) run

This time I got no additional information. The invalid pointer error is still thrown (assuming MALLOC_CHECK_ is set, otherwise it's the linked list error) with no additional stack trace information. I'm assuming this is due to a lack of debugging symbols.

3) I have tried to run this under valgrind. Initially, I got a valgrind error that VG_N_SEGMENTS was too low. I recompiled valgrind with a higher limit here. The code now passes entirely every time I run it under valgrind.

So I seem to have no additional useful information from my limited C debugging skill set.

Can any kind monk offer some guidance on what next steps I might take to debug this issue? I would offer code but it's far too complex without being able to localize the issue, so I apologize on that front.

Thanks in advance for your time, it's much appreciated :)

EDIT: I suppose some version information would be good. This is red hat enterprise linux 10 and perl 5.12.2. I can give any other information that would be useful.

EDIT2: Make that SUSE Linux Enterprise Server 10 (x86_64). Not sure where I got red hat from as the last time I did anything with that was years ago in school.

############################################################################

I managed to resolve the issue and thought I'd post the details here just in case they may be useful to someone else.

As it turned out, the problem was my assumption that an XS module was at fault. I was basing this on the error and on the fact that the problem was occurring pretty close to some of my own XS code. It turns out the problem was a simple Perl-side issue.

The clue that I missed was that the code was taking significantly longer to execute than it should have and that I was getting some output that mentioned a memory wrap after the error from above.

The real issue was a bad (very large) loop variable outside the XS code that was causing the process to allocate all its memory, resulting in perl throwing the memory error above.
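
For anyone who hits the same symptom: if the extra output was Perl's "panic: memory wrap" (an assumption, since the exact text isn't quoted here), that message generally points at an allocation request whose size is impossibly large or has wrapped around. The sketch below is not Perl's actual implementation. It is a minimal C illustration of the same idea, using a hypothetical helper `alloc_checked` and a caller-chosen cap `max_bytes`, of how a runaway count can be rejected before any memory is requested.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, for illustration only: allocate room for `count`
 * elements of `elem_size` bytes each. Refuses requests whose byte total
 * would overflow size_t or exceed the caller-chosen cap `max_bytes`.
 * Returns NULL on refusal or on allocation failure. */
void *alloc_checked(size_t count, size_t elem_size, size_t max_bytes)
{
    /* count * elem_size would wrap around size_t */
    if (elem_size != 0 && count > SIZE_MAX / elem_size)
        return NULL;

    size_t total = count * elem_size;

    /* A count this large is almost certainly a bug: fail fast. */
    if (total > max_bytes)
        return NULL;

    /* Request at least one byte so a zero-sized request still yields
     * a pointer that can be passed to free(). */
    return malloc(total ? total : 1);
}
```

With a cap like this in place, a bad loop bound shows up as an immediate, explainable failure instead of a process that slowly eats all available memory.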

I am unclear on why the code would complete under valgrind. I do know that once I realized the issue and started watching memory, the process was consuming upwards of 50G of memory. I am also unclear why I didn't get any point of failure information from gdb...although once you're out of memory I suppose all bets are off.

Thanks to all for the great insights, especially the information about MALLOC_CHECK_ values and the valgrind options. I appreciate it!

Re: *** glibc detected *** perl: corrupted double-linked list
by hippo (Archbishop) on Nov 09, 2015 at 21:00 UTC
    This is red hat enterprise linux 10

    I don't know what you meant to type but I do know that it wasn't that. :-)

    You've done all the things I would have attempted and more. The only suggestion to offer is the one with most work - keep stripping down your code to a smaller and smaller example until you can isolate the flaw. Good luck.

      Ah man, you're right. Not sure what I was thinking at the time. It's SUSE Linux Enterprise Server 10 :)
Re: *** glibc detected *** perl: corrupted double-linked list
by Anonymous Monk on Nov 09, 2015 at 22:28 UTC

    What value did you give to MALLOC_CHECK_? Try assigning e.g. MALLOC_CHECK_=3, so that SIGABRT is raised. You should then be able to get a stack trace in gdb (with "where") regardless of debugging symbols.

    Do I read this correctly: in valgrind your program runs successfully to completion? In other words, a heisenbug? Valgrind has some options to play with. You can try --malloc-fill=.. and --free-fill=.. to poison the chunks.

    Are you sure your modules are correctly linked? Valgrind by default only hooks dynamically linked calls (to libc.so).

    Do you make use of threads?

      I had no idea you could assign differing values to MALLOC_CHECK_. I was using 1. I'll have to do some more digging there for future reference.

      Yeah, oddly enough the test would complete under valgrind. I haven't seen that before, but I've very rarely tried to use valgrind. I'll check out the options you mention.

      I have since resolved the issue and will post the less than flattering details of my debugging above in the hope it may help someone in the future :)

Re: *** glibc detected *** perl: corrupted double-linked list
by u65 (Chaplain) on Nov 09, 2015 at 20:54 UTC

    Can you give us a list of the XS modules you use? Have you looked at the XS modules on CPAN to see if there are existing bugs causing similar errors?

      I definitely went looking for similar bugs. I found a few, but they didn't seem to be similar to what I was seeing. As far as XS goes, I have some custom XS in there and this was breaking near the point where I start to use it, so I assumed it was my own code causing the problem.

      It turned out the issue wasn't XS and my assumption there led me down the wrong rabbit hole quite a ways before I realized my mistake. I updated the post above.

        Glad you found the problem! Thanks for the update. -Tom

Re: (resolved) *** glibc detected *** perl: corrupted double-linked list
by Anonymous Monk on Nov 11, 2015 at 12:20 UTC

    Although this topic is now marked as resolved, I'd like to take the opportunity to remind the attentive reader of one thing.

    It is generally not considered acceptable for well-behaved programs to crumple under load with symptoms of memory corruption. Indeed, grabbing resources in an endless loop can serve as a useful stress-test for validating the robustness of your utility.

    One common cause of such errors is not testing the return value of malloc()/realloc(), and blindly assuming they always succeed. See also: canthappen.pdf.
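
    As a rough illustration of that last point, here is a minimal C sketch of growing a buffer with the result of realloc() checked before it is used. The `struct buf` type and the `buf_append` function are hypothetical names invented for this example; they are not taken from any module discussed in this thread.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer, for illustration only. */
struct buf {
    char  *data;
    size_t len;
    size_t cap;
};

/* Append n bytes from src. Returns 0 on success, -1 if the memory could
 * not be obtained. On failure the buffer is left unchanged and valid. */
int buf_append(struct buf *b, const char *src, size_t n)
{
    if (n > (size_t)-1 - b->len)
        return -1;                      /* len + n would overflow */

    size_t need = b->len + n;
    if (need > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16;
        while (new_cap < need) {
            if (new_cap > (size_t)-1 / 2) {
                new_cap = need;         /* doubling would overflow */
                break;
            }
            new_cap *= 2;
        }

        /* Keep the old pointer until realloc is known to have succeeded.
         * Assigning straight to b->data would lose it on failure. */
        char *tmp = realloc(b->data, new_cap);
        if (tmp == NULL)
            return -1;

        b->data = tmp;
        b->cap  = new_cap;
    }

    if (n != 0)
        memcpy(b->data + b->len, src, n);
    b->len = need;
    return 0;
}
```

    The caller decides what to do with a -1 return (report it, free something, exit cleanly). The point is only that the failure is noticed at the call site instead of surfacing later as heap corruption.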