pango has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a rather complex POE program that segfaults every now and then. Debugging it has been difficult because, due to the nature of POE, I cannot reproduce the crash reliably.

For more context, this POE program ran fine on an older version of perl (around v5.10.1) on an older Gentoo Linux distribution. Recently, we were forced to upgrade to a newer environment: Oracle Linux 8 with perl v5.26.3. Granted, it's also a newer kernel and newer core libs.

This particular POE program uses Net::Curl and Net::Curl::Multi to handle many network requests and run tasks around them.

I am able to get a core dump from this segmentation fault (which happens anywhere between 10 minutes and 3 hours after starting the program), and the stack trace always looks like this:

Stack trace of thread 680482:
    #0  0x00007ffa7d510df8 Perl_csighandler (libperl.so.5.26)
    #1  0x00007ffa7d239d80 __restore_rt (libpthread.so.0)
    #2  0x00007ffa7c3b8b41 __poll (libc.so.6)
    #3  0x00007ffa7d017e79 __res_context_send (libresolv.so.2)
    #4  0x00007ffa7d0157cf __res_context_query (libresolv.so.2)
    #5  0x00007ffa7d015e76 __res_context_querydomain (libresolv.so.2)
    #6  0x00007ffa7d01646d __res_context_search (libresolv.so.2)
    #7  0x00007ffa70d6715e _nss_dns_gethostbyname4_r (libnss_dns.so.2)
    #8  0x00007ffa7c3acb9e gaih_inet.constprop.6 (libc.so.6)
    #9  0x00007ffa7c3ade5b getaddrinfo (libc.so.6)
    #10 0x00007ffa787fe38b Curl_getaddrinfo_ex (libcurl.so.4)
    #11 0x00007ffa78809383 getaddrinfo_thread (libcurl.so.4)
    #12 0x00007ffa78806a3f curl_thread_create_thunk (libcurl.so.4)
    #13 0x00007ffa7d22f1da start_thread (libpthread.so.0)
    #14 0x00007ffa7c2bf8d3 __clone (libc.so.6)

Stack trace of thread 608560:
    #0  0x00007ffa7c321139 __malloc_fork_unlock_parent (libc.so.6)
    #1  0x00007ffa7c38e33d __libc_fork (libc.so.6)
    #2  0x00007ffa7d583d82 Perl_pp_fork (libperl.so.5.26)
    #3  0x00007ffa7d528315 Perl_runops_standard (libperl.so.5.26)
    #4  0x00007ffa7d568941 S_docatch (libperl.so.5.26)
    #5  0x00007ffa7d528315 Perl_runops_standard (libperl.so.5.26)
    #6  0x00007ffa7d49ff2d Perl_call_sv (libperl.so.5.26)
    #7  0x00007ffa78e7914b poe_data_ev_dispatch_due (EPoll.so)
    #8  0x00007ffa78e78420 lp_loop_run (EPoll.so)
    #9  0x00007ffa7d5304a9 Perl_pp_entersub (libperl.so.5.26)
    #10 0x00007ffa7d528315 Perl_runops_standard (libperl.so.5.26)
    #11 0x00007ffa7d4a810f perl_run (libperl.so.5.26)
    #12 0x00005635e9200eda main (perl)
    #13 0x00007ffa7c2c08a5 __libc_start_main (libc.so.6)
    #14 0x00005635e9200f1e _start (perl)

When trying to debug further in perl code however, I see that this segfault always happens inside a response handler in POE that runs a tar command (basically, after some data is downloaded, we fork using POE::Wheel::Run to extract the archive):

    warn("Data has been downloaded");
    warn("Running tar");
    # Here is where the segfault always happens
    my $child = POE::Wheel::Run->new(
        Program => [
            '/bin/tar',
            # tar args omitted
        ],
        StdoutEvent => "wheel_stdout",
        StderrEvent => "wheel_stderr",
    );

Given where it happens in the perl code, I initially assumed it had something to do with Wheel::Run or tar, but changing that workflow does not fix the issue. Looking at the core dump, I am thinking it's more to do with libcurl, but I am stumped as to what could cause it to segfault like that.

Replies are listed 'Best First'.
Re: POE program running into sporadic segfault
by etj (Priest) on Aug 28, 2024 at 19:58 UTC
    What would really help here is to have debugging symbols on your libc.so and libperl.so, so you could do an "l" and see what exact code was being run in those crashing functions, with what values. On Debian derivatives that should be achievable by installing the appropriate packages.

    A quick surmise from the first stack-trace is that Net::Curl appears to be pthreading to do its thing, then calling a Perl callback. Perl in this context may or may not be thread-safe, which would explain crashes. You haven't published an SSCCE so we can't really tell more than that. You should update to include one.

      I apologize for not being able to provide an SSCCE, as this segfault seems to happen in some multi-threaded context that is hard to reproduce for me at the moment.

      Looking again at the stack trace, I checked libcurl in this environment and realized that it is a pretty old version (7.61), even though it is the standard libcurl for Oracle Linux 8 (supposedly still supported). I've attempted to build and install a newer libcurl from source and use that instead.
Re: POE program running into sporadic segfault
by NERDVANA (Priest) on Aug 29, 2024 at 02:26 UTC
    That looks like Curl created its own thread (in a non-threaded perl?) and then sent a signal and Perl's signal handlers ran in the context of Curl's thread instead of Perl's main thread. This is definitely a bug of some sort because Perl's signal handlers must always run in (one of) Perl's threads.

    I don't know what exactly the solution is here, because I don't know the inner workings of libcurl, but I suspect you need to force curl's thread to block all signals.

    It's an amazing coincidence you would ask this today, because I *just* fixed the same kind of bug last night in my module IO::SocketAlarm. In that code, I was creating a second thread unknown to perl and sending a signal with the intent that Perl's main thread would catch it. It worked on Linux, but on FreeBSD the thread would catch its own signal (using Perl's signal handlers which must be run in the main thread) and die with a segfault, just like your code.

    Assuming you don't want to dig into the XS of the module that gives you libcurl, the next-best way to solve the problem is to block all signals in the main thread prior to starting your curl operations, then unblock afterward. New threads inherit a copy of the signal mask, so then that curl thread will have all signals blocked even after you re-enable them in the main thread. (and you need them enabled in the main thread to catch things like SIGCHLD which wakes POE up to reap child processes)
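    A minimal sketch of that block-then-unblock idea in pure Perl, using sigprocmask from the core POSIX module. The helper names with_signals_blocked and is_blocked are hypothetical, and which Net::Curl::Multi call to wrap is an assumption: it should be whichever call leads libcurl to spawn its resolver thread.

```perl
use strict;
use warnings;
use POSIX qw(sigprocmask SIG_BLOCK SIG_SETMASK SIGCHLD);

# Hypothetical helper: run $code with all signals blocked, then restore
# the previous mask. Threads created inside $code inherit the blocked mask.
sub with_signals_blocked {
    my ($code) = @_;
    my $all = POSIX::SigSet->new;
    $all->fillset;
    my $old = POSIX::SigSet->new;
    sigprocmask(SIG_BLOCK, $all, $old)
        or die "sigprocmask(SIG_BLOCK) failed: $!";

    my @result;
    my $ok  = eval { @result = $code->(); 1 };
    my $err = $@;

    # Restore the saved mask even if $code died.
    sigprocmask(SIG_SETMASK, $old)
        or die "sigprocmask(SIG_SETMASK) failed: $!";
    die $err unless $ok;
    return @result;
}

# Hypothetical helper: report whether $signo is currently blocked
# in the calling thread.
sub is_blocked {
    my ($signo) = @_;
    my $current = POSIX::SigSet->new;
    # Blocking the empty set changes nothing but returns the current mask.
    sigprocmask(SIG_BLOCK, POSIX::SigSet->new, $current)
        or die "sigprocmask query failed: $!";
    return $current->ismember($signo) == 1 ? 1 : 0;
}

# Assumption: in the real program the wrapped code would be whatever
# drives libcurl, for example a $multi->perform or $multi->socket_action call.
my ($inside) = with_signals_blocked(sub { is_blocked(SIGCHLD) });
print "SIGCHLD blocked inside wrapper: $inside\n";
print "SIGCHLD blocked afterwards: ", is_blocked(SIGCHLD), "\n";
```

    Signals that arrive while the mask is in place stay pending and are delivered once the old mask is restored, so POE should still see its SIGCHLD, just slightly later.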

    There's a chance I'm wrong here if libcurl expects to be able to receive signals in its thread as part of normal operation. If that's the case, I don't know what the solution is, other than maybe moving the libcurl stuff to a separate process.

    Edit: Or, try using a threaded perl. A threaded perl I think would have to account for the signal handler running in a random thread, so when compiled for threads it probably uses appropriate synchronization techniques to deliver the signal.
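
    To check whether a given perl was built with thread support, the core Config module can be queried. A small sketch (the variable name $threaded is just for illustration):

```perl
use strict;
use warnings;
use Config;

# $Config{useithreads} is 'define' on an ithreads-enabled build and
# undefined (or empty) otherwise.
my $threaded = ($Config{useithreads} // '') eq 'define' ? 1 : 0;
print "perl $Config{version} built with ithreads: ",
    ($threaded ? 'yes' : 'no'), "\n";
```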

    Update: Looking at my code made me realize I should take my own advice and change the signal mask before starting the thread, to avoid a tiny race condition of the thread receiving a signal before it runs pthread_sigmask.

      To add to these excellent points, and possibly to help repro the crash: the signal being sent to the thread in this case will (I believe) almost certainly be SIGALRM, used to implement a timeout (and as such it wouldn't be maskable from outside, since it's fundamental to how I assume curl works). Therefore, to repro it, you would need a web service that reliably times out. Test::Mojo is very helpful in creating such things.
        My money is still on SIGCHLD, since OP says this seems related to the act of shelling out to tar. SIGALRM is kind of an old-school Unix design, where I expect most modern event-driven libraries will be using select() or poll() with the built-in timeout parameter. SIGCHLD is still very actively used to break out of one of those blocking poll() sleeps when it's time to reap a child process.
Re: POE program running into sporadic segfault
by jeffenstein (Hermit) on Aug 30, 2024 at 06:53 UTC

    Maybe you could also run Net::Curl in a child with POE::Wheel::Run, and pass the result back on stdout with POE::Filter::Stream/Reference? This should at least prevent extra threads in the main process.
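
    A rough sketch of that layout, assuming POE is installed. The sub fetch_in_child and the URL are hypothetical placeholders: the real Net::Curl transfer would go where the placeholder hash is built, so any libcurl threads live only in the child.

```perl
use strict;
use warnings;
use POE qw(Wheel::Run Filter::Reference);

# Hypothetical stand-in for the real Net::Curl transfer. Runs in the
# child only and writes one frozen reference to STDOUT.
sub fetch_in_child {
    my ($url) = @_;
    my %result = (url => $url, status => 'ok');   # placeholder data
    my $filter = POE::Filter::Reference->new;
    binmode STDOUT;
    print STDOUT @{ $filter->put([ \%result ]) };
}

my @results;

POE::Session->create(
    inline_states => {
        _start => sub {
            my ($kernel, $heap) = @_[KERNEL, HEAP];
            my $wheel = POE::Wheel::Run->new(
                Program      => \&fetch_in_child,
                ProgramArgs  => ['http://example.invalid/archive.tar'],
                StdoutFilter => POE::Filter::Reference->new,
                StdoutEvent  => 'child_result',
                StderrEvent  => 'child_stderr',
                CloseEvent   => 'child_closed',
            );
            # Ask POE to reap this child when it exits.
            $kernel->sig_child($wheel->PID, 'child_reaped');
            $heap->{wheel} = $wheel;
        },
        # ARG0 is the thawed reference sent by the child.
        child_result => sub { push @results, $_[ARG0] },
        child_stderr => sub { warn "child: $_[ARG0]\n" },
        # Drop the wheel once the child has closed its output.
        child_closed => sub { delete $_[HEAP]{wheel} },
        child_reaped => sub { },
    },
);

POE::Kernel->run;
print "child returned status '$results[0]{status}' for $results[0]{url}\n";
```

    The wheel is released on CloseEvent rather than on the SIGCHLD handler so that output still in flight is not lost if the child is reaped first.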