psini has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I experienced a strange problem this morning on a production server running a web service program based on Net::Server::PreFork.

Disclaimer: I can't reproduce the problem and I can't post the relevant code (the entire program is about 20k lines long), so my question is a much more general one: "what could have happened?" or, better, "what could I monitor to have more info next time it happens?".

Now for what happened. This morning one of my customers called to say that the program was down. I connected to his server and found that the daemon was not running; a quick search through syslog gave the following result:

Jun 16 09:50:55 lxinf15 data_server[17854]: 2008/06/16-09:50:55 CONNECT TCP Peer: "127.0.0.1:58684" Local: "127.0.0.1:9999"
Jun 16 09:50:55 lxinf15 data_server[14582]: Starting "1" children
Jun 16 09:50:55 lxinf15 data_server[18578]: Child Preforked (18578)
Jun 16 09:50:55 lxinf15 data_server[17854]: Parent process gone away. Shutting down
Jun 16 09:50:55 lxinf15 data_server[17257]: 2008/06/16-09:50:55 CONNECT TCP Peer: "127.0.0.1:58685" Local: "127.0.0.1:9999"
Jun 16 09:50:55 lxinf15 data_server[17257]: Parent process gone away. Shutting down
Jun 16 09:50:56 lxinf15 data_server[17494]: 2008/06/16-09:50:56 CONNECT TCP Peer: "127.0.0.1:58687" Local: "127.0.0.1:9999"
Jun 16 09:50:56 lxinf15 data_server[17494]: Parent process gone away. Shutting down
Jun 16 09:50:57 lxinf15 data_server[14048]: 2008/06/16-09:50:57 CONNECT TCP Peer: "127.0.0.1:58688" Local: "127.0.0.1:9999"
Jun 16 09:50:57 lxinf15 data_server[14048]: Parent process gone away. Shutting down
Jun 16 09:50:57 lxinf15 data_server[1518]: 2008/06/16-09:50:57 CONNECT TCP Peer: "127.0.0.1:58689" Local: "127.0.0.1:9999"
Jun 16 09:50:57 lxinf15 data_server[1518]: Parent process gone away. Shutting down
Jun 16 09:51:01 lxinf15 /USR/SBIN/CRON[18585]: (www-data) CMD (/usr/bin/php4-cgi -q /var/systes/Sister/www_sister/pages/rapporti/RapportiBatch.php)
Jun 16 09:51:04 lxinf15 data_server[18578]: 2008/06/16-09:51:04 CONNECT TCP Peer: "127.0.0.1:58691" Local: "127.0.0.1:9999"
Jun 16 09:51:04 lxinf15 data_server[18578]: Parent process gone away. Shutting down
Jun 16 09:51:15 lxinf15 data_server[17663]: Parent process gone away. Shutting down

data_server is my daemon process (yes, my names are always that original); it seems that at 09:50:55 the server received a connection, spawned a child (PID=18578) and then silently died. Over the next 20 seconds the children noticed that the parent was gone and shut down one after another.

What I don't understand is why the daemon died and why there is no trace of its death in the logs.

This server has been in production for more than a month, serving several thousand calls every day, and its sibling (at another location) has been up for three months with at least double the network load. Not to mention the development and test servers... And I have never had such a problem before.

I'm totally baffled. Does anybody have even a faint idea of what I can try?

Careful with that hash Eugene.

Replies are listed 'Best First'.
Re: Problem with Net::Server::Prefork - Server died w/o apparent reason
by TGI (Parson) on Jun 16, 2008 at 19:12 UTC

    The first thing that came to mind was an untrapped SIGPIPE, but it appears that Net::Server::PreFork handles that for you.

    Could another fatal signal be getting sent on your system?

    Here's a list of default actions for signals on Linux systems, taken from kernel/signal.c (kernel 2.6.21.1).

    +--------------------+-----------------+
    | POSIX signal       | default action  |
    +--------------------+-----------------+
    | SIGHUP             | terminate       |
    | SIGINT             | terminate       |
    | SIGQUIT            | coredump        |
    | SIGILL             | coredump        |
    | SIGTRAP            | coredump        |
    | SIGABRT/SIGIOT     | coredump        |
    | SIGBUS             | coredump        |
    | SIGFPE             | coredump        |
    | SIGKILL            | terminate(+)    |
    | SIGUSR1            | terminate       |
    | SIGSEGV            | coredump        |
    | SIGUSR2            | terminate       |
    | SIGPIPE            | terminate       |
    | SIGALRM            | terminate       |
    | SIGTERM            | terminate       |
    | SIGCHLD            | ignore          |
    | SIGCONT            | ignore(*)       |
    | SIGSTOP            | stop(*)(+)      |
    | SIGTSTP            | stop(*)         |
    | SIGTTIN            | stop(*)         |
    | SIGTTOU            | stop(*)         |
    | SIGURG             | ignore          |
    | SIGXCPU            | coredump        |
    | SIGXFSZ            | coredump        |
    | SIGVTALRM          | terminate       |
    | SIGPROF            | terminate       |
    | SIGPOLL/SIGIO      | terminate       |
    | SIGSYS/SIGUNUSED   | coredump        |
    | SIGSTKFLT          | terminate       |
    | SIGWINCH           | ignore          |
    | SIGPWR             | terminate       |
    | SIGRTMIN-SIGRTMAX  | terminate       |
    +--------------------+-----------------+
    | non-POSIX signal   | default action  |
    +--------------------+-----------------+
    | SIGEMT             | coredump        |
    +--------------------+-----------------+

    If you aren't running Linux, your vendor should have a similar list available.
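
To find out whether one of the "terminate" signals in the table is the culprit, one option is to log every catchable signal before letting it take effect. The following is only a minimal sketch under assumptions: install_signal_logging and syslog_logger are hypothetical names, the 'data_server' ident is copied from the syslog excerpt, and the list of signals is a guess at what is safe to override. Net::Server installs handlers of its own for some signals (SIGPIPE is mentioned above), so inside the real daemon you would have to check which names are free. SIGKILL and SIGSTOP cannot be caught by any handler, so a SIGKILL would still leave no trace.

```perl
use strict;
use warnings;
use Sys::Syslog qw(openlog syslog);

# Signals whose default action is "terminate" and that a handler can intercept.
# KILL and STOP are left out: they can never be caught.
my @CATCHABLE = qw(HUP INT USR1 USR2 ALRM TERM);

# install_signal_logging (hypothetical name): log each signal, then run $then.
# By default $then restores the default action and re-raises the signal,
# so the process still dies the way it would have without the handler.
sub install_signal_logging {
    my ($log, $then) = @_;
    $then ||= sub {
        my ($name) = @_;
        $SIG{$name} = 'DEFAULT';
        kill $name, $$;
    };
    for my $name (@CATCHABLE) {
        $SIG{$name} = sub {
            $log->("PID $$ received SIG$name");
            $then->($name);
        };
    }
    return;
}

# Example logger that writes to syslog (assumed ident and facility).
sub syslog_logger {
    openlog('data_server', 'pid', 'daemon');
    return sub { syslog('warning', '%s', $_[0]) };
}
```

In a standalone script, install_signal_logging(syslog_logger()) would be called once at startup, before the main loop.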


    TGI says moo

      Yes, I'm running Linux (Debian Sarge), and I don't know of anything on the system that could send a signal to my process.

      Somebody suggested that the kernel could have SIGKILLed the process if it ran out of memory, but I don't really believe that is the case: the server has plenty of memory, and if my program had a memory leak I should have found it before now (the other server has less memory, more load and a longer uptime).

      Moreover, AFAIK on a SIGKILL Net::Server should write to syslog that it is shutting down, and it didn't.

      Careful with that hash Eugene.

Re: Problem with Net::Server::Prefork - Server died w/o apparent reason
by jethro (Monsignor) on Jun 16, 2008 at 19:39 UTC
    I wouldn't be too concerned as long as it happens only once. Hardware is bound to fail occasionally, if not from power spikes or induced noise then from natural radiation flipping random bits in your memory.

      Yes, that is what I told my customer :)

      But I would like to be prepared if it happens again.

      Careful with that hash Eugene.
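
One way to be prepared, sketched here under assumptions: start the daemon from a small supervising process that waits for it and records how it ended. The wait status tells a normal exit apart from death by signal, including the uncatchable SIGKILL that the daemon itself can never log. The names run_supervised and describe_status are hypothetical, and the sketch assumes the daemon stays in the foreground as a direct child of the supervisor rather than backgrounding itself.

```perl
use strict;
use warnings;
use POSIX qw(WIFEXITED WEXITSTATUS WIFSIGNALED WTERMSIG);

# describe_status (hypothetical helper): turn a wait status ($?) into text.
sub describe_status {
    my ($status) = @_;
    if (WIFEXITED($status)) {
        return 'exited with code ' . WEXITSTATUS($status);
    }
    if (WIFSIGNALED($status)) {
        my $core = ($status & 128) ? ' (core dumped)' : '';
        return 'killed by signal ' . WTERMSIG($status) . $core;
    }
    return "unknown wait status $status";
}

# run_supervised (hypothetical helper): start a command, wait for it to end,
# and return a description of how it ended.
sub run_supervised {
    my (@cmd) = @_;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: replace ourselves with the command; exit 127 if exec fails.
        exec { $cmd[0] } @cmd or POSIX::_exit(127);
    }
    waitpid($pid, 0);
    return describe_status($?);
}
```

The returned string could then be written to syslog by the supervisor, so that the parent's death leaves a line in the log even when the daemon had no chance to write one.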