gnork has asked for the wisdom of the Perl Monks concerning the following question:

Problem has been solved

Dear monks,

I have some scripts runnning as server processes. The workload has increased, thats why I would like to switch from serial processing to "fork after accept".

To guarantee every child his own DB/RPC Handles, new connections are made in every child process.
Child sockets are closed in the parent, parent sockets are closed in the childs.
sub new_client { my $sock = $main_socket->accept(); my $pid = fork(); unless( defined $pid && $pid == 0 ) { $sock->close; } if ($pid == 0) { $main_socket->close; # close main socket in child #process_request . . } }
Remark: code is derived from Msg.pm (Advanced perl programming) and the codebase for both cases is IDENTICAL except for one sub (to be found in readmore).
Under certain circumstances (reproducible by calling the variant 1 before server start) I end up with bad file decriptors in the child processes, while variant 2 works. accept/fork are done right, but select() on the filehandles for the incoming rpc connection returns EBADF. The parent is still alive and doesn't close any file descriptor it should not.

Warning, lots of stuff to read

package MyStuff::Database; %db_conns = (); sub _DBconnect_internal { my ($dsn, $user, $password, $host, $tag, $options) = @_; if (not exists $db_conns{"$tag:initialized"}) { $db_conns{"$tag:data"}->{conn} = [$dsn, $user, $password, $host]; + $db_conns{"$tag:initialized"} = 1; &Log("DB Handle $tag ready for connect.") if $_debug > 5; return; } #Connecting happens here . . . } ##END OF PACKAGE use MyStuff::Database qw(%db_conns _DBconnect_internal _DoQuery); Variant 1 sub DBConnect { _DBconnect_internal('DBI:mysql:database=test;host=localhost;port=', +'foo', 'bar', 'TestDB', 'mysql'); } . . &DBConnect; &server_start; . . Variant 2 use MyStuff::Database qw(%db_conns _DBconnect_internal _DoQuery); sub DBConnect { $MyStuff::Database::db_conns{"mysql:data"}->{conn}= ['DBI:mysql:data +base=test;host=localhost;port=', 'foo', 'bar']; $MyStuff::Database::db_conns{"mysql:initialized"} = 1; } . . &DBConnect; &server_start; . .

My head is about to explode. Problem is, if I start the server calling the variant 2, everything works fine.
Calling variant 1 leads to EBADF in select();
The strange thing is both variants do the same thing, stuff the connect data into %db_conns in the parent process. NO handles are created or connected at this stage. When calling _DBconnect_internal again in the child process, it actually connects to the datasource using that data. I cant see anything concerning filehandles there.

For compatibility reasons I have to stick with variant 1.

These are the strace excerpts:
Working fork: clone(Process 22290 attached (waiting for parent) Process 22290 resumed (parent 22276 ready) child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, +child_tidptr=0xb7d4e908) = 22290 [pid 22290] close(3 <unfinished ...> [pid 22276] close(4 <unfinished ...> [pid 22290] <... close resumed> ) = 0 [pid 22276] <... close resumed> ) = 0 [pid 22276] select(8, [3], NULL, NULL, NULL <unfinished ...> [pid 22290] open("/usr/share/locale/en_US.ISO-8859-15/LC_MESSAGES/libc +.mo", O_RDONLY) = -1 ENOENT (No such file or directory) [pid 22290] open("/usr/share/locale/en_US.iso885915/LC_MESSAGES/libc.m +o", O_RDONLY) = -1 ENOENT (No such file or directory) [pid 22290] open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDON +LY) = -1 ENOENT (No such file or directory) [pid 22290] open("/usr/share/locale/en.ISO-8859-15/LC_MESSAGES/libc.mo +", O_RDONLY) = -1 ENOENT (No such file or directory) [pid 22290] open("/usr/share/locale/en.iso885915/LC_MESSAGES/libc.mo", + O_RDONLY) = -1 ENOENT (No such file or directory) [pid 22290] open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) + = -1 ENOENT (No such file or directory) [pid 22290] stat64("/dev/log", {st_mode=S_IFSOCK|0666, st_size=0, ...} +) = 0 [pid 22290] socket(PF_FILE, SOCK_STREAM, 0) = 3 [pid 22290] ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbffb3cb8) = -1 E +INVAL (Invalid argument) [pid 22290] _llseek(3, 0, 0xbffb3cf0, SEEK_CUR) = -1 ESPIPE (Illegal s +eek) [pid 22290] ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbffb3cb8) = -1 E +INVAL (Invalid argument) [pid 22290] _llseek(3, 0, 0xbffb3cf0, SEEK_CUR) = -1 ESPIPE (Illegal s +eek) [pid 22290] fcntl64(3, F_SETFD, FD_CLOEXEC) = 0 [pid 22290] connect(3, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0 [pid 22290] time(NULL) = 1184077806 [pid 22290] time(NULL) = 1184077806 [pid 22290] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=22 +95, ...}) = 0 [pid 22290] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=22 +95, ...}) = 0 [pid 22290] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=22 +95, ...}) = 0 [pid 22290] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=22 +95, ...}) = 0 [pid 22290] select(8, [3], NULL, [3], {0, 0}) = 0 (Timeout) [pid 22290] write(3, "<13>Jul 10 16:30:06 rpcsrv[22290"..., 66) = 66 [pid 22290] time(NULL) = 1184077806 [pid 22290] getpeername(4, {sa_family=AF_INET, sin_port=htons(51130), +sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 [pid 22290] getpeername(4, {sa_family=AF_INET, sin_port=htons(51130), +sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 [pid 22290] select(8, [3 4], NULL, NULL, NULL) = 1 (in [4]) [pid 22290] read(4, "\0\0\0$", 4) = 4 [pid 22290] read(4, "FrT;@5|$1|>$1|1$1|s$9|main::inc$"..., 36) = 36 . . . .

This is from another process using DBConnect_not_ok before server start
Not working clone(Process 22302 attached (waiting for parent) Process 22302 resumed (parent 22300 ready) child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, +child_tidptr=0xb7d1b908) = 22302 [pid 22302] close(4 <unfinished ...> [pid 22300] close(5 <unfinished ...> [pid 22302] <... close resumed> ) = 0 [pid 22300] <... close resumed> ) = 0 [pid 22300] select(8, [4], NULL, NULL, NULL <unfinished ...> [pid 22302] time(NULL) = 1184077872 [pid 22302] time(NULL) = 1184077872 [pid 22302] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=22 +95, ...}) = 0 [pid 22302] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=22 +95, ...}) = 0 [pid 22302] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=22 +95, ...}) = 0 [pid 22302] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=22 +95, ...}) = 0 [pid 22302] select(8, [3], NULL, [3], {0, 0}) = 0 (Timeout) [pid 22302] write(3, "<13>Jul 10 16:31:12 rpcsrv[22302"..., 66) = 66 [pid 22302] time(NULL) = 1184077872 [pid 22302] getpeername(5, {sa_family=AF_INET, sin_port=htons(51131), +sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 [pid 22302] getpeername(5, {sa_family=AF_INET, sin_port=htons(51131), +sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 [pid 22302] select(8, [4 5], NULL, NULL, NULL) = -1 EBADF (Bad file de +scriptor) . . .
Certain (failing) ioctl calls are missing in the second strace log, thats what I can see from that. Why is beyond my actual understanding.
I tried to find a solution for almost a week now, Im out of options. Thanx for any hint on how to solve that.

Best regards,
gnork

cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
errors->(c)

Replies are listed 'Best First'.
Re: Mysterious bad file descriptor after fork
by sgt (Deacon) on Jul 11, 2007 at 08:24 UTC

    Well you did not tell us the OS. First I would look at the conditions to get EBADF error conditions with select. Probably you got some error condition that invalidate read or write on the "latest selected desc".

    Then you seem to be using a mixed scheme with an accept-fork loop and a select loop. usually I tend to use one or the other i.e

  • accept+fork server and deal with the request in each child, need to take care of communication back to the father (pre-forked schemes are useful too especially in conjunction with DBI).
  • polling server (select loop) you add a correct "accept"ed desc to your set of valid desc and let the OS poll for you. if you don't trust your clients it's probably better to go non-blocking which means you need to handle partial reads. To do that correctly you need something like Lincoln Stein's IO::SessionData (part of SOAP::Lite), IO::Multiplex is useful too. If the OS is linux some even loop based on epoll could be a nice way to handle things. There is a module based on libevent.
  • there are threading models too, but I don't use that in production (yet)
  • So I don't understand your grand scheme of things...can you explain a bit more. In any case the "sharing" of data between father and child you seem to imply seems suspicious to me: it can only be a "one-go" thing anyway, why not do the necessary init in the child anyway?

    One basic rule is also that given a desc opened (with the close-on-exec flag not set) in the father, only the child or the father must use it at any time, meaning better to close it in the father if meant for the child. Are you sure your code comply with that? Can you post a representative snippet we could try? If you use a file instead of a db, does it work?

    You could also have a look to Net::Server.

    cheers --stephan
      Yippie, after one week, and loud thinking while writing on perl monks I finally found the problem.

      In the event_loop a select() iterates over two arrays/IO::Select objects ($rd_handles, $wt_handles).

      After server start $rd_handles contains only the "$main_socket" and the parent selects only for that socket. After forking, the sockets are closed (child socket in parent, main_socket in child) BUT the $main_socket remained in $rd_handles of the child process. This caused the EBADF, since there was a closed fdesc in that array. It was always there, I dont understand why one variant worked when the other didn't. Strangely enough, the first step to correct that, $rd_handles->remove($main_socket) didn't work. The $main_socket was still in the array along with valid sockets and I still got EBADF from select().

      So I decided to undef $rd_handles/$wt_handles, recreate the two objects and add the child socket there.
      That fixed it. Thanks for your input, sometimes I have to think loud and none of my coworkers is suitable for that.

      As for the architecture... it is a combination of accept once / select over a set of sockets added to rd_/wt_handles after accept and callback procedures registered for each of these sockets. Now when a message arrives over a socket, the associated procedure is called and the result is send back to the client.

      It took me a while to understand the inner workings of Msg.pm (props to Mr. Sriram Srinivasan). The module wasn't designed with a forking solution in mind and in it's original form is not intended for production use (as stated by the author). For a long time I avoided messing with it, but now I had to. Learned a lot while doing so.

      Best regards
      gnork

      cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
      errors->(c)
Re: Mysterious bad file descriptor after fork
by almut (Canon) on Jul 10, 2007 at 19:20 UTC

    It's somewhat hard to tell without the full context ...  but just one idea to get the discussion going :)   Are you checking the return value of the accept() call?  In some circumstances, accept may return undef (without having gotten any connection attempt, that is). In case of EINTR ($! eq "Interrupted system call") you just have to call accept again. This is more of an issue on some platforms (e.g. AIX) than on others (Linux), because some may automatically resume interrupted system calls — but in case of doubt, be prepared to handle it yourself...   (Google for '"interrupted system call" EINTR accept' (or some such) for more background.)

    In any case, what I just said would only make sense if the problem is occurring sporadically (I'm not quite sure how to read your "Under certain circumstances"). If it reproducibly occurs with DBConnect_not_ok, then something else will likely be the problem ...

      "Certain circumstances" refers to calling DBConnect_no_ok instead of DBConnect_ok before server start. I edited my post to better reflect that.

      I will look into the return value of accept().

      Thanks for the hint, will report back.

      cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
      errors->(c)
        The return value of accept() is now checked and properly treated. The error persists, so it was not the root of the evil.

        Best rgds
        gnork

        cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
        errors->(c)