rovf has asked for the wisdom of the Perl Monks concerning the following question:

I create processes (on Windows) using something like $pid=system(1,'perl.exe "prog.pl" "x" "y" "z" 2>&1'); (i.e. the command always needs to invoke cmd.exe implicitly - I mention this because it might be relevant to the problem), and from time to time use waitpid($pid,WNOHANG) to poll this process. If waitpid returns -1, I assume that the process has been ended.

This works well so far. However, I now had once the following weird behaviour: waitpid returned -1, but the process terminated 8 seconds later (and actually did useful work during this 8 seconds). I found out because we do plausibility checks: The processes do some logging on their activities, and at the last step, the child process creates a certain file. The main process, after receiving -1 from waitpid, checks for the existence of this file, and issues an error message if this control file does not exist.

In this failure case, I see from the log file of the masterprocess, that at a certain time X waitpid had returned -1, but the control file did not exist, while the log file of the child process asserted that the control file had been created at time X+8 seconds.

From this I conclude that waitpid had signalled the end of the child process while this was still running. Now we know that in order to run the child process, a cmd.exe process must have been created (because the argument passed to system(1,...) uses shell quoting and redirection), and that cmd.exe then in turn creates a process for the command I had passed to system(). We also know that cmd.exe and its sub-process run in the same process group (at least this is how system(1,...) is supposed to work). The symptom I have experienced, looks as if cmd.exe had terminated earlier, before perl.exe "prog.pl".... was finished. This seems bizarre, and makes me a bit nervous. Any explanations for this?

-- 
Ronald Fischer <ynnor@mm.st>

Replies are listed 'Best First'.
Re: waitpid returns -1 for still running child (Windows)
by BrowserUk (Patriarch) on Jul 06, 2010 at 11:39 UTC

    I cannot reproduce your results with either 5.8.9 or 5.10.1:

    c:\test>\perl32\bin\perl.exe -wle"print $$, ':', $p=system 1, 'perl - +le\"print qq[??$$]; sleep 10; print q[exiting]\"'; print waitpid $p, +1 while sleep 1" 13960:14028 ??13984 0 0 0 0 0 0 0 0 0 0 exiting 14028 -1 -1 -1 Terminating on signal SIGINT(2)

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I cannot reproduce your results ...
      Neither do I. That's the problem. I would be much happier if I could reproduce it, but in general, it runs well. I have found one case so far, where it failed, but this case is documented well. It seems to be a rare case, so I hoped someone with good Windows knowledge would know of some exotic circumstance where this problem might show up on Windows.

      -- 
      Ronald Fischer <ynnor@mm.st>
        I hoped someone with good Windows knowledge would know of some exotic circumstance where this problem might show up on Windows.

        How is the logging done? Ie. Are they simple prints to the respective log files directly from the applications, or are you using some centralised logging mechanism via pipes or sockets?

        How & where are the timestamps added?

        As for tracking it down more thoroughly, I'd start off by creating a simple UDP logging deamon and have both (all) apps send their messages to that. I'd time stamp them at both ends.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.