in reply to system >> 8 is non-zero when child exits with exit(0)

once >>8 is applied to the value returned by system

First, note you should be checking the whole return value of system, not just the upper byte. If the call was successful, its whole return value should be zero, so you might actually be missing error conditions. Once the return value of system is nonzero, then you can inspect it, as the docs show. You can also consider using IPC::System::Simple, which provides a drop-in replacement for system with nicer error handling.

I've been unable to either reliably reproduce the behaviour (it just happens sometimes in processes running on the server) nor stop it happening. ... I've not been able to find a known bug

IMHO, I would consider it likely then that it's actually the subprocesses that is in fact failing somehow. Adding detailed logging to the process doing the system would probably help in hunting down the problem. For that, consider also capturing the output of the process being run, especially its STDERR, using e.g. Capture::Tiny.

Replies are listed 'Best First'.
Re^2: system >> 8 is non-zero when child exits with exit(0)
by bjdean (Novice) on Mar 09, 2021 at 14:56 UTC

    Thanks for your thoughts!

    I have been checking the full value returned from system as well - I should have mentioned that.

    This is a new one on me - I've not seen it in 20 something years of working with perl. I agree that there's something happening in the child process (ie I'm quite sure the return value of system is correct), however the very last thing to happen in these failing processes is basically:

    logprint("Process completed successfully"); exit(0); # As part of testing we've even added exit with an explicit 0 + as here

    And yet the calling processing detects a non-zero return code - and (having been logging this return code in detail) it is always a return code of 1 (after >>8) or to quote some debugging output we've added:

    DEBUG TEST 1 uses the if/then/else described in perldoc -f system: DEBUG TEST 1 CASE(else): child exited with value 1 DEBUG TEST 2 uses the POSIX::W* checks of perlvar ${^CHILD_ERROR_NATIV +E} which is: 256 (binary: 0000000100000000) DEBUG TEST 2: POSIX::WIFEXITED returned true (child exited normally) DEBUG TEST 2: POSIX::WEXITSTATUS returned: 1 (binary: 0000000000000001 +)

      That's somewhat strange.

      the very last thing to happen in these failing processes is basically:

      Are you sure about that? I would guess that you're seeing your "Process completed successfully" message in your logs, but can logprint() fail in any other way? Have you tried capturing the process's STDOUT and STDERR to inspect that for any messages?

      Another thought: What does your system call look like? Is it a single string, as in system("/path/to/script.pl")? Are you using any shell features in the system call, e.g. system("/path/to/script.pl | grep ...")? Because in that case something could theoretically be going wrong with the shell (The problem of "the" default shell).

        the very last thing to happen in these failing processes is basically:
        Are you sure about that? I would guess that you're seeing your "Process completed successfully" message in your logs, but can logprint() fail in any other way? Have you tried capturing the process's STDOUT and STDERR to inspect that for any messages?

        Am I sure? 99.9% yes - this particular logging function is very simple and on the local filesystem where there are no other issues (and plenty of other code succeeding doing this exact same thing most of the time).

        You make an interesting point about the output! While we monitor STDOUT/STDERR in some cases we also throw it away in others so I've added some extra monitoring there.

        Another thought: What does your system call look like? Is it a single string, as in system("/path/to/script.pl")?

        Yep - in this case it's a string passed to the shell, so the shell problem does come into play so I'll have a look at that angle in a little more detail.

        What does seem to be consistent is that where a single script is making multiple system calls, when the problem happens it keeps happening to that script (so some sort of state somewhere is going squiffy). But you I log into the same server at the same time, run the same code and the problem is not reproducible (the best kind of bug, if you're a bug!).