phatDissonance has asked for the wisdom of the Perl Monks concerning the following question:

O keepers of infinite wisdom: I am working on a script which executes the following system call within a loop, iterations of which are counted by $iter:
    CHARMM:
    print "running charmm for iteration $iter with command: ~u0588832/bin/charmm < enm.calc.inp > charmm-out-iter-$iter.out\n";
    system("~u0588832/bin/c32b2/exec/gnu/charmm < enm.calc.inp > charmm-out-iter-$iter.out");
    if ($? == -1) {
        print "failed to execute: $!\n";
    }
    elsif ($? & 127) {
        printf "child died with signal %d, %s coredump\n",
            ($? & 127), ($? & 128) ? 'with' : 'without';
    }
    else {
        printf "child exited with value %d\n", $? >> 8;
    }
    rename("ic-fluc.dat", "ic-fluc-iter-$iter.dat");
    open(IN, "ic-fluc-iter-$iter.dat") || die "could not open internal coord fluc file.\n";
The odd thing is that the child process runs just fine many thousands of times, then fails. The code exits on the open when it can't find the .dat file, which is an output of the system call. Some more info:
(1) The code seems to crash randomly, i.e., on a different iteration each time. It almost never makes it past about 10,000 iterations.
(2) The child process runs fine if I paste the stuff inside the quotes onto the command line once the code crashes, i.e., all the input files for the system call are in order.
(3) The value of $? is 0, even when the code crashes.
(4) If I add the snippet
    if (!(-e "ic-fluc.dat")) {
        print "can't find ic-fluc.dat at iter $iter. rerunning charmm.\n";
        goto CHARMM;
    }
before the rename, the code loops endlessly through the goto once the system call fails the first time. This is really driving me nuts. Seeking transcendence, ASAP.

Replies are listed 'Best First'.
Re: strange failure of a system call
by jbert (Priest) on Nov 12, 2007 at 20:36 UTC
    Perhaps:

    • the child process always exits with 0, even if it fails to produce output
    • it fails unpredictably in this way (so your cut-and-paste works). Perhaps because of a bug or intermittent memory allocation failure due to system load?
    One way to get more information (although it may make the whole thing run so slowly that the problem no longer reproduces usefully) would be to run your script under strace -f. When the script exits, you should see the exit code of the child process and also whether it died due to a signal or not.
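    If the first guess above is right, checking $? alone cannot catch the failure, so it may help to check the output file as well. A minimal sketch, assuming a hypothetical helper named run_and_check (not part of the original script) and treating "missing or empty file" as failure:

```perl
use strict;
use warnings;

# Hypothetical helper: run a command, then report both how the child ended
# and whether the expected output file actually appeared.
sub run_and_check {
    my ($cmd, $expected_file) = @_;

    system($cmd);
    my $status = $?;      # save at once, before anything can overwrite it
    my $error  = "$!";    # only meaningful when $status == -1

    my %result = (
        failed_to_start => ($status == -1) ? 1 : 0,
        error           => $error,
        signal          => ($status == -1) ? 0  : ($status & 127),
        exit_code       => ($status == -1) ? -1 : ($status >> 8),
        file_ok         => (-s $expected_file) ? 1 : 0,   # exists and is non-empty
    );

    # Success requires a clean exit AND the output file.
    $result{ok} = ( !$result{failed_to_start}
                 && $result{signal} == 0
                 && $result{exit_code} == 0
                 && $result{file_ok} ) ? 1 : 0;

    return \%result;
}

# Demo with a harmless command (the running perl itself) instead of charmm.
my $r = run_and_check(qq{$^X -e 1}, "ic-fluc.dat");
printf "exit=%d signal=%d file_ok=%d ok=%d\n",
    @{$r}{qw(exit_code signal file_ok ok)};
```

    A result with exit_code 0 and file_ok 0 would point at the child itself reporting success without writing its output, rather than at Perl's system call.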
Re: strange failure of a system call
by Anonymous Monk on Nov 12, 2007 at 22:32 UTC
    It would probably make sense to try to trap the error at the rename and see if the filesystem is doing something weird and not keeping up:
    RENAME: {
        rename("ic-fluc.dat", "ic-fluc-iter-$iter.dat") and last RENAME;
        print "retrying iter $iter\n";
        sleep 1;
        redo RENAME;
    }
    If the file still isn't showing up, extend the block to wrap your system call as well, so that it is re-executed on each retry. -Greg
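    Since the goto version in the question loops forever once the file goes missing, a retry that re-runs the command probably wants an upper bound. A rough sketch, assuming a hypothetical helper retry_until_file and an arbitrary limit chosen by the caller (neither is from the original script):

```perl
use strict;
use warnings;

# Hypothetical helper: call $run->($try) until $file exists and is non-empty,
# giving up after $max_tries attempts. Returns the number of the attempt that
# succeeded, or 0 if every attempt failed.
sub retry_until_file {
    my ($max_tries, $file, $run, $delay) = @_;
    $delay = 1 unless defined $delay;    # seconds to wait between attempts

    for my $try (1 .. $max_tries) {
        $run->($try);
        return $try if -s $file;         # output is there: done
        last if $try == $max_tries;      # out of attempts: stop quietly
        print "attempt $try of $max_tries: $file missing or empty, retrying\n";
        sleep $delay if $delay;
    }
    return 0;
}

# Possible use inside the loop from the question (left commented out because
# it needs charmm and its input files):
#
#   my $tries = retry_until_file(5, "ic-fluc.dat", sub {
#       system("~u0588832/bin/c32b2/exec/gnu/charmm < enm.calc.inp > charmm-out-iter-$iter.out");
#   });
#   die "no ic-fluc.dat at iter $iter after 5 attempts\n" unless $tries;
#   rename("ic-fluc.dat", "ic-fluc-iter-$iter.dat")
#       or die "rename failed at iter $iter: $!\n";
```

    Dying after a fixed number of attempts leaves the charmm-out-iter-$iter.out file from the failing iteration in place for inspection, which an endless loop would keep overwriting.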