in reply to Re^6: baton passing threads and cond_signal
in thread baton passing threads and cond_signal

If first thread signals, but gets interrupted before it can loop back to re-lock the variable and reenter the wait

If that's the problem (and I can't see what else it could be), then all you have to do is change

    while (1) {
        lock ($baton);
        cond_wait ($baton) until $baton == $id;
        ...
        cond_broadcast ($baton);
    }
to
    lock ($baton);
    while (1) {
        cond_wait ($baton) until $baton == $id;
        ...
        cond_broadcast ($baton);
    }

While you should make that change, I don't think that's the problem. A missed signal only causes trouble if it's sent between the time $baton == $id is checked and the time cond_wait blocks. The purpose of locking $baton is to create the mutual exclusion that should prevent this from happening.
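For reference, here is a minimal runnable sketch of the second form, with the lock taken once outside the loop. The thread count, the number of rounds, the @order log and the round-robin hand-off are assumptions made for illustration; they are not taken from the original poster's code.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $baton :shared = 0;   # id of the thread allowed to run next
my @order :shared;       # hypothetical log of who ran, for inspection
my $nthreads = 3;        # assumed thread count
my $rounds   = 4;        # assumed number of turns per thread

sub worker {
    my ($id) = @_;
    lock($baton);        # taken once; cond_wait releases it while blocked
    for my $round (1 .. $rounds) {
        # The predicate is checked with the lock held, so a broadcast
        # cannot slip in between the check and the wait.
        cond_wait($baton) until $baton == $id;
        push @order, $id;                  # this thread's "turn"
        $baton = ($id + 1) % $nthreads;    # pass the baton on
        cond_broadcast($baton);
    }
}                        # lock released when the sub's scope ends

my @workers = map { threads->create(\&worker, $_) } 0 .. $nthreads - 1;
$_->join for @workers;
print "@order\n";
```

Because every read and write of $baton happens under the lock, the turns should come out in strict round-robin order regardless of how the threads are scheduled.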

The problem might be that your system's implementation of cond_wait isn't atomic (while it should be), allowing a signal to come in after cond_wait unlocks $baton, but before cond_wait starts waiting.

You could give yourself a safety net by using cond_timedwait (cond_wait with a timeout) to check $baton periodically.
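A rough sketch of that safety net, assuming a one-second polling interval. The $passer thread is hypothetical and deliberately never signals, to stand in for a lost wakeup. Note that cond_timedwait takes an absolute time in epoch seconds, not a duration, and returns false when it times out.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $baton :shared = 0;
my $id = 1;              # hypothetical id of the waiting thread

# Hypothetical passer: hands over the baton but never signals,
# which simulates a wakeup that got lost.
my $passer = threads->create(sub {
    sleep 1;
    lock($baton);
    $baton = $id;
});

my $wakeups = 0;
{
    lock($baton);
    until ($baton == $id) {
        # Wake on a signal or at the deadline, whichever comes first,
        # then loop round and re-check the predicate.
        cond_timedwait($baton, time() + 1);
        $wakeups++;
    }
}
$passer->join;
print "got the baton after $wakeups wakeup(s)\n";
```

With a plain cond_wait this waiter would block forever; with the timeout it notices the change at the next deadline. The cost is some latency and a few spurious wakeups, so it is a workaround rather than a fix.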

Update: Did some rephrasing and added the second-to-last paragraph.

Re^8: baton passing threads and cond_signal
by Anonymous Monk on Aug 22, 2007 at 16:25 UTC

    If one thread maintains a persistent lock on the variable, how would the other thread ever obtain a lock? Without obtaining a lock, cond_wait doesn't work.

      There's no persistent lock. cond_wait releases the lock, allowing other threads to cond_broadcast. cond_wait re-obtains the lock on awakening.
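A small sketch that shows this behaviour. The $flag variable and the one-second sleep are assumptions made for the demonstration: the waiter takes the lock and blocks in cond_wait, yet the main thread can still lock the same variable and broadcast.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $flag :shared = 0;    # hypothetical condition variable

my $waiter = threads->create(sub {
    lock($flag);
    # cond_wait unlocks $flag while blocked and re-locks it before returning.
    cond_wait($flag) until $flag;
    return "woken";
});

sleep 1;                 # give the waiter time to block (assumed to be enough)
{
    lock($flag);         # succeeds because the waiter is inside cond_wait
    $flag = 1;
    cond_broadcast($flag);
}                        # the waiter can only re-obtain the lock after this

my $result = $waiter->join;
print "$result\n";
```

If cond_wait kept the lock while blocked, the lock($flag) in the main thread would never return and the program would hang.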

      You should reread my node. I added this critical paragraph:

      The problem might be that your system's implementation of cond_wait isn't atomic (while it should be), allowing a signal to come in after cond_wait unlocks $baton, but before cond_wait starts waiting.

        The problem might be that your system's implementation of cond_wait isn't atomic

        Which would be a bug worth reporting. An easy way to verify this would be to run the code above without the yields on a single-CPU Linux system and see whether the problem occurs there.