in reply to Re^4: baton passing threads and cond_signal
in thread baton passing threads and cond_signal

The problem may be in differences in thread implementation on unix and windows. If, for example, cond_broadcast() "nests" in one implementation and not in the other, you may get cascading broadcasts. i.e.

Say you've got 10 threads waiting on the same condition like in your code. When you broadcast, you'll wake up all threads one after another. But in the mean time, one of the threads (the one who's thread id matches the baton) will broadcast. So now you're waking up all threads again while you're not done waking up the rest of the threads, and so on. That might grind the program to a halt fairly quickly if the broadcasts do not terminate when a new broadcast is done on the same condition variable.

Now I'm not sure if that's what happens. If it is, I suspect you'd see the process taking up a lot of CPU time, while if the problem is some kind of deadlock, you'd see the process taking essentially no CPU time at all.

  • Comment on Re^5: baton passing threads and cond_signal

Replies are listed 'Best First'.
Re^6: baton passing threads and cond_signal
by Anonymous Monk on Aug 22, 2007 at 15:50 UTC

    This is very definitely a deadlock problem rather than cascading broadcasts. First, the cpu usage drops to 0, but also, I've been testing with a version that passes the baton back and forth between just 2 threads using cond_signal, and it exhibits exactly the same behaviour. Without I inject a some extra context swaps with yield, it hangs almost immediately.

    Having added some very lightweight tracing, I can see that the problem is that sometimes the signal or broadcast is just missed.

    This seems to be a consequence of this from the pod.

    If there are no threads blocked in a cond_wait on the variable, the signal is discarded.

    If first thread signals, but gets interrupted before it can loop back to re-lock the variable and reenter the wait, then the second thread will wake up, do its thing and then signal when there is no thread waiting, so the signal is discarded. Now both threads enter cond_wait and wait for a signal that will never come.

    I do not see any way around this?

    On a multi-cpu system the threads are less likely to be interupted before they reach the wait so the problem doesn't occur.

    I suspect that without the yields, the code above would also lock up quite quickly if run on a single processor under Linux.

    It may happen less frequently as the mutex and semphore used by cond_wait will be manipulated by highly tuned kernel code within each call to the pthreads libraries, rather than mutliple calls to several different kernel routines as is the case of the windows emulations.

    Less frequently, but it still leaves a latent bug waiting to occur.

    The problem remains one of how do you ensure that at least one thread in the broadcast version, or the other thread in the signal version, always get back to a cond_wait state, before the next thread to be woken, signals?

      If first thread signals, but gets interrupted before it can loop back to re-lock the variable and reenter the wait

      If that's the problem (and I can't see what else it could be), then all you have to do is change

      while (1) { lock ($baton); cond_wait ($baton) until $baton == $id; ... cond_broadcast ($baton); }
      to
      lock ($baton); while (1) { cond_wait ($baton) until $baton == $id; ... cond_broadcast ($baton); }

      While you should do that change, I don't think that's the problem. The only time where missing the signal would cause a problem is if it's sent between the time $baton == $id is checked and the time cond_wait blocks. The purpose of locking $baton is to create the mutual exclusion that should prevent this from happening.

      The problem might be that your system's implementation of cond_wait isn't atomic (while it should be), allowing a signal to come in after cond_wait unlocks $baton, but before cond_wait starts waiting.

      You could give yourself a safety net by using cond_timedwaitcond_wait with a timeout — to check $baton periodically.

      Update: Did some repharsing and added the second last paragraph.

        If one thread maintains a persistant lock on the variable, how would the other thread ever obtain a lock? Without obtaining a lock, cond_wait doesn't work.