Re^5: baton passing threads and cond

The problem may be in differences in thread implementation on unix and windows. If, for example, cond_broadcast() "nests" in one implementation and not in the other, you may get cascading broadcasts. i.e.

Say you've got 10 threads waiting on the same condition like in your code. When you broadcast, you'll wake up all threads one after another. But in the mean time, one of the threads (the one who's thread id matches the baton) will broadcast. So now you're waking up all threads again while you're not done waking up the rest of the threads, and so on. That might grind the program to a halt fairly quickly if the broadcasts do not terminate when a new broadcast is done on the same condition variable.

Now I'm not sure if that's what happens. If it is, I suspect you'd see the process taking up a lot of CPU time, while if the problem is some kind of deadlock, you'd see the process taking essentially no CPU time at all.

"What should it profit a man, if he should win a flame war, yet lose his cool?"

Comment on Re^5: baton passing threads and cond_signal

Replies are listed 'Best First'.
Re^6: baton passing threads and cond_signal by Anonymous Monk on Aug 22, 2007 at 15:50 UTC
This is very definitely a deadlock problem rather than cascading broadcasts. First, the cpu usage drops to 0, but also, I've been testing with a version that passes the baton back and forth between just 2 threads using cond_signal, and it exhibits exactly the same behaviour. Without I inject a some extra context swaps with yield, it hangs almost immediately. Having added some very lightweight tracing, I can see that the problem is that sometimes the signal or broadcast is just missed. This seems to be a consequence of this from the pod. If there are no threads blocked in a cond_wait on the variable, the signal is discarded. If first thread signals, but gets interrupted before it can loop back to re-lock the variable and reenter the wait, then the second thread will wake up, do its thing and then signal when there is no thread waiting, so the signal is discarded. Now both threads enter cond_wait and wait for a signal that will never come. I do not see any way around this? On a multi-cpu system the threads are less likely to be interupted before they reach the wait so the problem doesn't occur. I suspect that without the yields, the code above would also lock up quite quickly if run on a single processor under Linux. It may happen less frequently as the mutex and semphore used by cond_wait will be manipulated by highly tuned kernel code within each call to the pthreads libraries, rather than mutliple calls to several different kernel routines as is the case of the windows emulations. Less frequently, but it still leaves a latent bug waiting to occur. The problem remains one of how do you ensure that at least one thread in the broadcast version, or the other thread in the signal version, always get back to a cond_wait state, before the next thread to be woken, signals?	[reply]
Re^7: baton passing threads and cond_signal by ikegami (Patriarch) on Aug 22, 2007 at 16:08 UTC
If first thread signals, but gets interrupted before it can loop back to re-lock the variable and reenter the wait If that's the problem (and I can't see what else it could be), then all you have to do is change `while (1) { lock ($baton); cond_wait ($baton) until $baton == $id; ... cond_broadcast ($baton); }` [download] to `lock ($baton); while (1) { cond_wait ($baton) until $baton == $id; ... cond_broadcast ($baton); }` [download] While you should do that change, I don't think that's the problem. The only time where missing the signal would cause a problem is if it's sent between the time `$baton == $id` is checked and the time `cond_wait` blocks. The purpose of locking `$baton` is to create the mutual exclusion that should prevent this from happening. The problem might be that your system's implementation of `cond_wait` isn't atomic (while it should be), allowing a signal to come in after `cond_wait` unlocks `$baton`, but before `cond_wait` starts waiting. You could give yourself a safety net by using `cond_timedwait` — `cond_wait` with a timeout — to check `$baton` periodically. Update: Did some repharsing and added the second last paragraph.	[reply] [d/l] [select]
Re^8: baton passing threads and cond_signal by Anonymous Monk on Aug 22, 2007 at 16:25 UTC
If one thread maintains a persistant lock on the variable, how would the other thread ever obtain a lock? Without obtaining a lock, cond_wait doesn't work.	[reply]
Re^9: baton passing threads and cond_signal by ikegami (Patriarch) on Aug 22, 2007 at 16:34 UTC
Re^10: baton passing threads and cond_signal by Anonymous Monk on Aug 22, 2007 at 16:47 UTC
Some notes below your chosen depth have not been shown here