I think you've misinterpreted The Problem with Threads: "Nondeterminism should be explicitly and judiciously introduced where needed, rather than removed where not needed."
Actually, no I haven't. And your quote proves it.
The single biggest source of bugs, design failures, and unrealised performance is synchronisation and locking; i.e. attempts to impose determinism.
People insist on trying to enforce determinism: that when this thread has finished with this piece of data, that thread starts processing it. They'll use semaphores, locks, signals, or mutexes, and often some combination of them. And with them come all the nasties wrongly attributed to "threading": deadlocks, livelocks, and priority inversions; often transient, and always a nightmare to diagnose and fix.
And even when they get it right, all they've done is create a lock-stepped, and thus sequential, process. Almost, if not exactly, as if no threading had been done.
Except of course, they've carefully orchestrated that there be at least one, and often multiple, context swaps occurring between steps. And then they wonder why it's slowed everything down. By attempting to deterministically control the flow of data between threads, you're working against the system scheduler and guaranteeing that no thread ever uses its full timeslice. Indeed, many times you're actually forcing the scheduler to load a thread only to immediately remove it again because the other thread that you're trying to pass control to hasn't yet run.
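The lock-step effect is easy to demonstrate. In this Python sketch (a hypothetical two-thread hand-off built from events), the two threads are forced to alternate strictly: every single step requires the scheduler to park one thread and wake the other, and the net execution order is purely sequential:

```python
import threading

N = 10
trace = []
a_turn, b_turn = threading.Event(), threading.Event()
a_turn.set()   # thread A goes first

def stepper(name, my_turn, their_turn):
    for _ in range(N):
        my_turn.wait()       # block until the other thread hands over
        my_turn.clear()
        trace.append(name)   # "do the work" for this step
        their_turn.set()     # hand control back: forces a context switch

ta = threading.Thread(target=stepper, args=("A", a_turn, b_turn))
tb = threading.Thread(target=stepper, args=("B", b_turn, a_turn))
ta.start(); tb.start(); ta.join(); tb.join()
print("".join(trace))   # ABABABABABABABABABAB -- strictly alternating
```

Two threads, twenty units of work, and at least twenty forced hand-offs: all the overhead of threading with none of the concurrency.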
The archetypal example of this is the Global Interpreter Lock as seen in Python, Ruby et al.
Think of this like a hub airport would be if there was no buffer space -- aprons and gates. Planes would have to circle until one of the connecting flights arrived, then both land, exchange passengers; then both take off and circle again until another connecting flight was available. A stupid, but surprisingly accurate analogy.
And people forget -- and that includes academics who tend to test and benchmark their algorithms in isolation from other workloads -- that the scheduler isn't just scheduling the threads of their process, but the threads of every other process in the system. If you fail to utilise the full timeslice, the remainder doesn't automatically get transferred to another thread of your process; most times it is lost entirely, because the threads of another process -- often many other processes -- get scheduled before your process gets another chance. And even when your process does get another slot, it may go to the wrong thread, or even to the same thread that just abandoned its last one.
To make best use of threading, you need to embrace non-determinism: eschew locking, synchronisation, and the conditional checks associated with them -- unless they are absolutely required, which, with appropriately considered algorithms, is very rare -- and allow things to free-run.
By providing low-overhead, flexibly sized buffers (queues) between sequential sections of processing, you allow the well-tried and proven system scheduler -- with its multiple classes, dynamic priorities, dynamic timeslices et al. -- to give timeslices to the threads that need them, in a timely manner.
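A minimal Python sketch of that shape, using the standard queue.Queue as the buffer (names hypothetical): one producer free-runs into a bounded queue, several consumers free-run out of it, and the only "synchronisation" is the buffering itself:

```python
import queue
import threading

work, done = queue.Queue(maxsize=100), queue.Queue()
SENTINEL = None
CONSUMERS = 4

def producer():
    for i in range(1000):
        work.put(i)              # blocks only if the buffer is full
    for _ in range(CONSUMERS):
        work.put(SENTINEL)       # one shutdown marker per consumer

def consumer():
    while (item := work.get()) is not SENTINEL:
        done.put(item * item)    # next pipeline stage; no explicit locks

threads = [threading.Thread(target=producer)] + \
          [threading.Thread(target=consumer) for _ in range(CONSUMERS)]
for t in threads: t.start()
for t in threads: t.join()
print(done.qsize())   # 1000
```

No thread ever waits for a *specific* other thread; each runs flat out whenever the scheduler gives it a slice and there is work in its input buffer. The bounded queue is the apron-and-gate buffer space from the airport analogy.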
Imagine the complexity of the locking required by a shared hash accessed concurrently by multiple reader and writer threads. And imagine trying to test it; and then debug it.
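Even the trivially correct version -- one coarse lock serialising every access, sketched below in Python with hypothetical names -- shows the cost: every reader and writer funnels through a single gate. Anything finer-grained (per-bucket locks, reader/writer locks, resize-while-writing) is where the real complexity, and the testing nightmare, begins:

```python
import threading

shared = {}
lock = threading.Lock()   # one coarse lock: easy to get right, but it
                          # serialises every reader and writer through
                          # a single gate -- no concurrency at all

def writer(n):
    for i in range(1000):
        with lock:
            shared[(n, i)] = i   # correct, but fully serialised

threads = [threading.Thread(target=writer, args=(n,)) for n in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(shared))   # 4000
```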
Then, if you are really interested, go off and watch Cliff Click describe his lock-free hash table. Watch it all. See how he scales a hash table to be used concurrently from thousands of threads. See also the discussion of the bottlenecks with the standard Java ConcurrentHashMap.
Bottom line: everything in that paper (The Problem with Threads) tackles the "problem" from the completely wrong direction. Had it not been dated 2006, I would have assumed it had been written circa 1995 or earlier, such is the level of dated thinking it employs.
Are you really saying that, when solving any concurrent problem correctly (data-wise), Futures are never (or seldom) simpler for the programmers writing, reading, and modifying the code than directly using the underlying low-level concurrency constructs?
No. I like Futures a lot. They are the best (cleanest; easiest to grasp) high-level threading construct I've yet seen. For a certain class of problems, a well-implemented Future -- and that means NOT the obvious, naive implementation -- can be a very simple and effective solution.
But I am saying that they move the problems somewhere else.
In effect, all (naive) Futures do is wrap over (hide) the mechanics of ThreadCreate() and ThreadJoin(), and all their problems persist.
When a Future comes to be realised -- when the thread comes to be joined -- if that thread is not yet finished, the caller will block. And as soon as you have blocking, you have the potential for deadlocking, livelocking, and priority inversion. Except now the locking is hidden from, and inaccessible to, the programmer.
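Python's concurrent.futures makes the hidden join visible. In this sketch (task body and timings hypothetical), result() simply blocks until the worker finishes, exactly as an explicit ThreadJoin() would -- the Future has hidden the mechanics, not removed them:

```python
import concurrent.futures
import time

def slow_task():
    time.sleep(0.2)           # stand-in for real work still in flight
    return 42

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(slow_task)
    t0 = time.monotonic()
    value = fut.result()      # blocks here: the hidden ThreadJoin()
    waited = time.monotonic() - t0

print(value)                  # 42, but only after blocking for the
                              # remainder of the worker's runtime
```

If slow_task were itself waiting on a result produced after this blocking point, you would have a deadlock -- with no lock visible anywhere in the calling code.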
And what's worse, the very simplicity of Futures is -- as happened with the synchronized keyword in Java -- going to encourage people to throw threads at everything without understanding or thinking through the consequences. The very act of providing a simple, high-level construct that hides much of the mechanics of threading and locking will encourage it to be used widely and inappropriately.
It is possible to implement the hidden lock such that it detects deadlocks and produces diagnostics, but in every implementation I have seen, the runtime penalties are such that, for any performance-critical application -- and why else are you using threading? -- they add up to the point where programmers need to revert to the low-level constructs in order to regain the control required to realise the performance they are seeking.
And finally, there are a whole raft of classes of algorithm for which (naive) Futures are not just inappropriate, but completely unusable. For some of these classes, a sophisticated implementation can mitigate that and render them usable; but that is still the subject of research.
The bottom line is that implementing Futures so that they are reliable, easy to use, and sophisticated enough for a wide range of problems is extremely hard. And testing them is even harder. If P6 shipped -- should that mythical event ever occur -- with a naive implementation, or a poorly tested sophisticated one, the damage to its reputation would be huge.
Now think how long (and how many rewrites) it has taken for Java threading to approach some level of real usability.