I don't think that threads scale well compared to declarative parallelism.
In my opinion, threads are to declarative parallelism roughly what manual memory allocation is to garbage collection.
You can make manual allocation faster and more efficient, but it costs extra effort for every piece of code you write. With garbage collection, it just works.
In a nutshell, using a bunch of threads to gain parallelism is wasteful overhead on a single CPU, and actually holds you back when you have more CPUs than threads.
A 'single core' Cell CPU actually has nine CPU cores, and the Cell roadmap for the future goes up to sixty-four 'single cores' on a single die. How can you use any kind of explicit threading to deal with 576 cores (64 x 9) on a single die? What if it's actually a four-socket motherboard with 2304 cores?
Declarative parallelism isn't threads, and it's not coroutines, but it does let your code take advantage of multiple cores to the limit of your algorithm.
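To make that concrete, here is a minimal sketch of the idea, not a definitive implementation. The post does not name a language or library, so Python's standard `concurrent.futures` is used only as a stand-in, and the function names `count_primes`, `chunks`, and `total_primes` are hypothetical. The code declares which pieces of work are independent and never says how many workers exist.

```python
import math
from concurrent.futures import Executor


def count_primes(bounds):
    """Count primes in the half-open range [lo, hi). Pure: no shared state."""
    lo, hi = bounds
    return sum(
        1
        for n in range(max(lo, 2), hi)
        if all(n % d for d in range(2, math.isqrt(n) + 1))
    )


def chunks(limit, size):
    """Split [0, limit) into independent (lo, hi) ranges."""
    return [(lo, min(lo + size, limit)) for lo in range(0, limit, size)]


def total_primes(limit, size, executor: Executor):
    # Declares *what* is independent (one count per chunk). The executor
    # decides how many workers to use and where each chunk runs.
    return sum(executor.map(count_primes, chunks(limit, size)))
```

For example, `total_primes(200_000, 10_000, ProcessPoolExecutor())` hands the chunks to a process pool which, given no `max_workers` argument, sizes itself from the machine's processor count. The same call should then use two cores or two hundred without any change to the code that describes the work.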
More detail is in the thread "A Fundamental Turn Towards Concurrency" on lambda-the-ultimate.org.
--
Shae Erisson