I was talking about money, actually: a 256-core system will be very expensive.
I'm not sure about the relevance of that?
When it is affordable--and it's already available: the IBM Power 795 can have 256 cores (with 4 hardware threads per core), giving a 1024-thread processor in a box--a threaded solution will port to it with the change of one number.
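By "one number" I mean something like the pool size in a minimal boss/worker sketch (the queue contents and per-item work are placeholders):

```perl
#!/usr/bin/perl
use strict; use warnings;
use threads;
use Thread::Queue;

# The "one number": 4 on a commodity quad-core today;
# 1024 on a Power 795 class box.
my $THREADS = 4;

my $Q = Thread::Queue->new( 1 .. 1_000 );
$Q->enqueue( ( undef ) x $THREADS );    # one terminator per worker

my @workers = map threads->create( sub {
    while( defined( my $item = $Q->dequeue ) ) {
        # ... real per-item work goes here ...
    }
} ), 1 .. $THREADS;

$_->join for @workers;
```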
Two years or so ago, a 4-core machine was horribly expensive. I now have one sitting in front of me that cost about the same as a high-end smartphone!
Next year will see the release of 16-core commodity processors, some with 2-way hyperthreading. When my next "refresh" cycle comes around in 2012, I'll be looking for 16 cores at a dirt-cheap price. I'll probably have to wait until towards the end of the year, rather than the beginning. I'll also be looking to have a 512- or 1024-core GPU in the same box for the same price.
I know POE can span clusters, but clusters don't make sense, other than as a stop-gap solution until a multi-core machine fitting your aspirations or price point becomes available.
As for why: there is a wonderful example of the problem cited in an article I read just today. The fourth paragraph (starting "For example") is the crux of the cluster problem (albeit that example concerns GPUs).
Just as scaling with processes tops out very quickly because of the costs of IPC, so clustering tops out even more quickly because of the even higher costs of inter-box network comms (INC). And as processors become ever faster at processing a given volume of data, the ratio of non-productive IPC and INC to useful CPU work grows.
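A back-of-envelope model makes the point (the numbers are purely illustrative; comms cost is held fixed while compute cost shrinks):

```perl
#!/usr/bin/perl
use strict; use warnings;

# Toy model: each work unit costs $compute seconds of CPU and $comms
# seconds of IPC/INC overhead. Faster CPUs shrink $compute, but $comms
# (dominated by wire latency and bandwidth) stays put.
my $comms = 0.05;    # illustrative fixed per-unit comms cost
for my $compute ( 1, 0.1, 0.01 ) {
    printf "compute=%-5g  comms/compute=%-4g  useful fraction=%.1f%%\n",
        $compute, $comms / $compute, 100 * $compute / ( $compute + $comms );
}
# compute=1      comms/compute=0.05  useful fraction=95.2%
# compute=0.1    comms/compute=0.5   useful fraction=66.7%
# compute=0.01   comms/compute=5     useful fraction=16.7%
```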
And the tests are not correct, BTW. The time required to create a new thread depends on the size of all existing variables.
Yes. That is an annoying detail of the ithreads implementation. But it is quite easy to avoid: you just spawn your workers early, and have them require rather than use what they (individually) need.
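For example (a minimal sketch; the specific modules are stand-ins for whatever your code actually loads):

```perl
#!/usr/bin/perl
use strict; use warnings;
use threads;

# Spawn the workers first, while the interpreter is still small;
# each thread clones only what has been loaded so far.
my @workers = map threads->create( \&worker ), 1 .. 4;

# Anything the main thread alone needs is loaded *after* spawning,
# so it never gets cloned into the workers.
require Data::Dumper;

$_->join for @workers;

sub worker {
    # require (runtime) rather than use (compile-time, pre-spawn),
    # so each worker loads only what it individually needs.
    require List::Util;
    my $sum = List::Util::sum( 1 .. 100 );
    print "thread @{[ threads->tid ]} sum: $sum\n";
}
```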
But forks face a similar problem--indeed, the copy problem exists in ithreads specifically because of the attempt to make threading look like fork. If you need to share data between forks, there is still a "duplication penalty", although it is disguised by coming at use-time rather than spawn-time.
COW may appear to avoid the need to duplicate data memory, but it just means the data gets duplicated piecemeal on use, rather than in one lump up front. Even if data is "read-only" in the sense of your program, the memory holding it is often modified by apparently read-only accesses.
Use $#array and a 4k page of COW'd memory gets copied. Use a regex on a string that alters its pos, and another 4k page gets copied--even if the string is only 4 characters long. Use a single scalar, instantiated as a number, in a string context, and another 4k page gets copied. Use each, keys, or values on a hash, and (at least) another 4k gets copied.
These piecemeal requests for another single page of VM, each followed by a 4k copy through ring 0, add up to far more than a single up-front request for as many pages of VM as are required, followed by a single ring-3 rep movs.
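On Linux you can watch the pages go private (a rough sketch, assuming a kernel recent enough to provide /proc/PID/smaps_rollup; the exact numbers will vary and include some noise from the measurement itself):

```perl
#!/usr/bin/perl
use strict; use warnings;

# Report how many kB of this process's pages are private (i.e. have
# been dirtied and so are no longer COW-shared with the parent).
sub dirty_kb {
    open my $fh, '<', "/proc/$$/smaps_rollup" or die $!;
    while( <$fh> ) { return $1 if /^Private_Dirty:\s+(\d+) kB/ }
    return 0;
}

my @array = ( 1 ) x 100_000;
my %hash  = map { $_ => 1 } 1 .. 10_000;
my $num   = 42;

my $pid = fork // die "fork: $!";
if( !$pid ) {   # child: all parent data is COW-shared... until touched
    my $before = dirty_kb();
    my $n   = $#array;    # "read-only" array length access
    my $str = "$num";     # numeric scalar used in string context
    my @k   = keys %hash; # hash iterator access
    my $after  = dirty_kb();
    print "child dirtied a further @{[ $after - $before ]} kB\n";
    exit 0;
}
waitpid $pid, 0;
```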
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.