Hmm. Having all builtin Array/Hash/Scalar structures be implicitly transactional and locklessly shared, and then allowing explicit non-shared cloned state and channels on top of that, seems to me easier to scale and reason about than the other way around.
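To make the layering a bit more concrete, here is a rough sketch in Python rather than Perl. The names `TVar`, `try_commit` and `atomically` are made up for illustration and are not any existing API, and the internal lock only stands in for the atomic compare-and-swap a truly lockless implementation would use. The shared cell is transactional by default; the cloned state and the channel are the explicit extras built on top.

```python
import copy
import queue
import threading


class TVar:
    """Hypothetical transactional cell (name and API are assumptions)."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        # Stand-in for an atomic compare-and-swap; a real lockless
        # implementation would not take a lock here.
        self._cas = threading.Lock()

    def read(self):
        with self._cas:
            return self._version, self._value

    def try_commit(self, seen_version, new_value):
        with self._cas:
            if self._version != seen_version:
                return False  # someone else committed first
            self._value = new_value
            self._version += 1
            return True


def atomically(tvar, update):
    """Retry a pure update until it commits against an unchanged version."""
    while True:
        version, value = tvar.read()
        if tvar.try_commit(version, update(value)):
            return


# Shared by default: four threads bump one counter without explicit locking.
counter = TVar(0)

def bump():
    for _ in range(1000):
        atomically(counter, lambda n: n + 1)

workers = [threading.Thread(target=bump) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
total = counter.read()[1]

# Explicit non-shared state: a clone that no other thread can see.
shared_items = TVar([1, 2, 3])
private_items = copy.deepcopy(shared_items.read()[1])
private_items.append(4)

# Explicit channel layered on top: hand the private copy to a consumer.
channel = queue.Queue()
received = []
consumer = threading.Thread(target=lambda: received.append(channel.get()))
consumer.start()
channel.put(private_items)
consumer.join()
```

Each increment either commits against the version it read or is retried, so no update is lost, and appending to the clone leaves the shared list untouched.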
Also, native OS threads still live in a single process on Unix, and Perl 5 does use a 1:1 mapping from Perl threads to native OS threads on Unix wherever pthreads is available (see thread.h). That mapping is also expensive, since a 1:1 mapping is only necessary if you make blocking system calls. So I'm not sure your first point (that the entire program uses only one pthread) holds...
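The "many native threads, one process" part is easy to check empirically. The sketch below uses Python instead of Perl 5, purely as an analogous 1:1 runtime (CPython also backs each thread with an OS thread); it says nothing about Perl's internals. It assumes Python 3.8+ for `threading.get_native_id()`, and the `report` helper is a name invented for this example.

```python
import os
import threading

THREAD_COUNT = 3

# Keep every thread alive until all have reported, so the OS cannot
# recycle a native thread ID while we are still collecting them.
barrier = threading.Barrier(THREAD_COUNT)
results = [None] * THREAD_COUNT


def report(index):
    """Record this thread's process ID and native (kernel) thread ID."""
    results[index] = (os.getpid(), threading.get_native_id())
    barrier.wait()


threads = [threading.Thread(target=report, args=(i,)) for i in range(THREAD_COUNT)]
for t in threads:
    t.start()
for t in threads:
    t.join()

pids = {pid for pid, _ in results}
native_ids = {tid for _, tid in results}
main_native_id = threading.get_native_id()
```

All threads report the same process ID, while each one has its own native thread ID, distinct from the main thread's.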