Actually, I forgot about the third-party SWIG module that is require'd in the main thread (conditionally, Win32::API for Windows). The worker threads inherit it and use it. That could very well be the culprit, although I'm not sure why it would change the refcount on my shared scalar. I could try taking the module out and seeing whether I can still reproduce the problem. It should run much faster without the module.