I am guessing that you have more than enough experience to understand how difficult it can be to find intermittent bugs. This is especially true if your own code isn't the sole issue. Over the last 30 years, I have worked on more hardware, OSes, and development languages and frameworks than I care to remember. Without exception, they all from time to time exhibited behaviors that could not readily, or even after a lot of work, be explained.
At some point, I, as a human, will throw in the towel on getting to the bottom of a problem if I can take a path around it or mitigate it by some means. I have decided that I will not (not "will not bother to") expend any additional hours of my life, which I cannot get back, to research, test, and identify the root cause of my issue.
Then surely you know how to fix this problem.
Use a C debugger, attach to the OS thread, look at the call stack, and fix it. You probably need a DEBUGGING build of perl so your C debugger works more cleanly. No OS thread can "freeze" itself or deschedule itself without help from the OS. If the thread is busy-waiting and sucking CPU, ps/top/Task Manager will tell you. If your threads are "disappearing" without a trace (I can't tell if you mean they disappear from the OS, or they simply freeze indefinitely), chances are very high you're leaking resources and RAM too. Since you use Unix, have you tried setting signal handlers from Perl for SEGV/BUS/FPE/ABRT and the other CPU-exception signals, to see if your thread is throwing one of them?
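To make that last suggestion concrete, here is a minimal sketch of trapping those signals from Perl. The helper names `fault_report` and `install_fault_handlers` are made up for illustration, and logging from a handler after a genuine SEGV is best effort only, since the process may already be in a corrupt state.

```perl
use strict;
use warnings;
use Carp ();
use POSIX ();

# Signals that usually point at a fault in C/XS code rather than in Perl code.
my @FAULT_SIGNALS = qw(SEGV BUS FPE ABRT ILL);

# Build the text to log when a fault signal arrives (hypothetical helper).
sub fault_report {
    my ($signame) = @_;
    # Only ask for a thread id if the program has loaded threads.pm.
    my $tid = $INC{'threads.pm'} ? threads->tid() : 0;
    return Carp::longmess("thread $tid caught SIG$signame");
}

# Install one handler per fault signal; $on_fault receives the report text
# (hypothetical helper).
sub install_fault_handlers {
    my ($on_fault) = @_;
    for my $name (@FAULT_SIGNALS) {
        $SIG{$name} = sub { $on_fault->(fault_report($name)) };
    }
    return;
}

# Log the report and leave immediately. Returning from the handler after a
# real SEGV would normally just re-run the faulting instruction.
install_fault_handlers(sub {
    my ($report) = @_;
    print STDERR $report;
    POSIX::_exit(1);
});
```

If a message like "thread 3 caught SIGSEGV" ever shows up in the log, the thread is not vanishing silently, and the Perl-level stack trace tells you which call into XS to look at with the C debugger.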
You haven't shown any code, or explained which specific CPAN modules, C libraries, and XS modules you are using. I'll make a wild-ass guess and say you probably have what is called a race condition in a 3rd-party C library (an access violation or a thread sync/mutex deadlock), or you are doing network I/O with no timeout.
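On the network I/O guess: a blocking read with no timeout will sit there forever if the peer goes quiet, which looks exactly like a frozen thread. A minimal sketch of a bounded read using the core IO::Select module follows; `read_with_timeout` is a made-up helper name, and I'm assuming a plain socket handle.

```perl
use strict;
use warnings;
use IO::Select;

# Read up to $max bytes from $fh, giving up after $timeout seconds.
# Returns the data, '' on EOF, or undef (in scalar context) on timeout.
# Hypothetical helper, not part of any module.
sub read_with_timeout {
    my ($fh, $max, $timeout) = @_;
    my $sel = IO::Select->new($fh);

    # can_read returns an empty list if nothing became readable in time.
    return unless $sel->can_read($timeout);

    my $n = sysread($fh, my $buf, $max);
    die "sysread failed: $!" unless defined $n;
    return $buf;
}
```

A worker thread that gets undef back can log the stall and move on instead of hanging. For the connect side, IO::Socket::INET accepts a Timeout option in its constructor.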
I'm not sure exactly how process/user resource limits work on Linux (I'm a Win32 person). I've read that the Linux kernel doesn't know what threads are; each thread is a separate process in the same memory sandbox. So maybe one of your OS threads/ithreads hit the OS per-thread resource limit and that's why it disappears (if disappearing is what happens on Linux when one thread in a multithreaded process hits a per-thread resource limit).
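If you want to rule limits in or out, Linux exposes the limits of a running process as text in /proc/<pid>/limits. Here is a rough sketch that reads them from inside the program; `read_limits` is a made-up helper, and the parsing assumes the usual column layout of that file (name, soft limit, hard limit, units).

```perl
use strict;
use warnings;

# Parse a Linux limits file into { name => { soft => ..., hard => ... } }.
# Hypothetical helper; defaults to the current process.
sub read_limits {
    my ($path) = @_;
    $path = '/proc/self/limits' unless defined $path;

    open(my $fh, '<', $path) or die "cannot open $path: $!";
    my %limits;
    while (my $line = <$fh>) {
        # Rows look like: "Max open files   1024   1048576   files".
        # The header row starts with "Limit" and is skipped.
        next unless $line =~ /^(Max .*?\S)\s{2,}(\S+)\s+(\S+)/;
        $limits{$1} = { soft => $2, hard => $3 };
    }
    close $fh;
    return \%limits;
}
```

Dumping the result at startup and comparing something like "Max open files" against what the program actually has open when a thread goes missing would at least tell you whether a limit is in play.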
If you showed code, I bet BrowserUk could track it down, as he claims in Re^5: Thread::Pool::Simple || !.