Add to that the fault I'm chasing is inetermittant
From experience, using a debugger is very unlikely to allow you to track down intermittent threading faults anyway. The very act of running code under debugger control inevitably changes the dynamics.
If you are exceptionally lucky, the change will make the problem occur more frequently and reliably, but in 20+ years of working with threaded code that happened exactly once. On every other occasion, even semi-reliable bugs would fail to manifest themselves at all under the debugger and reappear as soon as it was taken out of the picture.
In my experience, the first thing to do with intermediate bugs is make them reproducible. That means running the code in a controlled (repeatable) way in a production-like environment until the problem is reliably reproducible.
The next thing to do, is track down what is going on, and where, when the problem occurs. And that always means adding wide-spread, low-granularity, low-overhead logging.
Wide-spread means in all threads and functions.
Don't make the mistake of guessing where the problem might be and concentrating there. You're usually wrong!
Low granularity means don't try and log everything.
There is no point in logging huge volumes of unrelated details. It just gives more crap to wade through and more importantly, can change the dynamics. That can cause the problem to move or disappear completely.
Low overhead means avoid anything that might cause memory allocation or CRT locks to be used.
For example, printf (the C version) will usually need to allocate some temporary space for the formatting. That can cause your bug to 'move'.
Equally, threadsafe-CRTs often employ internal locking when performing IO (writing to files etc.). Again, that can cause the bug to move or disappear entirely -- until the logging is removed!
Often the best logging mechanism is a very simple, unformatted output (ex.just the current thread and line number) using the UDP sendto() function to a local port. On the other end of that port you have a program that simply listens to the port and logs the data to disk (preferably one not used by the monitored program; a USB thumb-drive is ideal!), in a tight loop with no attempt at interpreting it.
This logging can be added to the code at (say) the entry and exit points of the major functions, with little or no impact on the performance or dynamics of the code being monitored. Once the error occurs, you can inspect that log to work out where each thread was when it occurred. You can then remove most of the logging and increase the granularity within those functions active when the bug manifests. Re-run and gradually 'zoom in' on the specific circumstances that cause the bug to arise.
It may sound somewhat crude and slow, but with a little practice (and some well-crafted macros if you are using C), it is very effective. Once you know where each thread is (and therefore what it is doing) when the bug occurs, it is usually obvious where the problem lies.
If your code is not proprietary, I'd be willing to take a look. No promises -- I probably couldn't run it here -- but I might spot something.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.