I'm doing a walk on each device, for example to get the interface information, on over 1000 devices at the same time (it depends on what the engineer needs). I dropped the threads implementation because of the huge overhead, and I'm currently using fork to start about 10 processes, each of which opens a session with a device, but it's still not fast enough. I'd appreciate it if you could go into more detail about your implementation. Thanks!
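A minimal sketch of the fork-based setup described above, written in Python only as an assumption (the comment doesn't say which language or SNMP library is in use): a pool of 10 forked worker processes that each take one device at a time. `walk_interfaces`, `walk_all`, and the device addresses are hypothetical placeholders, and the stub only sleeps to mimic network latency instead of doing a real SNMP walk.

```python
import multiprocessing as mp
import time

NUM_WORKERS = 10  # roughly the ~10 forked processes mentioned above


def walk_interfaces(host):
    """Hypothetical placeholder for one SNMP walk against `host`.

    A real version would open a session with your SNMP library and walk the
    interface table; this stub just sleeps to stand in for network latency.
    """
    time.sleep(0.01)
    return host, {"status": "ok"}


def walk_all(hosts, workers=NUM_WORKERS):
    # Request the "fork" start method explicitly (POSIX only) so every worker
    # is a forked copy of the parent process.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=workers) as pool:
        # imap_unordered with chunksize=1 hands out one host at a time, so a
        # slow device only delays the worker that drew it, not a fixed slice.
        return dict(pool.imap_unordered(walk_interfaces, hosts, chunksize=1))


if __name__ == "__main__":
    devices = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
    start = time.perf_counter()
    results = walk_all(devices)
    elapsed = time.perf_counter() - start
    print(len(results), "devices walked in", round(elapsed, 2), "s")
```

Because each walk spends most of its time waiting on the network, wall time in this model is roughly (devices / workers) × per-device latency, so with only 10 workers each process still walks about 100 devices one after another.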