With that many parallel processes all competing for network access, I have run into a limit. I don't think it is host memory or the number of processes, but something deeper in the network drivers on the host. You can see a similar effect by, for instance, pinging many hosts in parallel -- above a certain number of hosts, network response becomes horribly sluggish.
On Ethernet, heavy congestion results in many retries, each retry picking a random wait time from an ever-larger window (see the Wikipedia article on exponential backoff). So 200 parallel processes would produce many collisions, and some small fraction would end up with the maximum backoff time. At some point normal ssh connections time out due to lack of activity and drop.
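The same backoff idea can be applied at the application level. Below is a minimal Perl sketch of retrying with an exponentially growing, randomized wait; the try_connect() stub, the cap of six doublings, and the one-second base delay are all illustrative assumptions, not anything from this thread.

    use strict;
    use warnings;

    # Stand-in for the real connection attempt (hypothetical);
    # should return true on success.
    sub try_connect {
        my ($host) = @_;
        return 0;
    }

    sub connect_with_backoff {
        my ($host, $max_tries) = @_;
        for my $try (0 .. $max_tries - 1) {
            return 1 if try_connect($host);
            # Random wait drawn from a window that doubles each retry,
            # capped at 2**6 = 64 seconds.
            my $window = 2 ** ($try < 6 ? $try : 6);
            sleep 1 + int rand $window;
        }
        return 0;    # gave up
    }

    connect_with_backoff('host1.example.com', 10) or warn "never got through\n";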
I had exactly this problem with a little script I wrote years ago, before I knew about Parallel::ForkManager and the like. At the time it didn't matter that I missed some of the responses; it wasn't for any automated system, just me poking around for a remote host that met certain conditions. (See the module's documentation for how to limit the number of parallel processes.)
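For reference, capping the number of children with Parallel::ForkManager looks roughly like this; the limit of 20 and the example host list are placeholders:

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @hosts = map { "host$_" } 1 .. 200;       # example host list
    my $pm    = Parallel::ForkManager->new(20);  # at most 20 children at once

    for my $host (@hosts) {
        $pm->start and next;     # parent: spawn a child, move to the next host
        # ... child: do the ssh/ping work for $host here ...
        $pm->finish;             # child exits
    }
    $pm->wait_all_children;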
-QM
--
Quantum Mechanics: The dreams stuff is made of
Hello all,
Thanks for your feedback. I tried to limit the number of processes running at the same time by adding a delay in the loop before spawning each new process. That way fewer processes run in parallel, since some finish before the later ones start.
It made my script a bit slower, but I had zero failures.
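That kind of throttling can be as simple as a sleep before each fork. A rough sketch of the approach described above, assuming a plain fork-per-host loop and a one-second stagger (not the actual script):

    use strict;
    use warnings;

    my @hosts = map { "host$_" } 1 .. 200;   # example host list

    for my $host (@hosts) {
        sleep 1;                             # stagger the starts
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        next if $pid;                        # parent: go spawn the next one
        # ... child: open the ssh session to $host and do the work ...
        exit 0;
    }
    1 while wait != -1;                      # reap all the children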
You can also go for Net::OpenSSH::Parallel, which handles most of the issues you are facing on its own.
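As a rough illustration (the host list and the uptime command are made up), its synopsis-style usage is something like:

    use strict;
    use warnings;
    use Net::OpenSSH::Parallel;

    my @hosts = map { "host$_" } 1 .. 200;   # example host list

    my $pssh = Net::OpenSSH::Parallel->new;
    $pssh->add_host($_) for @hosts;

    $pssh->push('*', command => 'uptime');   # queue a command on every host
    $pssh->run;                              # run the queued actions; see the
                                             # docs for options that cap the
                                             # number of simultaneous connections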
OK,
So I have changed the timer from 240 to 340. The script now succeeds on many more nodes. However, I now get a lot of these errors:
SSHProcessError The ssh process was terminated. at diameter_Status_Script.pl line 123.
That line is:
$ssh->waitfor("#", 240);
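One way to keep a single slow or dead node from killing the whole run is to wrap the wait in an eval and handle the failure yourself. A rough sketch, assuming Net::SSH::Expect, a prompt ending in '#', and made-up host, credentials and timeouts:

    use strict;
    use warnings;
    use Net::SSH::Expect;

    # Hypothetical connection setup; host, user and password are placeholders.
    my $ssh = Net::SSH::Expect->new(
        host     => 'node1.example.com',
        user     => 'someuser',
        password => 'secret',
        timeout  => 10,
    );
    $ssh->login();

    my $found = eval { $ssh->waitfor('#', 340) };
    if ($@) {
        # The underlying ssh process died; log it and move on
        # instead of letting the whole script die.
        warn "waitfor failed for this node: $@";
    }
    elsif (!$found) {
        warn "prompt not seen within 340 seconds\n";
    }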