sierpinski has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am working (still) on using the Net::SSH::Expect module to connect to a list of servers, gather some information, and then exit, eventually sending a report via email. Part of this report is a list of servers that could not be reached via ssh. The problem: several servers that end up on that list can be connected to manually from the same host, as the same user (ssh keys installed -- no password necessary), and the subset of failing servers changes every time I run the script. I've experimented with the Expect object's timeout option as well as the ssh timeout, and that has shrunk the list (fewer servers fall into this category), but I'm still seeing some failures and I'm trying to figure out why.

Here is the object I'm creating:

my $ssh = Net::SSH::Expect->new(
    binary     => "/usr/local/bin/ssh",
    host       => $serverlist[$host],
    user       => $user,
    raw_pty    => 1,
    log_file   => "/tmp/$serverlist[$host].log",
    timeout    => 4,
    ssh_option => "-o ConnectTimeout=8",
);
Now I am running this from a Solaris 10 x86 host, so the default ssh was Sun's own brand. I installed the OpenSSH package from sunfreeware.com so I could utilize the ConnectTimeout option (hence the specified binary). I seem to have better results with this, but as I mentioned before, I'm still not getting all of the servers I should be able to connect to. I feel it's an issue of timeouts, but I'm not sure where the breakdown is.

The output that I have is too extensive to post here, but basically I have a list of 460 servers. I should be able to connect to about 440 of them, yet I consistently connect to 420-430 of them, with that list of 10-20 servers changing each time. A manual connection always succeeds.

Thanks for any help you can provide.

Update:

I should mention that I am using Parallel::ForkManager to process these, so it is done much faster. When I do this with max_procs = 1, I have no issues. I'm not aware of a maximum limit on outgoing ssh connections (I know incoming there is sometimes a limit of 10 unauthenticated ssh connections on Solaris), so I'm still not sure where the problem is. I've tried with max_procs = 5, all the way up to 20, and the problem still exists in all cases.
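Since the failures are transient (a manual connection always succeeds), one way to narrow this down is to retry a failed connect once or twice with a short pause before declaring the server unreachable. Here is a minimal sketch of such a retry wrapper -- the `try_connect` name and the stubbed connect step are my own invention for illustration, not part of Net::SSH::Expect or Parallel::ForkManager; in the real script the coderef would wrap `$ssh->connect()`:

```perl
use strict;
use warnings;

# Retry a connection attempt up to $max_tries times, pausing $delay
# seconds between attempts, and return the first defined result
# (or undef if every attempt failed).
sub try_connect {
    my ($connect, $max_tries, $delay) = @_;
    for my $attempt (1 .. $max_tries) {
        my $result = eval { $connect->() };   # connect step dies on failure
        return $result if defined $result;
        sleep $delay if $attempt < $max_tries;
    }
    return undef;
}

# Demonstration with a stub that fails twice, then succeeds.
my $calls = 0;
my $stub  = sub { die "timeout\n" if ++$calls < 3; return "connected" };
my $out   = try_connect($stub, 5, 0);
print "$out after $calls attempts\n";   # connected after 3 attempts
```

If the retry makes the failures disappear, that would point at load-dependent connection setup time rather than anything wrong with the per-server logic.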

Replies are listed 'Best First'.
Re: Connection Failing for unknown reason
by laminee (Novice) on Aug 06, 2009 at 20:04 UTC

    This may not be exactly related to your problem, but I've recently had some experience working with Net::SSH::Expect. I wanted to execute programs (which can be interactive) on multiple hosts via SSH and capture the output from each of them, and I used Net::SSH::Expect for this.

    In the course of that I noticed some serious issues with how Net::SSH::Expect::read_all() decides that a command has terminated. It treats a timeout's worth of inactivity on the input stream -- the command's STDOUT -- as termination, which can be highly misleading. I had to tweak some of Net::SSH::Expect's code to work around this.
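    This failure mode can be illustrated without SSH at all: any reader that treats N seconds of silence as end-of-output will truncate a command that simply pauses longer than N between lines. A self-contained sketch of that logic follows -- the function name and the timings are invented for illustration, not taken from the module:

```perl
use strict;
use warnings;

# Simulated command output: each entry is [seconds_of_silence, line].
# The 3-second gap before "line 3" mimics a command doing slow work.
my @stream = ([0, "line 1"], [0, "line 2"], [3, "line 3"]);

# Reader in the style of read_all(): any silence longer than $timeout
# is taken to mean the command has terminated.
sub read_with_inactivity_timeout {
    my ($stream, $timeout) = @_;
    my @got;
    for my $chunk (@$stream) {
        my ($silence, $line) = @$chunk;
        last if $silence > $timeout;   # silence mistaken for termination
        push @got, $line;
    }
    return \@got;
}

my $short = read_with_inactivity_timeout(\@stream, 2);
my $long  = read_with_inactivity_timeout(\@stream, 5);
print scalar(@$short), " vs ", scalar(@$long), " lines\n";   # 2 vs 3 lines
```

    With a 2-second timeout the reader stops after the pause and silently drops the last line; only a timeout longer than the command's worst-case gap captures everything.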

      What did you end up tweaking? You don't have to go into great detail -- just a general idea. Was it basically moving from a fixed length of time to some other notion of inactivity?

      Thanks for the response!

        It may not be a very wise solution, but this is what I did: when there has been <timeout> seconds of inactivity, establish a duplicate connection and check in the process table whether the command in question is still executing. If it is, do not return from read_all() -- my thinking being that you have not really read all.

        I am also passing an extra optional max-timeout parameter to Net::SSH::Expect::exec(), which acts as a hard upper bound on the loop in read_all() in case the command hangs or something.

        Update: The optional max-timeout parameter serves another purpose: it circumvents a flaw in the inactivity logic of read_all(). What if a command keeps printing to its STDOUT indefinitely? The default read_all() would never return in such a case.
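        The max-timeout idea can be sketched as an overall deadline wrapped around the read loop, so that even an endlessly chatty command cannot keep the loop alive forever. This is my own illustrative sketch, not the actual patched read_all(), and the names are made up:

```perl
use strict;
use warnings;

# Read loop with an overall max-timeout deadline: even if chunks keep
# arriving (so the inactivity timeout never fires), the loop still
# returns once $max_timeout seconds have elapsed.
sub read_with_deadline {
    my ($next_chunk, $max_timeout) = @_;
    my $deadline = time() + $max_timeout;
    my @got;
    while (time() < $deadline) {
        my $chunk = $next_chunk->();
        last unless defined $chunk;
        push @got, $chunk;
    }
    return \@got;
}

# A producer that never stops printing; each call takes ~10ms, so
# without the deadline this loop would run forever.
my $n = 0;
my $chatty = sub { $n++; select(undef, undef, undef, 0.01); "chunk $n" };
my $out = read_with_deadline($chatty, 1);
print "stopped after ", scalar(@$out), " chunks\n";
```

        The per-read inactivity timeout still handles the normal "command finished" case; the deadline only exists as a backstop for hung or endlessly-printing commands.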