RaduH has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, Continuing my work after I got good advice to my previous question (thanks quester), I've noticed a weird behavior I don't know how to explain. It doesn't bother me, but I'd like to understand it.

Short description: I am running a script on some HostA and this script resets HostB which I can control via ssh. To resume my interaction s with HostB I wanted to know when is HostB back online again. I followed the advice from quester and in a loop I try to connect to port 22 and read whatever the server listening there is about to say to me. Under normal circumstances, that server is the sshd and I should be reading a line beginning with "SSH". If this does not happen I go to sleep for 3 seconds and try again. I do this until I get my "SSH" line and then I go on with the rest of the script.

The thing is, the reboot procedure does take a while and in my try_to_connect-sleep loop I increment a counter just to see how many times I tried and just get some feedback that my script is still alive and probing the other host. the loop looks like this:

do { $connected = connect(SOCK, $paddr); if($connected) { $line = <SOCK>; print("$line online\n"); close(SOCK); $connected = 0; if ($line =~ /SSH/) {# connected to sshd - I'm ready $done = 1; } else {# connected to ... who the heck is listening on port 22? ...some code here } } else { $attempts++; print("Trying again ... ($attempts) \n"); sleep(3); #sleep 3 seconds before trying again } } while(!$done);
My curiosity was triggered by how the "trying again" messages are printed. First one comes as expected, then then second one comes after 2-4 seconds. The third one can take 10 or more seconds to show up and then they come at 3 sec intervals. I am not bothered by how long it takes to get access to sshd so the fact that the total is around 30-40 seconds is not an issue. I am just curious about this weird timing and why is it always the second iteration?

Any ideas anyone?

Thanks!

Replies are listed 'Best First'.
Re: Socket connect curiosity
by quester (Vicar) on Dec 01, 2007 at 10:32 UTC
    Hmm... that is a good question. It is surely due to the timing of events as the target system goes through its startup.

    When the operating system first boots up, it takes a while to load the NIC driver. Before that point it won't respond to the initial SYN packet sent by connect() at all. Once the NIC card is initialized, the operating system kernel will start sending either RST packets as a response to clear the connection or ICMP port unreachable to tell you that nothing is listening. This can be modified by iptables, which should start before the NIC card is initialized, but could start afterwards if your system startup was done by someone with odd tastes. Iptables can be set to ignore connections to ports that are not open without sending any response at all.

    Once sshd starts, it gets to decide whether to accept incoming connections or to reset them, and it may (I'm not sure) reset them until it finishes its initialization. Also, if sshd is set to log incoming connections, there can be a delay of several seconds the first time it tries to resolve the client's IP address to a DNS name (to put into the log entry) and doesn't get a response from the DNS server.

    If you think it is worth some effort to track down, say if it may related to an actual problem you are seeing, you might want to get a packet sniffer and see what packets are actually being exchanged. Normally I would just suggest running wireshark but it won't work properly in this case... it doesn't like to be started up on a NIC card that isn't initialized. The simplest thing would probably be to connect a separate packet sniffer, which could just be a separate machine running wireshark, connected in parallel with the SSH server box using a hub or the port-spanning function on a switch.

      This is one of those areas where you can help yourself by isolating where the problem is. Try and think of all the potential places in your code where delays could occur. The sleep() function is one, definately - and maybe you suspect your sleep() function is broken!

      How shall we determine if sleep() is sleeping for a lot more than your prescribed 3 seconds? Let's try some debugging messages! e.g.

      else { $attempts++; print("Trying again ... ($attempts) \n"); print( scalar localtime() . "About to sleep\n" ); sleep(3); #sleep 3 seconds before trying again print( scalar localtime() . "Finished sleep\n" ); }

      I would bet London to a brick you'll discover the sleep function is just fine. So what else could cause delays in your loop? That connection() function gets called every time so how about we put some debugging messages around that? e.g.

      print( scalar localtime() . "About to connect\n" ); $connected = connect(SOCK, $paddr); print( scalar localtime() . "Finished connect\n" );

      To make your life a bit easier you might want to autoflush messages so that they appear as soon as you print them. Stick the following at the top of your script:

      select( ( select(STDOUT), $| = 1 )[0] ); select( ( select(STDERR), $| = 1 )[0] );

      Next you'll probably ask how you can implement a time-out mechanism on your connect. That's a two part problem: controlling how long the DNS look up takes (when DNS breaks resolving addresses can take a long time) and controlling how long the connect will wait. For now I won't advise how to deal with this problem; I hope I've given you some insight as to how to determine what the problem you're actually facing is.

Re: Socket connect curiosity
by RaduH (Scribe) on Dec 03, 2007 at 17:16 UTC
    Thanks quester and monarch,

    Running a sniffer I out of the question given the environment I operate in. I did try monarch's suggestions and here's the output:

    --> Rebooting ... main::p_wait_alive --> Mon Dec 3 09:48:49 2007 Before nap main::p_wait_alive --> Mon Dec 3 09:48:52 2007 After nap main::p_wait_alive --> Connecting to port 22 ... main::p_wait_alive --> Mon Dec 3 09:48:52 2007 Before connect main::p_wait_alive --> Mon Dec 3 09:50:28 2007 After connect main::p_wait_alive --> Trying again ... (1) main::p_wait_alive --> Mon Dec 3 09:50:28 2007 Before sleep main::p_wait_alive --> Mon Dec 3 09:50:31 2007 After sleep main::p_wait_alive --> Mon Dec 3 09:50:31 2007 Before connect main::p_wait_alive --> Mon Dec 3 09:50:34 2007 After connect main::p_wait_alive --> Trying again ... (2) main::p_wait_alive --> Mon Dec 3 09:50:34 2007 Before sleep main::p_wait_alive --> Mon Dec 3 09:50:37 2007 After sleep main::p_wait_alive --> Mon Dec 3 09:50:37 2007 Before connect main::p_wait_alive --> Mon Dec 3 09:50:46 2007 After connect main::p_wait_alive --> SSH-1.99-OpenSSH_4.2 online main::reboot --> HostB is back online
    A few explanations ... instead of print is have my own subroutine for debug messages. The behavior is controlled by a flag (print or do not print anything) and the structure of a debug message is module::subroutine (where is it printed from) and on the new line is the message prefixed by --> (hope this makes the blurb above somewhat more friendly).

    Using non-buffered output is in my opinion the key element for the difference between what I had noticed before and what I witnessed just now. I knew about the output buffering problem in general in software development, I just didn't know how/if Perl was addressing it and how to handle that (I also must admit I had ignored it for this problem until you pointed it out for me - thanks). Live and learn :)

    Anyway, going back to the problem, the nap period is a 3 second sleep I spend before going into my loop to make sure the other host has time to go down before I try to connect such that I won't "successfully" connect to a dieing server and declare the host to be online (sure, I could re-architect my loop to sleep first, try to connect later, but that's besides the point here).

    It's clear that the first connect is taking forever to timeout and return (that's what I think happens after my nap, as I start looping), then it seems the other connect calls take 3 seconds to time out. It appears the sleep calls take the desired 3 seconds.

    In conclusion, after turning off output buffering I iterate only about 3 times, the first connect takes a very long time and then each connect takes 3 seconds to time out and all sleep calls behave as expected.

    OK while typing the paragraph above I also thought about running the script with the debug messages but with the buffering turned on. Here's the interesting output:

    --> Rebooting ... main::p_wait_alive --> Mon Dec 3 10:09:52 2007 Before nap main::p_wait_alive --> Mon Dec 3 10:09:55 2007 After nap main::p_wait_alive --> Connecting to port 22 ... main::p_wait_alive --> Mon Dec 3 10:09:55 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:10:43 2007 After connect main::p_wait_alive --> Trying again ... (1) main::p_wait_alive --> Mon Dec 3 10:10:43 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:10:46 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:10:46 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:10:49 2007 After connect main::p_wait_alive --> Trying again ... (2) main::p_wait_alive --> Mon Dec 3 10:10:49 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:10:52 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:10:52 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:10:55 2007 After connect main::p_wait_alive --> Trying again ... (3) main::p_wait_alive --> Mon Dec 3 10:10:55 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:10:58 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:10:58 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:11:01 2007 After connect main::p_wait_alive --> Trying again ... (4) main::p_wait_alive --> Mon Dec 3 10:11:01 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:11:04 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:11:04 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:11:07 2007 After connect main::p_wait_alive --> Trying again ... (5) main::p_wait_alive --> Mon Dec 3 10:11:07 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:11:10 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:11:10 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:11:13 2007 After connect main::p_wait_alive --> Trying again ... (6) main::p_wait_alive --> Mon Dec 3 10:11:13 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:11:16 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:11:16 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:11:19 2007 After connect main::p_wait_alive --> Trying again ... (7) main::p_wait_alive --> Mon Dec 3 10:11:19 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:11:22 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:11:22 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:11:25 2007 After connect main::p_wait_alive --> Trying again ... (8) main::p_wait_alive --> Mon Dec 3 10:11:25 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:11:28 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:11:28 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:11:31 2007 After connect main::p_wait_alive --> Trying again ... (9) main::p_wait_alive --> Mon Dec 3 10:11:31 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:11:34 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:11:34 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:11:37 2007 After connect main::p_wait_alive --> Trying again ... (10) main::p_wait_alive --> Mon Dec 3 10:11:37 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:11:40 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:11:40 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:11:43 2007 After connect main::p_wait_alive --> Trying again ... (11) main::p_wait_alive --> Mon Dec 3 10:11:43 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:11:46 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:11:46 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:11:49 2007 After connect main::p_wait_alive --> Trying again ... (12) main::p_wait_alive --> Mon Dec 3 10:11:49 2007 Before sleep main::p_wait_alive --> Mon Dec 3 10:11:52 2007 After sleep main::p_wait_alive --> Mon Dec 3 10:11:52 2007 Before connect main::p_wait_alive --> Mon Dec 3 10:11:54 2007 After connect main::p_wait_alive --> SSH-1.99-OpenSSH_4.2 online main::reboot --> HostB is back online
    The only difference is the number of iterations. The first connect still takes a long time to time out, the rest is exactly the same, just more times...

    Regards,

    RaduH