Detecting and reaping stale sockets

SIGSEGV has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have to devise a script that detects AF_INET sockets that seem broken and cause SIGBUS core dumps. These sockets are connections between a webserver accessible from the internet and a database server behind a firewall that services queries from the webserver.

The DBAs responsible for the DBMS suspect that the firewall is the culprit that severs routes that haven't been used for a certain period.

First I need to know how to find out if a socket is a probable candidate to throw a SIGBUS soon. My first shot at this is a mere parsing of the netstat command, especially aiming at sockets in a CLOSE_WAIT state. (see code sample below).

I'd rather do this through the Socket or IO::Socket module but don't know how to read the states of active sockets (like netstat displays them). Maybe someone can give me a hint in that direction?

Then I need an (almost) obvious criterion for killing processes that keep a broken end of a socket open. (How can I identify those in the absence of lsof or similar tools?)

I think all of this should already have been implemented in the applications that establish the sockets, in the form of a signal handler that closes sockets properly on receipt of signals such as SIGBUS, SIGPIPE, etc. Unfortunately the application that causes this is a black box to me, and I have no access to its code.
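To illustrate the kind of handling I mean, here is a minimal sketch, not the application's actual code. It assumes a local socketpair as a stand-in for the real database connection, and write_or_reap is a hypothetical helper name. It only covers SIGPIPE; I would not expect a SIGBUS to be recoverable this way.

```perl
use strict;
use warnings;
use Socket qw(AF_UNIX SOCK_STREAM PF_UNSPEC);

my $sigpipe_seen = 0;
# Without a handler, the default action for SIGPIPE terminates the process.
$SIG{PIPE} = sub { $sigpipe_seen++ };

# Hypothetical helper: write to a socket; if the write fails (for example
# because the peer is gone), close our end and return undef.
sub write_or_reap {
    my ($sock, $data) = @_;
    my $n = syswrite($sock, $data);
    return $n if defined $n;
    my $err = "$!";
    close $sock;                 # release the broken descriptor right away
    warn "write failed ($err), socket closed\n";
    return;
}

# Stand-in for the webserver/database connection (an assumption for the demo).
socketpair(my $app, my $peer, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";
close $peer;                     # simulate the far end disappearing

my $result = write_or_reap($app, "ping\n");
print defined $result ? "wrote $result bytes\n" : "peer gone, socket reaped\n";
```

The point is only that the process survives the failed write and frees the descriptor itself instead of leaving it behind.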

Here is my first shot at dumping the socket states into an array with a little Perl script:

#!/opt/perl5/bin/perl

use 5.006;
use strict;
use warnings;

my @SOCKET_STATES = qw(ESTABLISHED SYN_SENT SYN_RECV FIN_WAIT1 FIN_WAIT2
                       TIME_WAIT CLOSED CLOSE_WAIT LAST_ACK LISTEN CLOSING
                       UNKNOWN);
my %state_counts;

my @inet_sockets = parse_netstat();

# count the sockets per state
$state_counts{$_->{State}}++ for @inet_sockets;

my @stale_socks = map $_->[0],
                  grep $_->[1] eq 'CLOSE_WAIT',
                  map [$_, $_->{State}], @inet_sockets;

my @stale_ips = map $_->[0],
                sort {$a->[1] <=> $b->[1]
                               ||
                      $a->[2] <=> $b->[2]
                               ||
                      $a->[3] <=> $b->[3]
                               ||
                      $a->[4] <=> $b->[4]}
                map [$_, split(/\./, $_->{Foreign_IP})], @stale_socks;

print "Summary of socket states for type AF_INET:\n\n";
my $sum = 0;
print map { $sum += $state_counts{$_};
            sprintf "%12s = %4u\n", $_, $state_counts{$_}
          } 
      sort keys %state_counts;
printf "%s\n%12s = %4u\n\n", '-'x19, 'TOTAL', $sum;

print "The $state_counts{CLOSE_WAIT} foreign addresses of sockets of state CLOSE_WAIT:\n\n";
print map {sprintf "%30s\n", $_->{Foreign_IP}.':'. $_->{Foreign_Port}} @stale_ips;

sub parse_netstat {

my %CMD = ( exe => '/usr/bin/netstat',
            args => [qw(-a -n -f inet)],
            dump_keys => [qw(Protocol Recv-Q Send-Q Local_IP Foreign_IP
                             State Local_Port Foreign_Port)],
          );

local *NETSTAT;
my @dump;

my $pid = open NETSTAT, '-|';
die "cannot fork '$CMD{exe} @{$CMD{args}}': $!\n" unless defined $pid;
if ($pid) {
    my %rec = ();
    while (<NETSTAT>) {
        s/^\s+|\s+$//g;
        @rec{@{$CMD{dump_keys}}[0..5]} = split;
        # skip blank lines, headers and non-TCP entries
        next unless defined $rec{Protocol} and $rec{Protocol} eq 'tcp';
        @rec{@{$CMD{dump_keys}}[3,-2]} = $rec{$CMD{dump_keys}[3]} =~
            /(\d+\.\d+\.\d+\.\d+|\*)\.(\d+|\*)/o;
        @rec{@{$CMD{dump_keys}}[4,-1]} = $rec{$CMD{dump_keys}[4]} =~
            /(\d+\.\d+\.\d+\.\d+|\*)\.(\d+|\*)/o;
        push @dump, {%rec};
    }
} else {
    exec $CMD{exe}, @{$CMD{args}};
    # only reached if exec itself failed; $pid is 0 in the child
    die "cannot exec '$CMD{exe}': $!\n";
}
close NETSTAT or die "cannot close pipe from '$CMD{exe} @{$CMD{args}}': $!\n";
return @dump;
}

This prints something like the following (the list of foreign IPs is omitted here):

$ perl socklst.pl|head -14
Summary of socket states for type AF_INET:

      CLOSED =    1
  CLOSE_WAIT =   40
 ESTABLISHED =   28
  FIN_WAIT_1 =    1
  FIN_WAIT_2 =  323
      LISTEN =  116
   TIME_WAIT =   41
-------------------
       TOTAL =  550

The 40 foreign addresses of sockets of state CLOSE_WAIT:

Wouldn't you agree that there are far too many sockets in the FIN_WAIT_2 state? At least to me (though I don't have a networker's background) this looks screwed up.

TIA


Replies are listed 'Best First'.
Re: Detecting and reaping stale sockets by Zapawork (Scribe) on Dec 19, 2002 at 19:28 UTC
Hi SIGSEGV,

The first question is whether you can find the file descriptor for the socket that is currently seen as waiting to close or open a connection. You could locate it from the output of a tool like lsof (list open files), but I do not know how you would do this in Perl other than possibly reading /proc. Then you need to pass the descriptor to something like select or poll; you probably want poll, since it will not return until it finds a condition on one of the sockets.

As far as I know, though, since the socket is already in a closing state (FIN_WAIT means it is waiting for the other side to finish closing the connection), the standard read/write tools may not help. These tools have an error condition but not a 'test for half-closed condition'. You could just close the sockets manually, but that may be a bad idea, as you don't know how the original application handles it. You could also try to write raw IP packets and spoof FIN packets to the localhost, but that could also make the original application unhappy.

A better thing to do would be to fix the actual issue at hand rather than trying to bandage a solution. If your firewall is timing out connections, then it sounds like it is maintaining a state table and possibly translating (NAT) the connection. You do not need this. Why? Because there is no benefit to keeping state on a persistent database connection, and no added security, since as far as I've ever seen no one has made a list of secure database commands to check against in the application layer. This would be very hard to do, since most database problems come in syntactically valid queries.

So what you really need to do is ask your firewall/network person to treat your connection as a static route and only apply packet filtering rules, no state. This is achievable but can be difficult on some firewalls, Cisco PIX for example, that treat every connection through them as a route. This will prevent the firewall from closing the socket and keep the webserver from sending the FIN in the first place.

If this does not work for you, let me know; there are some other solutions you could try that I've helped clients with.

Just my opinion.

Dave -- Saving the world one node at a time
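For what it's worth, here is a rough sketch of what select can and cannot tell you, assuming you are inside the process that owns the descriptor (which is not the case for a black-box application). A peer that has closed its end shows up as "readable", and a one-byte peek then returns no data. The function name peer_state and the socketpair used for the demo are illustrative assumptions only.

```perl
use strict;
use warnings;
use Socket qw(AF_UNIX SOCK_STREAM PF_UNSPEC MSG_PEEK);
use IO::Select;

# Hypothetical helper: classify a connected stream socket as
# 'idle' (nothing to read), 'data' (bytes waiting) or 'closed' (peer sent EOF).
sub peer_state {
    my ($sock, $timeout) = @_;
    my $sel = IO::Select->new($sock);
    return 'idle' unless $sel->can_read($timeout);

    # MSG_PEEK leaves pending data in the buffer for the real reader.
    my $buf = '';
    my $rv  = recv($sock, $buf, 1, MSG_PEEK);
    return 'error' unless defined $rv;
    return length($buf) ? 'data' : 'closed';
}

# Demo on a local socket pair standing in for a real TCP connection.
socketpair(my $local, my $remote, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

my @seen;
push @seen, peer_state($local, 0.1);      # nothing sent yet
syswrite($remote, "x");
push @seen, peer_state($local, 0.1);      # one byte pending
sysread($local, my $junk, 1);             # consume it
close $remote;
push @seen, peer_state($local, 0.1);      # peer has closed its end
print join(', ', @seen), "\n";
```

Note that this only detects a close the peer actually announced. If the firewall silently drops the connection, nothing arrives and the socket simply looks idle.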
