hubb0r has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've written a program which is basically a network health-checking server for databases. It has two parts: the first is a basic network server which listens for requests on a port and responds to the caller with a hash. The second part, and the part I'm having trouble with, is a process which runs in a loop, updating a memory cache of server information when it ages out, via a series of health metrics which it collects.
I've used Proc::Simple to spawn the process which does the health checks and updates the cache from the main process (a network server built on Net::Server::PreForkSimple). The cache process uses Cache::FastMmap to implement an aging cache which the PreForkSimple children read from to get their data. The cache process begins refreshing the cache at timeout / 2 seconds since the last update, which gives it enough time to keep the cache fresh so the client processes don't incur lag while waiting for the cache to be renewed. The health checks are fairly basic: a ping, a traceroute, the db connect time, then a connection to the db to look at the processlist and get a sense of its health. I use these metrics to determine which machines are available, and to order them in an array from most desirable to least desirable.

This all works fine, and in fact a simpler version of this code has been running for about a year now without a glitch.

The problem is this: after a few hours of running, the health checks begin to fail somewhere in making a network connection. I've tried wrapping things in evals to catch errors, timeouts, etc. I've turned the debugging levels all the way up on all the modules I'm using. I've gotten nowhere! Net::Traceroute dies with:
Can't call method "reader" on an undefined value at /usr/local/lib/perl5/site_perl/5.8.6/Net/Traceroute.pm line 412.
Net::Ping dies with:
Can't get tcp protocol by name at ... <my program name and line>
If I skip them and go straight to a DBI connection, it too fails (I can't find the error in the log messages). I've tried skipping the nameserver lookup and using IP addresses directly, but that has the same result.

If I restart the process, it works fine again for a number of hours, and then the same failure process begins again.
The code is attached below. (Note: it won't run as-is, since you need a config file and servers on your network to check.)
```perl
#!/usr/local/bin/perl -w
package speak_db_names2;
@ISA = qw{Net::Server::PreForkSimple};
use strict;
use Cache::FastMmap;
use Config::Auto;
use Data::Dumper;
use DBI;
use File::Basename;
use Net::Nslookup;
use Net::Ping;
use Net::Server::PreForkSimple;
use Net::Traceroute;
use POSIX qw{:signal_h};
use Proc::Simple;
use Storable qw{nstore_fd};
use Time::Format qw{%time};
use Time::HiRes qw{time sleep};

## configuration information {{{
my $SERVER_FILE = dirname($0) . "/DB_NAMES2";
my $config = Config::Auto::parse($SERVER_FILE);
my $servers;
my $preferred_order_prefs;

## Loop through the config values and ensure that each config option (servers) is an array ref:
while ( my ( $key, $value ) = each %$config ) {
    ## only do this to server storage config options
    if ( $key =~ /.+_servers$/ ) {
        if ( ref($value) ) {
            $servers->{$key} = $value;
        }
        else {
            $servers->{$key} = [ $value ];
        }
        ## Set a preferred server order pref for each known group
        if ( exists $config->{$key . "_have_preferred_order"} ) {
            $preferred_order_prefs->{$key} = $config->{$key . "_have_preferred_order"};
        }
        else {
            $preferred_order_prefs->{$key} = 0;
        }
    }
}

my $cache_timeout = 7;  # Allow things to live this long in the cache
my $verbose       = 0;  # Be verbose to logfile
my $prefer        = 5;  # Weighting for preferred servers
my $DEBUG         = 5;

## Problem weights. These are multipliers for each specific metric. Lower is better on the check scale.
my %prob_wghts = (
    checks       => 100,
    connect_time => 1000,  ## This number is normally in seconds, ie .00823. This makes it significant
    ping         => 1000,  ## Same here
    hops         => 1,
    processes    => .1,
    slow_queries => 10,
);
## }}}

my $ping = Net::Ping->new('icmp', 2)
    or die "Cannot initialize Net::Ping: $!";
$ping->hires(1);

## Make a server instance and call the run method. From here forward, things
## are handled by the hooks provided by the Net::Server::PreForkSimple class.
my $server = bless { }, 'speak_db_names2';
$server->run(conf_file => dirname($0) . '/speak_db_names2.conf');

## subs below ##

## sub check_health {{{
sub check_health {
    my %healthy_servers_hash;

    ## run health checks {{{
    while ( my ($group, $these_servers) = each %$servers ) {
        ## health is evaluated for the group.
        my %health;
        foreach my $host ( @$these_servers ) {
            $Net::Nslookup::TIMEOUT = 2;
            if ( $DEBUG >= 4 ) { $Net::Nslookup::DEBUG = 1; }
            my $ip = nslookup($host);
            debug(3, "Nslookup $host: $ip");

            if ( my $ping_time = check_ping($ip) ) {
                $health{$host}{ping} = $ping_time;
            }
            else { next; }

            if ( my $traceroute = check_traceroute($ip) ) {
                $health{$host}{hops} = sprintf('%2d', $traceroute);
            }
            else { next; }

            ## Get a db handle and check the connection time
            my ($dbh, $connect_time) = check_connect_time($ip);
            debug(5, "$host DBH = " . Dumper($dbh));
            debug(5, "$host DB connect_time = " . Dumper($connect_time));
            ## Return of -1 means that we were not able to connect after the allowed timeout
            if ( $dbh ) {
                $health{$host}{connect_time} = $connect_time;
            }
            else { next; }

            ## Check the running processlist on the server
            if ( my $procs = check_processes($dbh) ) {
                while ( my ($key, $value) = each %$procs ) {
                    $health{$host}{$key} = $value;
                }
            }
            else { next; }

            $health{$host}{host_ok} = 1;
        }
        my $health = "Health hash for group $group: \n" . join(" ", Dumper(%health));
        debug(3, $health, 1);
        my @ordered_servers = evaluate_server_order(\%health, $these_servers, $group);
        $healthy_servers_hash{ $group } = \@ordered_servers;
    }
    ## }}}
    return \%healthy_servers_hash;
}
## }}}

## sub evaluate_server_order {{{
sub evaluate_server_order {
    my $health     = shift;
    my $pref_order = shift;
    my $group      = shift;
    my %server_order;
    verbose(undef, $group);

    ## Loop through the metrics and apply weights
    foreach my $metric ( qw{ connect_time checks ping processes slow_queries hops } ) {
        ## Order the metrics from lowest to greatest
        my %tmp_hash = map  { $_ => $health->{$_}{$metric} }
                       grep { $health->{$_}{host_ok} }
                       keys %$health;
        foreach ( keys %tmp_hash ) {
            $server_order{$_} += sprintf('%6d', ( $tmp_hash{$_} * $prob_wghts{$metric} ));
        }
        verbose(\%tmp_hash, $metric);
    }
    verbose(\%server_order, 'Server Order');

    ## Add a multiplier for preferred servers
    if ( $preferred_order_prefs->{$group} ) {
        if ($prefer) {
            foreach (0 .. $#$pref_order) {
                if ($server_order{$pref_order->[$_]}) {
                    $server_order{$pref_order->[$_]} += $_ * $prefer;
                }
            }
        }
    }
    my @server_order = sort { $server_order{$a} <=> $server_order{$b} } keys %server_order;
    return @server_order;
}
## }}}

## sub process_request {{{
sub process_request {
    my $self = shift;
    my $serve_data;
    for ( 1 .. 100 ) {
        $serve_data = $self->{'cache'}->get('servers');
        if ( defined $serve_data ) {
            last;
        }
        elsif ( ! $self->{updater}->poll() ) {
            debug(2, "The Updater is dead... long live the Updater!");
            $self->{updater}->kill();
            $self->{updater}->start(\&update_cache, $self);
        }
        sleep .25;
    }
    nstore_fd $serve_data, \*STDOUT;
}
## }}}

## sub pre_server_close_hook {{{
sub pre_server_close_hook {
    my $self = shift;
    $self->{'updater'}->kill();
}
## }}}

## sub pre_loop_hook {{{
sub pre_loop_hook {
    my $self = shift;
    $self->{cache}   = Cache::FastMmap->new( cache_size => '1k', init_file => 1,
                                             expire_time => $cache_timeout );
    $self->{updater} = Proc::Simple->new();
    $self->{updater}->start(\&update_cache, $self);
}
## }}}

## sub update_cache {{{
sub update_cache {
    my $self = shift;
    my $seconds_since_update = 0;
    while (1) {
        my $serve_data = $self->{'cache'}->get('servers');
        if ( ! defined $serve_data or $seconds_since_update >= $cache_timeout / 2 ) {
            my $update_start = time;
            debug(2, "Updater reloading cache....");
            $serve_data = $self->check_health();
            my $update_end      = time;
            my $update_duration = sprintf('%2.2f', $update_end - $update_start);
            debug(2, "Update duration: $update_duration seconds.");
            $seconds_since_update = 0;
            debug(1, "Servers now: " . Dumper($serve_data) );
            $self->{'cache'}->set('servers', $serve_data);
        }
        else {
            $seconds_since_update += .5;
            sleep .5;
            debug(3, "Seconds since last update: $seconds_since_update.");
        }
    }
}
## }}}

## sub verbose {{{
sub verbose {
    my $hash = shift;
    my $type = shift;
    if ($verbose) {
        my $info;
        if ( not defined($hash) ) {
            $info = sprintf("%-20s %6s: %20s - \n", $time{'yyyy-mm-dd hh:mm:ss'}, $$, $type);
        }
        else {
            $info .= sprintf("%10s\n", $type);
        }
        ## Guard against an undef hash ref (e.g. the verbose(undef, $group) call)
        if ( defined $hash ) {
            foreach my $key ( sort { $hash->{$a} <=> $hash->{$b} } keys %$hash ) {
                $info .= sprintf("%40s => %6d\n", $key, $hash->{$key});
            }
        }
        Net::Server::write_to_log_hook($server, 3, $info);
    }
}
## }}}

## Health checking functions: {{{
## sub check_connect_time {{{
sub check_connect_time {
    my $db_host = shift;
    debug(4, "Checking connect time for $db_host");
    my $dbh;
    my $connect_time = '';
    my $seconds = 2;
    my $mask   = POSIX::SigSet->new( SIGALRM );    # signals to mask in the handler
    my $action = POSIX::SigAction->new(
        sub { die "connect timeout" },             # the handler code ref
        $mask,
    );
    my $oldaction = POSIX::SigAction->new();
    sigaction( &POSIX::SIGALRM, $action, $oldaction );
    eval {
        alarm($seconds);
        my $start_time = time;
        $dbh = DBI->connect_cached("DBI:mysql:database=test;host=$db_host",
            'check_health', '*********',
            { RaiseError => 1, PrintError => 0 })
            or die "Could not connect to $db_host: " . $DBI::errstr;
        my $end_time = time;
        $connect_time = $end_time - $start_time;
        alarm 0;    # cancel alarm (if code ran fast)
    };
    sigaction( &POSIX::SIGALRM, $oldaction );      # restore original signal handler
    if ( $@ ) {
        debug(4, "Problem!: $@");
        return ( 0, undef );
    }
    else {
        debug(4, "OK - $connect_time seconds");
        return ( $dbh, $connect_time );
    }
}
## }}}

## sub check_ping {{{
sub check_ping {
    my $host = shift;
    debug(4, "Running ping for $host: ");
    my $round_trip;
    eval {
        ## Ping the server for initial connectivity check
        if ( my @ping_ret = $ping->ping($host) ) {
            debug(4, "OK - $ping_ret[1] seconds");
            $round_trip = $ping_ret[1];
        }
    };
    if ( $@ ) {
        debug(1, "Problem with ping: $@");
    }
    if ($round_trip) {
        return $round_trip;
    }
    debug(4, "Not OK!");
    return 0;
}
## }}}

## sub check_processes {{{
sub check_processes {
    my $dbh = shift;
    my %health;
    debug(4, "Running process check: ");
    if ( my $ret = $dbh->selectall_arrayref('show processlist') ) {
        $health{processes}    = scalar(@$ret);
        $health{slow_queries} = scalar( grep { /(?:lock)/i }
                                        map  { $_->[6] if defined $_->[6] } @$ret );
        $health{checks}       = scalar( grep { /(?:check|alter|repair)/i }
                                        map  { $_->[6] if defined $_->[6] } @$ret );
        $health{checks}      += scalar( grep { /(?:check|alter|repair)/i }
                                        map  { $_->[7] if defined $_->[7] } @$ret );
        $dbh->disconnect;
        debug(4, "OK");
        return \%health;
    }
    debug(4, "Problem");
    $dbh->disconnect;
    return 0;
}
## }}}

## sub check_traceroute {{{
sub check_traceroute {
    my $host = shift;
    debug(4, "Running traceroute for $host: ");
    my $tr;
    my $hops;
    eval {
        ## Lets see some traceroute info:
        $tr = Net::Traceroute->new( host => $host, timeout => 2,
                                    queries => 1, use_icmp => 1 )
            or die "Cannot initialize Net::Traceroute: $!";
        if ( $tr->found ) {
            debug(4, "OK - " . $tr->hops . " hops.");
            $hops = $tr->hops;
        }
    };
    if ($@) {
        debug(1, "Problem with traceroute: $@");
    }
    if ( $hops ) { return $hops; }
    debug(4, "Problem!");
    return 0;
}
## }}}
## }}}

sub debug {
    my $level   = shift;
    my $message = shift;
    my $nostamp = shift;
    my $stamp   = '';
    unless ( $nostamp ) {
        $stamp = "$time{'yyyy/mm/dd hh:mm:ss'} $$: ";
    }
    if ( $DEBUG >= $level ) {
        Net::Server::write_to_log_hook($server, 3, $stamp . $message);
    }
}
```

Can someone please shed some light on this? I'm at my wits' end trying to debug it!

Readmore tags added by GrandFather

Re: Failures in TCP/IP stack
by bunnyman (Hermit) on May 27, 2006 at 22:57 UTC

    Without even looking at the code, it sounds like you are leaking file descriptors. Are you closing everything that you open? Check how many FDs you have open by looking into /proc/$$/fd/
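    For illustration, here's one sketch of how you might watch that count from inside the daemon itself rather than eyeballing the directory by hand. It relies on the Linux-specific /proc layout, and the helper name is my own invention, not anything from the OP's code:

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Count open file descriptors for the current process by listing
# /proc/$$/fd (Linux-specific). Calling this periodically from the
# health-check loop and logging the result makes a leak obvious as
# a steadily climbing number.
sub open_fd_count {
    opendir( my $dh, "/proc/$$/fd" ) or return -1;
    # Skip "." and ".."; everything else is an open descriptor.
    # (The directory handle itself counts as one while we read it.)
    my $count = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return $count;
}

print "Currently open fds: ", open_fd_count(), "\n";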

    As a note on style, you are opening a database handle in one sub and closing it in another sub. I'd rather see the dbh closed in the calling sub instead of being passed into another sub and closed there.

      Thanks for the tip on the file descriptors.... I'll take a look.
      As for the dbh handle, I understand your point, but I was reusing the dbh to attempt to speed up the process... connection time is costly. Perhaps it would be wiser for me to merge the connection time sub with the db process checking sub instead.
Re: Failures in TCP/IP stack
by BrowserUk (Patriarch) on May 28, 2006 at 03:25 UTC

    Have you checked for memory growth over time?

    Try running the main bulk of your code, the checks, in a standalone script that doesn't use preforking, and see whether that changes things.

    Loop over the checks as fast as your network admin will let you get away with: on a test network if one is available, or after hours, or at weekends, whatever makes sense in your environment, perhaps using only a single machine (maybe your own) as the target of the tests. The idea is to run the components as fast as you can to artificially exacerbate the problem. If you can get the problem to occur in a reasonably short period of time, then it becomes much easier to exclude the various subcomponents one at a time and isolate the fault that way.
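    As a sketch of that kind of stress driver (the TCP-connect check here is just a self-contained stand-in; you'd drop in the real check subs from the daemon, and the host is whatever test target you can get away with hammering):

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Minimal stand-in for one health check: try a TCP connect to the
# MySQL port. Swap in the real check subs from the daemon here.
sub health_check {
    my $host = shift;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 3306,
        Timeout  => 2,
    ) or return 0;
    close $sock;
    return 1;
}

my $host       = shift @ARGV || 'localhost';   # placeholder test target
my $iterations = 0;

# Hammer the check in a tight loop; the iteration count at the first
# failure tells you how fast the resource (fds, memory) is being burned.
while ( health_check($host) ) {
    $iterations++;
}
warn "First failure after $iterations successful iterations\n";
```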


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I'm thinking that there is actually a bug in either DBI or DBD::mysql that is causing the file descriptors to hang open on failed connections. Anyone have any ideas, or should I start looking through the DBI code to track it down?

        First off, try running (a copy of) your code with the DBI stuff commented out. If your problem goes away, you know that's where to look.

        Then, produce a cut down version of your daemon script that only contains the DBI checks against a single, test installation of MySQL, and try to reproduce the problem by setting the frequency as high as you can.

        At that point you should have a much smaller script that reproduces the problem much more quickly. If you then post that here, the DBI experts here (not me!) will be much more likely to take the time to review your shorter code and perhaps spot the problem or offer suggestions as to a way forward.

        I have a vague recollection that by default (back at version 3.something), MySQL hung on to unclosed connections for something like 900 seconds, and that there was a configuration option (at the server end) to have connections time out more quickly. My recollection may be wrong, and it probably wouldn't produce the symptoms you are seeing, but it's the kind of thing that those with good MySQL experience may spot for you, once you have isolated the problem and posted a concise script that reproduces it.


Re: Failures in TCP/IP stack
by hubb0r (Pilgrim) on May 28, 2006 at 04:33 UTC
    So, it looks like bunnyman has the correct source of the problem... leaking file descriptors. Way to catch that one!

    So here is where it stands. Here is the current check database sub (connection part only):
```perl
sub check_database_health {
    my $db_host = shift;
    my %health;
    debug(4, "Checking connect time for $db_host");
    my $dbh;
    my $connect_time;
    my $seconds = 2;
    my $mask   = POSIX::SigSet->new( SIGALRM );    # signals to mask in the handler
    my $action = POSIX::SigAction->new(
        sub { die "connect timeout" },             # the handler code ref
        $mask,
    );
    my $oldaction = POSIX::SigAction->new();
    sigaction( &POSIX::SIGALRM, $action, $oldaction );
    eval {
        alarm($seconds);
        my $start_time = time;
        $dbh = DBI->connect("DBI:mysql:database=test;host=$db_host",
            'check_health', '******',
            { RaiseError => 1, PrintError => 0 })
            or die "Could not connect to $db_host: " . $DBI::errstr;
        my $end_time = time;
        $connect_time = $end_time - $start_time;
        alarm 0;    # cancel alarm (if code ran fast)
    };
    sigaction( &POSIX::SIGALRM, $oldaction );      # restore original signal handler
    if ( $@ ) {
        debug(4, "Problem!: $@");
        return undef;
    }
    else {
        $health{connect_time} = $connect_time;
        debug(4, "OK - $connect_time seconds");
    }
}
```
    I'm using this to get a connection, and to log the time it takes to connect. If I cannot connect, I need to know that too. So, there are a couple of servers in the list of servers that are not running their databases right now, and every time DBI fails to connect to those machines, it leaves its file descriptor around ( 10 -> socket:16194434 ) forever. I am explicitly closing my connections when I have made one, but in the case of DBI not being able to connect, what do I need to do to ensure that those file descriptors get cleaned up?
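    One pattern worth trying in the error path, offered as a sketch rather than a guaranteed fix: if the alarm fires mid-connect, DBI may have handed back a partially set-up handle, so explicitly disconnect anything that did get created before returning failure. (If the leak is inside DBD::mysql itself on a connect that returns undef, this won't reach it, but it rules out the half-constructed-handle case.)

```perl
# Sketch of a defensive error path after the eval above:
if ( $@ ) {
    debug(4, "Problem!: $@");
    if ( $dbh ) {
        eval { $dbh->disconnect };   # best effort; ignore errors here
        undef $dbh;                  # drop the last reference so DESTROY runs
    }
    return undef;
}
```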
Re: Failures in TCP/IP stack
by hubb0r (Pilgrim) on May 29, 2006 at 04:39 UTC
    So I seem to have repaired the problem by removing the POSIX::SigAction alarm handling wrapped around the DBI connection code. This is, unfortunately, contrary to the perldoc for DBI, which states that the POSIX::SigAction method should be used if your perl is greater than 5.8.0.
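    For anyone landing here later, the plain local $SIG{ALRM} idiom that replaces the POSIX::SigAction wrapper might look roughly like this (host and password are placeholders; the caveat is that under 5.8's deferred "safe" signals the alarm may not interrupt a connect that is blocked inside C code, which is exactly why the DBI docs suggest POSIX::SigAction in the first place):

```perl
use strict;
use warnings;
use DBI;
use Time::HiRes qw{time};

my $db_host  = 'db1.example.com';   # placeholder host
my $password = 'secret';            # placeholder credential
my ($dbh, $connect_time);

eval {
    local $SIG{ALRM} = sub { die "connect timeout\n" };
    alarm 2;                        # same 2-second budget as the original
    my $start = time;
    $dbh = DBI->connect( "DBI:mysql:database=test;host=$db_host",
        'check_health', $password,
        { RaiseError => 1, PrintError => 0 } )
        or die "Could not connect to $db_host: $DBI::errstr\n";
    $connect_time = time - $start;
    alarm 0;
};
alarm 0;    # belt and braces: make sure no pending alarm escapes the eval

if ($@) {
    warn "Connect failed: $@";
}
else {
    printf "Connected in %.4f seconds\n", $connect_time;
}
```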

    Thanks all for your help and input on this problem!
Re: Failures in TCP/IP stack
by Anonymous Monk on May 27, 2006 at 23:27 UTC
    There is a bug inside your code, it is on the left side, but you may not want to touch it, it could be poisonous.
