hubb0r has asked for the wisdom of the Perl Monks concerning the following question:

Hello thelonious monks!

I have an issue that I'm having a terrible time figuring out. I have a small server app that listens on a socket and feeds a list of server names to a client that connects (this is internal to my network). The application retrieves it's list of servers to serve from another perl script that does some health testing on the allowable machines and returns an ordered list to this server app.

Ok, so it is all working really well, except for one little thing. The program that does the checking causes a little load on the servers that it checks, and this "server server" can get a pretty good load up on it. I decided that it was necessary to make the "server server" cache the content that it is serving for up to 10 seconds, and to refresh it's list if it gets a request where the data is more than 10 seconds stale.

So, this works for some of the listening servers, but not all. The "last_update_time" is stored at a global level for the server (although I've tried many ways of setting it, including passing a reference to it) and there is one or more servers that do not update the time no matter what I've tried. There is extensive logging going on, and it is very difficult to tell which is the parent and which is the child and which is not handling the requests as the pid changes with each request (I don't fully understand the module implementation, but it appears to be very apache-like in nature).

Either way, here is the server code:

#!/usr/local/bin/perl -w package speak_db_names; use Net::Server::PreForkSimple; @ISA = qw{Net::Server::PreForkSimple}; use strict; use Time::Format qw{%time}; use Data::Dumper; use File::Basename; my $basedir = dirname($0); my $refresh_time = 10; my $last_server_update; my $servers = [ ]; $servers = load_data($servers, undef); speak_db_names->run(conf_file => 'speak_db_names.conf'); ## sub process_request {{{ sub process_request { my $self = shift; $servers = load_data($servers, $self); eval { local $SIG{ALRM} = sub { die "Timed Out" }; my $timeout = 2; my $previous_alarm = alarm($timeout); print join( ',', @$servers); alarm($timeout); } } ## }}} ## sub load_data {{{ sub load_data { my $servers = shift; my $self = shift; unless ($last_server_update) { $last_server_update = 0 } if ( (time - $last_server_update) > $refresh_time ) { Net::Server::write_to_log_hook($self, 3, 'DB Cache expired: ' . (time - $last_server_ +update) . ' Seconds Old... Relo +ading'); Net::Server::write_to_log_hook($self, 3, 'Now: ' . time . " Last Upd: $last_ser +ver_update"); my $ret = `$basedir/check_database_health.pl .1 2>>/var/log/ch +eck_db_names.log`; $last_server_update = time; @$servers = split(/\t/, $ret); return $servers; } else { Net::Server::write_to_log_hook($self, 3, 'Reusing DB Cache ' . (time - $last_server_ +update) . ' Seconds old'); return $servers; } } ## }}} 1;
Here is the config file for the Net::Server::PreForkSimple:
min_spare_servers 1 min_servers 2 max_spare_servers 2 max_servers 3 max_requests 1000 check_for_dead 120 log_file /var/log/speak_db_names.log log_level 3 pid_file /tmp/speak_db_names.pid background 1 port 3307 child_communication 1
and here is a snippet of log file showing the issue:
Starting "1" children Starting "1" children 2005/11/28-19:43:35 CONNECT TCP Peer: "172.17.10.57:36685" Local: "172 +.17.10.75:3307" DB Cache expired: 474444 Seconds Old... Reloading Now: 1133225015 Last Upd: 1132750571 2005/11/28-19:43:36 CONNECT TCP Peer: "172.17.10.57:36688" Local: "172 +.17.10.75:3307" DB Cache expired: 474445 Seconds Old... Reloading Now: 1133225016 Last Upd: 1132750571 2005/11/28-19:43:36 CONNECT TCP Peer: "172.17.10.57:36689" Local: "172 +.17.10.75:3307" DB Cache expired: 474445 Seconds Old... Reloading Now: 1133225016 Last Upd: 1132750571 2005/11/28-19:43:36 CONNECT TCP Peer: "172.17.10.57:36690" Local: "172 +.17.10.75:3307" Reusing DB Cache 1 Seconds old 2005/11/28-19:43:36 CONNECT TCP Peer: "172.17.10.57:36691" Local: "172 +.17.10.75:3307" Reusing DB Cache 0 Seconds old 2005/11/28-19:43:37 CONNECT TCP Peer: "172.17.10.57:36692" Local: "172 +.17.10.75:3307" Reusing DB Cache 1 Seconds old 2005/11/28-19:43:37 CONNECT TCP Peer: "172.17.10.57:36693" Local: "172 +.17.10.75:3307" Reusing DB Cache 1 Seconds old 2005/11/28-19:43:38 CONNECT TCP Peer: "172.17.10.57:36694" Local: "172 +.17.10.75:3307" Reusing DB Cache 2 Seconds old Starting "1" children Starting "1" children 2005/11/28-19:43:38 CONNECT TCP Peer: "172.17.10.57:36695" Local: "172 +.17.10.75:3307" Reusing DB Cache 2 Seconds old 2005/11/28-19:43:39 CONNECT TCP Peer: "172.17.10.57:36696" Local: "172 +.17.10.75:3307" Reusing DB Cache 3 Seconds old 2005/11/28-19:43:39 CONNECT TCP Peer: "172.17.10.57:36697" Local: "172 +.17.10.75:3307" Reusing DB Cache 3 Seconds old 2005/11/28-19:43:40 CONNECT TCP Peer: "172.17.10.57:36698" Local: "172 +.17.10.75:3307" Reusing DB Cache 4 Seconds old 2005/11/28-19:43:40 CONNECT TCP Peer: "172.17.10.57:36699" Local: "172 +.17.10.75:3307" Reusing DB Cache 4 Seconds old 2005/11/28-19:43:40 CONNECT TCP Peer: "172.17.10.57:36700" Local: "172 +.17.10.75:3307" DB Cache expired: 474449 Seconds Old... Reloading Now: 1133225020 Last Upd: 1132750571 2005/11/28-19:43:41 CONNECT TCP Peer: "172.17.10.57:36701" Local: "172 +.17.10.75:3307" DB Cache expired: 474450 Seconds Old... Reloading Now: 1133225021 Last Upd: 1132750571

As you can see, the instances that do not update the time have never updated it from the initial load of the daemon. The others update it all the time.

Please help! It seems that more often than not the server that does not update the time gets hit, which causes both a strain on the machines being checked as well as a longer connection time waiting for the check to occur.

Thanks so much for your help!

-Hubb0r

Replies are listed 'Best First'.
Re: Net::Server::PreForkSimple and Children
by salvix (Pilgrim) on Nov 30, 2005 at 16:42 UTC
    You're assuming that your variables will be shared when you fork(), but that's not true. Each child process has its own copy of that variable, so changes are local to each child process.

    One solution would be to use IPC-Shareable.

Re: Net::Server::PreForkSimple and Children
by Rhandom (Curate) on Nov 30, 2005 at 17:19 UTC
    Funny you should ask. This is one of the more common questions fielded for Net::Server.

    In either of PreFork or PreForkSimple you are going to have this problem. As salvix already pointed out, forked processes get a snapshot of the state of the parent at the time of fork. Any updates to the parent after that point will not be reflected in the already forked processes. You need a method to share the data.

    For caching the data I'd recommend one of Cache::Memcached or Cache::FastMmap. Both of these are extremely fast. If all you are doing with the server is serving the cached data then Cache::Memcached already does everything you need (if you are doing more and just want a way to share common data then you still will need the Net::Server).

    In the end I'd benchmark a few of the Cache systems (Memcached, FastMmap, and IPC::Shareable) and see which one performs the best under your load circumstances.

    Another solution that is even faster - assuming you don't care about whether the data is shared between the processes (if you don't care that each process has a full copy of the data in memory) - you could go with a hybrid approach where the parent updates the Cache and the child processes only periodically check the cache for updates and then copy the entire cache into perl variables local to the child processes. For 99% of the server needs this would probably be overkill.

    I think it is about time to add a section discussing this to the PreFork perldoc.

    my @a=qw(random brilliant braindead); print $a[rand(@a)];
      So that was the right answer for the wrong question. You are doing what I suggested in the paragraph beginning with "Another solution" which is that each child occasionally loads its own copy of the data - periodically.

      Your implementation appears sound. I think it would be helpful to also log the pid with each of your log messages so we could get a better idea which ones are reloading - and if the same one is reloading.

      It is entirely possible that some of your preforked processes haven't been touched yet - and when they finally are then you get the old time stamp you have shown.

      my @a=qw(random brilliant braindead); print $a[rand(@a)];

        Thanks for your replies... and yes it would be nice to have a bit more documentation regarding this type of scenario in the Prefork or PreforkSimple docs... The entire Net::Server module is very different than what I'm used to using in the way that I had to modularize my program and have it call itself during execution.

        So, I actually did go and turn on pid logging after posting this message, and have been watching what's been happening at this point. I've also changed the config file to spawn only one server at a time so I can attempt to see exactly what is happening:

        min_servers 1 max_servers 1 max_requests 1000 log_file /var/log/speak_db_names.log log_level 3 pid_file /tmp/speak_db_names.pid background 1 port 3307

        Here is a current snippet of the logfile:

        16265 - Reusing DB Cache 0 Seconds old 2005/11/30-15:30:00 CONNECT TCP Peer: "172.17.10.67:50661" Local: "172 +.17.10.75:3307" 16265 - Reusing DB Cache 0 Seconds old 2005/11/30-15:30:01 CONNECT TCP Peer: "172.17.10.67:50668" Local: "172 +.17.10.75:3307" 16265 - Reusing DB Cache 1 Seconds old 2005/11/30-15:30:02 CONNECT TCP Peer: "172.17.10.67:50673" Local: "172 +.17.10.75:3307" 16265 - Reusing DB Cache 2 Seconds old 2005/11/30-15:30:03 CONNECT TCP Peer: "172.17.10.57:35875" Local: "172 +.17.10.75:3307" 16265 - Reusing DB Cache 3 Seconds old Starting "1" children 2005/11/30-15:30:06 CONNECT TCP Peer: "172.17.10.67:50683" Local: "172 +.17.10.75:3307" 16269 - DB Cache expired: 135784 Seconds Old... Reloading 16269 - Now: 1133382606 Last Upd: 1133246822 2005/11/30-15:30:06 CONNECT TCP Peer: "172.17.10.67:50687" Local: "172 +.17.10.75:3307" 16269 - Reusing DB Cache 0 Seconds old Starting "1" children 2005/11/30-15:30:09 CONNECT TCP Peer: "172.17.10.57:35883" Local: "172 +.17.10.75:3307" 16273 - DB Cache expired: 135787 Seconds Old... Reloading 16273 - Now: 1133382609 Last Upd: 1133246822 2005/11/30-15:30:10 CONNECT TCP Peer: "172.17.10.57:35885" Local: "172 +.17.10.75:3307" 16273 - Reusing DB Cache 1 Seconds old 2005/11/30-15:30:10 CONNECT TCP Peer: "172.17.10.57:35887" Local: "172 +.17.10.75:3307" 16273 - Reusing DB Cache 1 Seconds old 2005/11/30-15:30:12 CONNECT TCP Peer: "172.17.10.57:35889" Local: "172 +.17.10.75:3307" 16273 - Reusing DB Cache 3 Seconds old 2005/11/30-15:30:13 CONNECT TCP Peer: "172.17.10.57:35891" Local: "172 +.17.10.75:3307" 16273 - Reusing DB Cache 4 Seconds old 2005/11/30-15:30:14 CONNECT TCP Peer: "172.17.10.67:50706" Local: "172 +.17.10.75:3307" 16273 - Reusing DB Cache 5 Seconds old 2005/11/30-15:30:15 CONNECT TCP Peer: "172.17.10.57:35893" Local: "172 +.17.10.75:3307" 16273 - Reusing DB Cache 6 Seconds old 2005/11/30-15:30:15 CONNECT TCP Peer: "172.17.10.57:35895" Local: "172 +.17.10.75:3307" 16273 - Reusing DB Cache 6 Seconds old Starting "1" children 2005/11/30-15:30:33 CONNECT TCP Peer: "172.17.10.57:35897" Local: "172 +.17.10.75:3307" 16276 - DB Cache expired: 135811 Seconds Old... Reloading 16276 - Now: 1133382633 Last Upd: 1133246822 Starting "1" children 2005/11/30-15:30:58 CONNECT TCP Peer: "172.17.10.57:35899" Local: "172 +.17.10.75:3307" 16279 - DB Cache expired: 135836 Seconds Old... Reloading 16279 - Now: 1133382658 Last Upd: 1133246822 Starting "1" children 2005/11/30-15:31:25 CONNECT TCP Peer: "172.17.10.57:35901" Local: "172 +.17.10.75:3307" 16282 - DB Cache expired: 135863 Seconds Old... Reloading 16282 - Now: 1133382685 Last Upd: 1133246822 2005/11/30-15:31:26 CONNECT TCP Peer: "172.17.10.57:35902" Local: "172 +.17.10.75:3307" 16282 - Reusing DB Cache 0 Seconds old Starting "1" children 2005/11/30-15:31:29 CONNECT TCP Peer: "172.17.10.57:35904" Local: "172 +.17.10.75:3307" 16285 - DB Cache expired: 135867 Seconds Old... Reloading 16285 - Now: 1133382689 Last Upd: 1133246822 2005/11/30-15:31:29 CONNECT TCP Peer: "172.17.10.57:35905" Local: "172 +.17.10.75:3307" 16285 - Reusing DB Cache 0 Seconds old Starting "1" children 2005/11/30-15:31:37 CONNECT TCP Peer: "172.17.10.57:35907" Local: "172 +.17.10.75:3307" 16288 - DB Cache expired: 135875 Seconds Old... Reloading 16288 - Now: 1133382697 Last Upd: 1133246822 2005/11/30-15:31:37 CONNECT TCP Peer: "172.17.10.57:35909" Local: "172 +.17.10.75:3307" 16288 - Reusing DB Cache 0 Seconds old 2005/11/30-15:31:37 CONNECT TCP Peer: "172.17.10.57:35911" Local: "172 +.17.10.75:3307" 16288 - Reusing DB Cache 0 Seconds old 2005/11/30-15:31:38 CONNECT TCP Peer: "172.17.10.57:35913" Local: "172 +.17.10.75:3307" 16288 - Reusing DB Cache 1 Seconds old Starting "1" children

        So I'm still confused. My config file says: spawn one server (min/max) to deal with requests, and serve 1000 requests per server, right? So if you look at the log file, you will see that sometimes on process is used multiple times, but then a child will be spawned, and seemingly the next x number of calls call different child processes, none of which have the updated cache information. I don't mind a single process getting the cache information when it expires, but I don't want them getting it every call, and some of them seem to do so.

        Maybe you can see what I'm doing wrong, if anything at all, or maybe explain a bit more to me about what is happening. I've only done a few other programs that fork, and none of them have been a client/server type, so this is a bit new to me, although I think I understand much of the concept.

        Thanks