dbmathis:
I have a group of about 150 linux application
servers that a process runs on nightly and then a SUCCESS gets written
to a logfile of each of the servers when the process completes.
Currently I have to log into each server via ssh and grep each
log to see if the process completed.
YMMV, but I had (and have) to deal with a similar problem
in a "computational chemistry" environment. The number of servers
or nodes is about one half of yours.
What I learned from all that: "keep it dead simple" and try to
get it installed OOTB, if possible.
My current solution:
1. programs & logging
- One of the (older) boxes poses as server and holds the
node cluster in a subnet (a private one in my case)
- The server exposes (NFS,SMB possible) its /usr/local/bin (ro-mode) and
its /srv/cluster (rw-mode) to the subnet,
- The nodes load their applications from the central mounted
/usr/local/bin and write logs with date and ip
(in filenames) into separate files in /srv/cluster
2. job overview
- The server has some Perl scripts for job overview;
if required, the number and respective
IPs of running nodes are found by "nmapping" the subnet:
...
# $addr is the actual subnet, e.g. "192.168.1.0"
...
my $output = qx{nmap -sP ${addr}/24};
my @nodes= $output =~ /(?<=\s)c\w+\b/g;
This (nmap -sP) runs very fast (at least here, from
a non-root account) and can provide
"real time" info on running nodes via an HTML page, e.g.:
...
print header('text/html');
print h1('Local Network: '. $addr . '/24');
print map "$_ appears to be up<br />", @nodes;
...
The nodes found might then be rsh'ed (if it's a private
subnet, you won't be killed for using rsh/rexec)
Pseudo:
...
my ($exe, $cmd) = ('/usr/bin/rsh', 'ps -fl r -u username');
my $cnt = 0;
for my $node ( sort @nodes ) {
    # drop the ps header line and the ps command itself
    my @res = grep !(/\Q$cmd\E/ || /STIME/), split /[\n\r]+/, qx{$exe $node $cmd};
    my $nproc = scalar @res; # how many processes
    if( $nproc ) {
        print map
            "Do " . "some ". "formatting of " . "ps -fl output here!",
            @res;
    }
    ...
    ++$cnt;
    ...
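The formatting step inside the "map" could, for example, turn each
line of ps output into one HTML table row. A minimal sketch, not my
actual script: the sub names and the sample line are made up, and it
assumes "ps -fl" prints 15 columns with the command last (check this
against the ps on your nodes).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the few characters that matter inside HTML text.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Turn one line of ps output into a table row, prefixed with the node name.
# The limit of 15 keeps the whole command (with its arguments) in the last
# cell; 15 columns is an assumption about the "ps -fl" layout.
sub ps_line_to_row {
    my ($node, $line) = @_;
    my @cells = map { html_escape($_) } ($node, split ' ', $line, 15);
    return '<tr>' . join('', map { "<td>$_</td>" } @cells) . "</tr>\n";
}

# Hypothetical sample line, standing in for one element of @res.
my $sample = '0 R username 4711 1 99 80 0 - 1234 - 03:12 ? 02:10:33 ./calc <in.dat';
print "<table>\n", ps_line_to_row('c07', $sample), "</table>\n";
```

In the loop above this would be used as
"print map { ps_line_to_row($node, $_) } @res;".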
In the end, you'll have a browser interface to the
running processes (build a nice HTML table in the "map"
above) and a central directory full of log files, which
might even be exported (SMB) to Windows machines for
coworkers who prefer Explorer ;-)
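With all logs in one directory, the original question (which servers
wrote SUCCESS last night?) becomes a local scan instead of 150 ssh
logins. A rough sketch under assumptions: the file naming
"<YYYY-MM-DD>_<ip>.log" and the sub names are hypothetical, so adapt
them to however your nodes actually name their logs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return true if the file contains a line with the word SUCCESS.
sub log_has_success {
    my ($path) = @_;
    open my $fh, '<', $path or return 0;   # unreadable counts as "not done"
    while (my $line = <$fh>) {
        return 1 if $line =~ /\bSUCCESS\b/;
    }
    return 0;
}

# Split the logs for one date into finished and unfinished ones.
# Assumed (hypothetical) naming: <YYYY-MM-DD>_<ip>.log
sub check_logs {
    my ($dir, $date) = @_;
    my (@ok, @missing);
    for my $path (sort glob("$dir/${date}_*.log")) {
        if (log_has_success($path)) { push @ok, $path }
        else                        { push @missing, $path }
    }
    return (\@ok, \@missing);
}

# Usage: check_logs.pl /srv/cluster 2024-01-31
my ($dir, $date) = @ARGV;
if (defined $dir && defined $date) {
    my ($ok, $missing) = check_logs($dir, $date);
    printf "%d of %d logs report SUCCESS\n", scalar @$ok, @$ok + @$missing;
    print "no SUCCESS yet: $_\n" for @$missing;
}
```

A node that never wrote a log for that date will not show up here at
all, so it is worth comparing the result against the node list from
the nmap step.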
The only "complication" (additional work per node) would
be "installing and enabling the nfs client".
my €0.02
regards
mwa