hallikpapa has asked for the wisdom of the Perl Monks concerning the following question:

I have a client/server setup of two perl scripts. One sits on a box and reads in a log file, and on the other box, it connects to the server and tails the file through the socket. This has worked ok for a while, but it seems to have started acting weird lately. Just to put this out there, there is no connection problem between the two machines outside the scripts, latency is almost nothing, and there is only one hop in a traceroute. It's I have come to the conclusion it is the script. So here we have the script on the server side that is reading the log file and passing data thru the socket:
use strict; use lib '/opt/nextone/lib/perl5/site_perl/5.8.3/'; use IO::Socket; use LogFile; use File::Tail; use Sys::Hostname; use Time::Local; use POSIX qw(strftime); use constant PORT => 9204; use constant PIDFILE => '/var/run/cdr_hup.pid'; use constant USER => 'root'; use constant GROUP => 'root'; use constant CDR_HOME => '/var/cdrs'; # signal handler for child die events $SIG{TERM} = $SIG{INT} = \&do_term; $SIG{HUP} = \&do_hup; our $connection; my $quit=0; my $port = $ARGV[0] || PORT; my @allow=('IP ADDRESS OF INCOMING CONNECTION'); my ($ip_addr, $valid, $log_file); $log_file ='/var/log/log_server'; init_log($log_file); log_notice "Start Server\n"; my $listen_socket = IO::Socket::INET->new(LocalPort => $port, Listen => 20, Proto => 'tcp', ReuseAddr => 1) or die "Can't bind : $@\n";; #log_notice "Can't create a listening socket: $@\n" unless $listen_soc +ket; #my $pid = init_server(PIDFILE,USER,GROUP,$port); log_notice "Server accepting connections on port $port\n"; while (!$quit) { next unless $connection = $listen_socket->accept; log_notice "Client connecting\n"; my $host = $connection->peerhost; foreach $ip_addr (@allow) { $valid=0; if($host == $ip_addr) { $valid=$ip_addr; last; } } if(!$valid) { $connection->close; log_warn "Connection attempt from invalid host : $host\n"; next; } $connection->setsockopt(SOL_SOCKET, SO_KEEPALIVE=>1); log_notice("Accepting a connection from $host\n"); interact($connection); log_notice("Connection from $host closed\n"); $connection->close; } # End While !quit sub interact { my $sock = shift; STDIN->fdopen($sock,"r") or die "Can't reopen STDIN: $!"; STDOUT->fdopen($sock,"w") or die "Can't reopen STDOUT: $!"; STDERR->fdopen($sock,"w") or die "Can't reopen STDERR: $!"; $| = 1; my ($cdr_path, $cdr_file, $record, $line, $file, $thishost, $filenam +e); $cdr_path="/var/cdrs"; $cdr_file=gen_name(); $record=0; $thishost=hostname; $filename=$cdr_path.'/'.$cdr_file; while ( <$sock> ) { until (-e $filename) { sleep 10; } next unless /\S/; # blank line log_notice "Tailing $filename\n"; $file=File::Tail->new(name=>$filename, maxinterval=>5, interval= +>1, tail=>-1, errmode=>\&do_exit); while ( defined($line=$file->read) ) { $record++; print $sock $record.";".$cdr_file.";".$thishost.";".$l +ine; } # End While # 2 &do_term($sock); } # End While sock } #End interact sub do_exit { log_notice("End Tail ... Closing Socket\n"); if (defined $connection) { $connection->shutdown(2); } sleep 2; exec '/usr/local/bin/logserver' || log_warn ("LOGSERVER : Could not exec $0\n"); exit 9; # Should not get here } # End do_exit sub do_term { log_notice("TERM signal received, terminating\n"); exit 0; }
And on the client side, here is just the connection part, most everything else is just how to handle the data it gets:
CONNECT: $socket = IO::Socket::INET->new("$host:$port"); unless ($socket) { # Potential Problem here $retry++; if($retry==3) { log_die ("Client : Error connecting to server ... $@\n +"); } log_notice ("Client : Retry $retry : Sleeping $timeout Second +s\n"); sleep($timeout); # wait a minute; goto CONNECT; } $retry=0; #Reset in case we disconnect early $connected=1; #So that we know... #Start from beginning, uncomment if you don't want too $recover_from=is_recovery($dbh); $recover_from++; $rec_num=0; $stCnt->finish; if ($recover_mode) { log_notice ("Client : Begin recovery mode from record number $ +recover_from\n"); } $SIG{TERM}=$SIG{INT}=$SIG{PIPE}=$SIG{CHLD} = sub { my $sig=shift; if (defined $sth) { $sth->finish; } if (defined $dbh ) { $dbh->disconnect; } if (defined $socket) { $socket->shutdown(2); } sleep(5); do { $pid = waitpid(-1,&WNOHANG); } until $pid == -1; log_notice ("Client : Caught signal SIG$sig\n"); if($sig eq 'PIPE') { log_notice ("Connection Ended\n"); } if($sig eq 'TERM' || $sig eq 'HUP' || $sig eq 'INT') { log_notice ("User Cancelling?\n"); exit 1; #User cancelling job ? } if ($connected) { ############################################################# # First check to see if this is early termination # Should only terminate at midnight. If this is # an early termination the assumption is that the # system we were connected to has croaked. If the # current time is less than $midnight then we try # to exec to ourself (RESTART..) where, hopefully # we will reconnect and recover where we left off # or give up after three times the $timeout. ############################################################# $timestamp=time; if($timestamp < $midnight) { log_notice ("Client : Restarting $0\n"); log_error("End Processing $src_cdr_file\n"); exec '/home/$0' || log_warn ("Client : Could not exec $0\n"); exit 6; # Something is wrong if this exit is taken } # End if $timestamp log_notice ("Client : Normal Termination\n"); log_error("End Processing $src_cdr_file\n\n"); exec '/usr/bin/perl', '/home/client' || log_warn ("Client : Could not exec $0\n"); exit 7; # Something is wrong if this exit is taken } # End if $connected }; # End anonymous sub print $socket "tail\n"; # Rock n Roll
Now I commented out the part where it does the is_recovery subroutine. The reason I did this is because the pipe between the client and server is never broken, but it just stops updating. If I restart the process, the recovery subroutine checks to see what line I was just on when the process was shut down and continues on from there. Because there seems to be some kind of timeout, or problem, I have it start from the beginning of the file to make sure it didn't miss anything. Also, I kept restarting the process and with the recovery part in there, and it wouldn't continue reading in files. May be a flaw there. Any ideas? I really need help on this one. Just doesn't make sense why few times a week now it just stops reading, but none of the processes die...

Replies are listed 'Best First'.
Re: Weird IO::Socket problem
by thparkth (Beadle) on Jan 14, 2008 at 18:30 UTC
    I think it would be useful for you to narrow the problem down. From what I understand, the client is connected - and remains connected - to the server, but the data stops flowing over the connection despite the fact that the file you're tailing continues to grow.

    I suggest writing a quick test script to run File::Tail on the file in exactly the same way as you do in interact():

    $filename=... $file=File::Tail->new(name=>$filename, maxinterval=>5, interval=+>1, t +ail=>-1, errmode=>\&do_exit); while ( defined($line=$file->read) ) { print localtime()." ".$line; }
    I suggest you run this on the server host, in parallel with your existing client/server code. Redirect the output to some file. If the problem recurs, you check to see if this little test script also failed. If so, you know the problem is with tail. If not, it's most likely to be a network issue.

    Alternatively of course, you could add some extra logging to your existing server code. You'll probably end up having to do that anyway.

    In general I don't think you can collect enough debugging information in a situation like this. Make every part of your code print out something to a log to tell you what it's doing.

      OK I caught the interface dropping packets. It's a Gig-E interface, and it does retransmit dropped packets. My question is what can I do about this on the script side? It seems as though if packets get dropped a lot, the script pauses and doesn't recover. I would like it to be able to capture re-transmitted packets and continue on where it left off. Is this possible with IO::Socket? Is it possible in POE? I am eventually going to rewrite the whole process using POE, but this may speed up the process.
        A small number of dropped packets shouldn't be a problem, but if you are having spikes where very many packets are being dropped - the majority, say - this will kill the performance of your TCP connection, possibly to the point where it's indistinguishable from "not working at all."

        If you want to debug the TCP side of it I suggest using tcpdump or maybe ethereal to see what happens during the connection. You will need to understand the TCP protocol in some detail (but this is WORTH knowing!) and you might have to collect gigs and gigs of capture log before you actually see the fault occur - but it should be revealing.

        On the other hand, if you're fairly certain that the problem you're seeing corresponds with major spikes of packet loss, there's not much point spending a lot of effort to confirm something that you already know - the network is broken and that's hurting your script.

        This really isn't a Perl issue any more; but you should not be seeing packet loss in a typical GigE LAN environment. Something is probably wrong. Maybe your host can't actually keep up with the network traffic (if you're using a 486 machine, I'd start with that!). Maybe the cabling isn't up to CAT6 spec; maybe the switch fabric is overloaded; maybe someone has an industrial compressor on the same circuit as your switch (this has happened to me). Maybe your routing is seriously screwed up and all your supposed LAN traffic is going over and back on a 64K satellite link to Helsinki (this has also happened to me!).

        But I'd certainly try to pin down the cause of that packet loss.

        You might also want to check your duplex settings. Normally GigE should be set to full duplex on both sides, and not to autonegotiate.

        One of the most common findings when a Ethernet connection is losing packets or is just slow, is that one side or the other was set to autonegotiate and wound up in half duplex. If both sides start to transmit at about the same time, the half duplex side will see a collision, transmit a jamming signal, and assume that the other side is going to back off and retransmit... but the full duplex side is not looking for collisions and will never retransmit.

      I will give this a shot. I appreciate the suggestion.
Re: Weird IO::Socket problem
by roboticus (Chancellor) on Jan 14, 2008 at 17:04 UTC
    Perhaps log file rotation is biting you?

    ...roboticus

      It's not the rotation. As a hack we have the process stop at midnight and restart one minute later. It dies at random times. There is one log per day that starts at midnight, and we start reading it at 0:01
        Perhaps the switch/router/net device between computers is timing out the connection and not telling you?

        You might try putting ethereal on both segments and have it log the traffic between the two computers so you can catch it when it stops working. Then you might be able to determine if it's a client problem, server problem or network problem.

        ...roboticus

Re: Weird IO::Socket problem
by hallikpapa (Scribe) on Jan 15, 2008 at 17:20 UTC
    In case anyone was wondering, it was defintely a network problem. I am speeding up the process to redo this in POE so it can spawn child processes, but basically it was streaming way too much uneeded data, and there were lots of errors on the interface. I filtered out a lot of crap and BOOSH, loookin good.