clinton has asked for the wisdom of the Perl Monks concerning the following question:

I am using Schedule::Cron to run a cron daemon which executes some maintenance jobs, such as sending emails, clearing caches etc.

But I seem to have a race condition somewhere, so everything locks up on occasion, and I can't spot where it is going wrong.

The reason that the code with the lockup exists is because I want to:

This requires some semi-communication between the parent and child processes, which in turn requires a lock on the database (MySQL).

Help appreciated:

# Used for getting lock in the database
my $lockname = $class.'::'.$handler;

# Get exclusive lock on jobs table
# clear_cache deletes the cached db handles in DBI
# so that the next connect should get a fresh handle
MyStuff::DB::clear_cache();
my $db = MyStuff::DB->connect('default');
my $lock = $db->dbh;
$lock->do('LOCK TABLES jobs WRITE');

# Get existing info for job status
$db = $db->select_row({
    bind   => ['CHAR'],
    params => [$lockname],
    SQL    => 'SELECT pid,last_id,last_run||"1970-01-01" FROM jobs WHERE name = ?'
});
my $results = $db->results;

# If job is still running
if ($results && $results->{pid} && kill 0 => $results->{pid}) {
    logmsg("Still running");
    $lock->do('UNLOCK TABLES');
    $Starts{$lockname}++;
    if ($Starts{$lockname}>5) {
        warn "Tried to start $class => $handler $Starts{$lockname} times";
    }
    return;
}

# Start new job
logmsg ("Starting '$lockname'");
my $current_time = timestamp();

# Last time the job was run
# the last ID processed, or the last
# time it was run was....
my $last_run = $results->{last_run}
    ? $results->{last_run}->strftime('%F %T')
    : '1970-01-01';
my $last = {
    id   => $results->{last_id}||0,
    run  => $last_run,
    time => $current_time,
};

# Fork you
my $pid = fork;
unless (defined $pid) {
    warn "Couldn't fork to start $class => $handler : $!";
    return;
}

if ($pid) {
    # Parent
    # Update jobs table with new PID
    # THIS IS THE POINT AT WHICH IT HANGS
    # AND IN THE DB LOGS, IT HAS THE SAME CONNECTION
    # ID AS THE HANDLE WHICH OBTAINED THE LOCK
    $db = $db->replace({
        bind   => ['CHAR','INT','INT','DATE'],
        params => [$lockname, $pid, $last->{id}, $last->{run}],
        SQL    => <<SQL});
REPLACE INTO jobs (
    name
    ,pid
    ,last_id
    ,last_run
) VALUES (?,?,?,?)
SQL
    $lock->do('UNLOCK TABLES');
    $Starts{$lockname}=0;
    return;
}

# Child
# Get rid of old database connections
MyStuff::DB::clear_cache();
chdir '/' or die $!;
open STDIN, '/dev/null' or die $!;

# Checking that my job has had my PID set for me
$db = MyStuff::DB->connect('default.write');
$db->select_row({
    bind   => ['CHAR'],
    params => [$lockname],
    SQL    => <<SQL});
SELECT pid FROM jobs WHERE name = ?
SQL
my $results = $db->results;
unless ($results && $results->{pid} == $$) {
    die "Jobs table not locked for me"
}

# Run handler
eval {$class->$handler($last,$args)};
if ($@) {
    die "Error running '$lockname' : $@";
}

# Reset jobs table
$db = MyStuff::DB->update({
    db     => 'default.write',
    bind   => ['INT','DATE','CHAR','INT'],
    params => [$last->{id}, $last->{time}, $lockname, $$],
    SQL    => <<SQL});
UPDATE jobs
SET pid = 0
    , last_id = ?
    , last_run = ?
WHERE name = ?
    AND pid = ?
SQL
logmsg ("Ending '$lockname'");
exit 0;
}

Replies are listed 'Best First'.
Re: Race condition in my cron daemon
by perrin (Chancellor) on Mar 20, 2006 at 18:17 UTC
    Handling database connections in a forking server is very difficult to get right. In this case I recommend that you set the InactiveDestroy property on the connection after forking, and then open a new connection to use for any further database interaction. Do not try to continue using a handle opened in the parent process from the child process.
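A minimal sketch of the pattern perrin describes, assuming plain DBI rather than the poster's MyStuff::DB wrapper. The sub name fork_job, the DSN, the credentials and the $run_job callback are all hypothetical. The flag presumably has to be set in the child before anything (such as clear_cache()) drops the inherited handle.

```perl
use strict;
use warnings;

# Sketch only: fork_job, the DSN, the credentials and $run_job are
# hypothetical names, not part of the original post.
sub fork_job {
    my ($parent_dbh, $run_job) = @_;

    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    return $pid if $pid;    # parent carries on using $parent_dbh

    # Child: make sure destroying the inherited handle does not
    # disconnect the parent's session on the server.
    $parent_dbh->{InactiveDestroy} = 1;

    # Child: open a separate connection for all further database work.
    # DBI is already loaded by whoever created $parent_dbh, so this
    # require is a harmless no-op in practice.
    require DBI;
    my $child_dbh = DBI->connect(
        'dbi:mysql:database=mystuff', 'user', 'password',
        { RaiseError => 1, AutoCommit => 1 },
    );

    $run_job->($child_dbh);
    $child_dbh->disconnect;
    exit 0;
}
```

The parent would call fork_job($dbh, sub { ... }) and get the child's PID back; only the child ever touches $child_dbh.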
      Perrin - thanks - I'll give the InactiveDestroy a go. I didn't know about that.

      But I have a feeling that that isn't the problem, because the same connection ID is used to obtain the lock and to update the jobs table (both of which happen in the parent process) - and this I can see in the database log. The REPLACE statement finally runs when I shut down the server, so something somewhere is hanging onto that lock.

      As far as reusing the connection in the child, I specifically clear out the DBI connection cache and request a new connection in the child. Again, in the logs, I can see that the parent and child are using different, new, connections.

        I strongly suggest following perrin's advice, and not discounting it unless it doesn't work.

        The likely problem that perrin noted is that the database server winds up being talked to by both parent and child at the same time. And the database gets confused, resulting in unpredictable behaviour.

        It isn't visible to you here because the race is from an implicit action that you don't see in your code. When you call MyStuff::DB::clear_cache() you remove the database handle in the child. That calls the handle's DESTROY method, which is likely to do cleanup, including telling the database, "I'm all done here." If at the same time the parent is trying to tell the database "Please do this work" the database can get all confused in a million ways. For instance the two messages might confuse the database into thinking that it hasn't yet received the full message to act on so it is waiting for the rest of the message, while the parent process is waiting for a response - leading to a hang.

        Use the InactiveDestroy parameter and the problem goes away because the child is no longer talking to the database behind the parent's back.
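The mechanism can be illustrated without a database. The sketch below uses a hypothetical FakeHandle class as a stand-in for a database handle: its DESTROY writes a goodbye message down a shared pipe unless an inactive_destroy flag is set. The flag only mimics the idea behind DBI's InactiveDestroy; it is not DBI's implementation. The pipe plays the role of the connection to the server.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical stand-in for a database handle (not DBI itself).
package FakeHandle;

sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh, inactive_destroy => 0 }, $class;
}

sub DESTROY {
    my ($self) = @_;
    return if $self->{inactive_destroy};    # stay silent when flagged
    syswrite $self->{fh}, "QUIT from pid $$\n";
}

package main;

# Returns how many goodbye messages the "server" end of the pipe received
# from the child while the parent still considered the handle its own.
sub goodbyes_seen_by_server {
    my ($child_sets_flag) = @_;

    pipe(my $server_end, my $client_end) or die "pipe failed: $!";
    my $handle = FakeHandle->new($client_end);

    my $pid = fork;
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: dropping the inherited handle triggers DESTROY here,
        # much as clearing a handle cache would.
        $handle->{inactive_destroy} = 1 if $child_sets_flag;
        undef $handle;
        POSIX::_exit(0);    # skip END blocks and buffered output
    }

    waitpid $pid, 0;

    # Parent: silence its own goodbye, then close the write end so the
    # read below reaches end-of-file.
    $handle->{inactive_destroy} = 1;
    undef $handle;
    close $client_end;

    my @lines = <$server_end>;
    return scalar grep { /^QUIT/ } @lines;
}

printf "without flag: %d goodbye(s)\n", goodbyes_seen_by_server(0);
printf "with flag:    %d goodbye(s)\n", goodbyes_seen_by_server(1);
```

Without the flag, the child's goodbye arrives on the connection the parent is still using; with it, the child stays quiet. A real database protocol is far more involved, but the implicit DESTROY in the child is the same kind of hidden action.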

      I have added a $dbh->{InactiveDestroy} = 1 in the child process, and so far so good. It is sporadic, so I won't know that it is fixed until it has run for a while longer, but looking good so far.

      Many thanks Perrin

        Based on your description, it may be something else. If that doesn't fix it, let us know.
Re: Race condition in my cron daemon
by lima1 (Curate) on Mar 20, 2006 at 18:20 UTC
    I use Proc::PID_File for my cronjobs. No problems with that until now.

    why do you need this PID checking with a database?

    UPDATE: Should have read your code better...Ignore this post