Race condition in my cron daemon

clinton has asked for the wisdom of the Perl Monks concerning the following question:

I am using Schedule::Cron to run a cron daemon which executes some maintenance jobs, such as sending emails, clearing caches etc.

But I seem to have a race condition somewhere, so that everything locks up on occasion, and I can't spot where it is going wrong.

The code that locks up exists because I want to:

Check whether the last incarnation of the job is still running by checking for the PID
if so, increment $Starts for that job
if it is STILL running after 5 attempts, then complain, cos each job shouldn't take that long

This requires some communication between the parent and child processes, which in turn requires a lock on the database (MySQL).
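The PID check and retry counter described in the list can be sketched on their own, without the database. This is only a minimal illustration: `pid_is_alive`, `should_start` and `$MAX_SKIPS` are hypothetical names, and the last PID is passed in directly instead of being read from the jobs table.

```perl
use strict;
use warnings;

my %Starts;          # per-job count of skipped starts, as in the post
my $MAX_SKIPS = 5;   # threshold taken from the description above

# True if a process with this PID exists.
# Signal 0 delivers nothing; the kernel only checks whether it could.
sub pid_is_alive {
    my ($pid) = @_;
    return 0 unless $pid;
    return 1 if kill 0 => $pid;
    return $!{EPERM} ? 1 : 0;   # process exists but belongs to another user
}

# Returns 'skip' while the previous run is still alive, 'start' otherwise.
sub should_start {
    my ($name, $last_pid) = @_;
    if (pid_is_alive($last_pid)) {
        $Starts{$name}++;
        warn "Tried to start $name $Starts{$name} times\n"
            if $Starts{$name} > $MAX_SKIPS;
        return 'skip';
    }
    $Starts{$name} = 0;
    return 'start';
}
```

Note that a PID can be reused by an unrelated process after the original job exits, so a check like this can report a finished job as still running.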

Help appreciated:

    # Used for getting lock in the database
    my $lockname = $class.'::'.$handler;

    # Get exclusive lock on jobs table
    # clear_cache deletes the cached db handles in DBI
    # so that the next connect should get a fresh handle
    MyStuff::DB::clear_cache();
    my $db = MyStuff::DB->connect('default');
    my $lock = $db->dbh; 
    $lock->do('LOCK TABLES jobs WRITE');

    # Get existing info for job status
    $db = $db->select_row({
        bind    => ['CHAR'],
        params  => [$lockname],
        SQL     => 'SELECT pid,last_id,last_run||"1970-01-01" FROM jobs WHERE name = ?'
    });
    my $results = $db->results;

    # If job is still running
    if ($results && $results->{pid} 
        && kill 0 => $results->{pid}) {
        logmsg("Still running");
        $lock->do('UNLOCK TABLES');
        $Starts{$lockname}++;
        if ($Starts{$lockname}>5) {
            warn "Tried to start $class => $handler $Starts{$lockname} times";
        }
        return;
    }
    
    # Start new job
    logmsg ("Starting '$lockname'");
    
    my $current_time = timestamp();

    # Last time the job was run
    # the last ID processed, or the last 
    # time it was run was....    
    my $last_run = $results->{last_run}
        ? $results->{last_run}->strftime('%F %T')
        : '1970-01-01';
    my $last = {
        id      => $results->{last_id}||0,
        run     => $last_run,
        time    => $current_time,
    };
    
    # Fork you
    my $pid = fork;
    unless (defined $pid) {
        warn "Couldn't fork to start $class => $handler : $!";
        return;
    }

    if ($pid) {
        # Parent
        # Update jobs table with new PID
        # THIS IS THE POINT AT WHICH IT HANGS
        # AND IN THE DB LOGS, IT HAS THE SAME CONNECTION
        # ID AS THE HANDLE WHICH OBTAINED THE LOCK

        $db = $db->replace({
            bind    => ['CHAR','INT','INT','DATE'],
            params  => [$lockname,
                        $pid,
                        $last->{id},
                        $last->{run}],
            SQL     => <<SQL});
                REPLACE INTO jobs (
                    name
                    ,pid
                    ,last_id
                    ,last_run
                ) VALUES (?,?,?,?)
SQL

        $lock->do('UNLOCK TABLES');
        $Starts{$lockname}=0;
        return;
    }

    # Child
    # Get rid of old database connections
    MyStuff::DB::clear_cache();

    chdir '/' or die $!;
    open STDIN, '/dev/null' or die $!;


    # Checking that my job has had my PID set for me
    $db = MyStuff::DB->connect('default.write');

    $db->select_row({
        bind    => ['CHAR'],
        params  => [$lockname],
        SQL     => <<SQL});
            SELECT pid
            FROM jobs
            WHERE name = ?
SQL

    # $results was already declared above, so assign rather than redeclare
    $results = $db->results;
    unless ($results && $results->{pid} == $$) {
        die "Jobs table not locked for me"
    }

    # Run handler
    eval {$class->$handler($last,$args)};
    if ($@) {
        die "Error running '$lockname' : $@";
    }
    
    # Reset jobs table
    $db = MyStuff::DB->update({
        db      => 'default.write',
        bind    => ['INT','DATE','CHAR','INT'],
        params  => [$last->{id},
                    $last->{time},
                    $lockname,
                    $$],
        SQL     => <<SQL});
            UPDATE jobs
            SET pid = 0
                , last_id = ?
                , last_run = ?
            WHERE name = ?
                  AND pid = ?
SQL

    logmsg ("Ending '$lockname'");

    exit 0;
}

Replies are listed 'Best First'.
Re: Race condition in my cron daemon by perrin (Chancellor) on Mar 20, 2006 at 18:17 UTC
Handling database connections in a forking server is very difficult to get right. In this case I recommend that you set the InactiveDestroy property on the connection after forking, and then open a new connection to use for any further database interaction. Do not try to continue using a handle opened in the parent process from the child process.
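A minimal sketch of that advice, under some assumptions: `fork_with_fresh_handle` is a hypothetical helper, and `$connect` and `$child_work` are caller-supplied callbacks (in the original code the connect step would be something like `MyStuff::DB->connect`). `InactiveDestroy` itself is a real DBI handle attribute.

```perl
use strict;
use warnings;

# Fork, and in the child mark the inherited handle so that destroying it
# does not shut down the connection the parent is still using.
sub fork_with_fresh_handle {
    my ($parent_dbh, $connect, $child_work) = @_;

    my $pid = fork;
    die "Couldn't fork: $!" unless defined $pid;
    return $pid if $pid;    # parent: carry on with $parent_dbh as before

    # Child: set the flag BEFORE anything can drop the inherited handle
    $parent_dbh->{InactiveDestroy} = 1;

    # All further database work in the child goes through a new connection
    my $child_dbh = $connect->();
    $child_work->($child_dbh);
    exit 0;
}
```

In the code from the question, the flag would have to be set before the child calls `MyStuff::DB::clear_cache()`, since that call is what drops the inherited handle. Newer DBI releases also have an `AutoInactiveDestroy` attribute intended for the same situation.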
Re^2: Race condition in my cron daemon by clinton (Priest) on Mar 20, 2006 at 18:28 UTC
Perrin - thanks - I'll give InactiveDestroy a go. I didn't know about that. But I have a feeling that that isn't the problem, because the same connection ID is used to obtain the lock and to update the jobs table (both of which happen in the parent process) - and this I can see in the database log. The REPLACE statement finally runs when I shut down the server, so something somewhere is hanging onto that lock. As far as reusing the connection in the child goes, I specifically clear out the DBI connection cache and request a new connection in the child. Again, in the logs, I can see that the parent and child are using different, new, connections.
Re^3: Race condition in my cron daemon by tilly (Archbishop) on Mar 21, 2006 at 06:22 UTC
I strongly suggest following perrin's advice, and not discounting it unless it doesn't work. The likely problem that perrin noted is that the database server winds up being talked to by both parent and child at the same time. And the database gets confused, resulting in unpredictable behaviour. It isn't visible to you here because the race is from an implicit action that you don't see in your code. When you call MyStuff::DB::clear_cache() you remove the database handle in the child. That calls the handle's DESTROY method, which is likely to do cleanup, including telling the database, "I'm all done here." If at the same time the parent is trying to tell the database "Please do this work" the database can get all confused in a million ways. For instance the two messages might confuse the database into thinking that it hasn't yet received the full message to act on so it is waiting for the rest of the message, while the parent process is waiting for a response - leading to a hang. Use the InactiveDestroy parameter and the problem goes away because the child is no longer talking to the database behind the parent's back.
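The implicit step can be illustrated with a toy model. `ToyHandle` below is a made-up stand-in, not DBI's real implementation, and `@WIRE` merely records what a real handle would send to the server. The point is only that emptying a cache triggers DESTROY without any visible call.

```perl
use strict;
use warnings;

{
    package ToyHandle;   # hypothetical stand-in for a database handle

    our @WIRE;           # messages "sent to the server"

    sub new { my ($class) = @_; return bless { InactiveDestroy => 0 }, $class }

    # Perl calls this automatically when the last reference goes away
    sub DESTROY {
        my ($self) = @_;
        return if $self->{InactiveDestroy};   # stay silent, leave the socket alone
        push @WIRE, 'disconnect';
    }
}

# Put a handle in a cache, empty the cache, report what went over the wire
sub clear_and_capture {
    my ($inactive) = @_;
    @ToyHandle::WIRE = ();
    my %cache = (default => ToyHandle->new);
    $cache{default}{InactiveDestroy} = $inactive;
    %cache = ();         # like clear_cache(): DESTROY fires here
    return @ToyHandle::WIRE;
}

my @plain   = clear_and_capture(0);
my @guarded = clear_and_capture(1);
print "plain clear sent: @plain\n";
print "guarded clear sent: @guarded\n";
```

In the forked child the handle shares its socket with the parent, so a message like the one in `@plain` would travel over the very connection the parent is using for its REPLACE.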
Re^2: Race condition in my cron daemon by clinton (Priest) on Mar 22, 2006 at 12:33 UTC
I have added a `$dbh->{InactiveDestroy} = 1` in the child process, and so far so good. It is sporadic, so I won't know that it is fixed until it has run for a while longer, but looking good so far. Many thanks Perrin
Re^3: Race condition in my cron daemon by perrin (Chancellor) on Mar 22, 2006 at 20:48 UTC
Based on your description, it may be something else. If that doesn't fix it, let us know.
Re^4: Race condition in my cron daemon by clinton (Priest) on Apr 10, 2006 at 19:02 UTC
Re^5: Race condition in my cron daemon by perrin (Chancellor) on Apr 17, 2006 at 15:17 UTC
Re^4: Race condition in my cron daemon by clinton (Priest) on Mar 27, 2006 at 10:04 UTC
Re: Race condition in my cron daemon by lima1 (Curate) on Mar 20, 2006 at 18:20 UTC
I use Proc::PID_File for my cron jobs. No problems with that so far. Why do you need this PID checking with a database? UPDATE: Should have read your code better... Ignore this post.