Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a program that performs tasks in parallel. It forks into some number of processes, and the children watch for specially formatted text files that tell them exactly what to do. Since each text file should only be worked on once, I thought I should get an exclusive file lock on each file before actually doing the work. But files are sometimes worked on more than once, despite the exclusive flock returning true. Here's a simplified version of what I'm doing that recreates the error on my computer.
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw|time sleep|;
use Fcntl qw|:flock|;

mkdir "tmp" or warn $!;
chdir "tmp" or die $!;

for (1 .. 500){
    my $pid = fork();
    defined($pid) or die "fork failed: $!";
    if ($pid == 0) {
        child_process();
        exit;
    }
}
print "DONE FORKING\n";
while (wait() > -1) {}

sub child_process{
    while (1){
        while (my $file = glob "*.tmp"){
            open my $FH, '>', $file or next; #next if unlinked already
            if (flock($FH, LOCK_EX | LOCK_NB)){
                next unless flock($FH, LOCK_EX | LOCK_NB);
                print time(), " YES! I ($$) got a lock on $file!\n";
                sleep(.1);
                unlink $file;
            }
            close $FH;
        }
        sleep(.1);
    }
}
Running this, waiting for the "DONE FORKING" message and then typing "touch tmp/random.tmp" produces something like this:
1199321503.71728 YES! I (25930) got a lock on random.tmp!
1199321504.01631 YES! I (25827) got a lock on random.tmp!
1199321504.01702 YES! I (25976) got a lock on random.tmp!
I'm under the impression that there should only be one process that gets an exclusive lock.

Re: flock LOCK_EX not locking exclusively
by pc88mxer (Vicar) on Jan 03, 2008 at 06:07 UTC
    Here's what's happening: after one process unlinks the file, other processes are still able to flock their open file handles after the first process closes its file handle:
    child 1                 child 2
    open(...)
                            open(...)
    flock(...)
    unlink(...)
    close(...)
                            flock(...)   (succeeds!)
    ...                     ...

    In general flock and unlink on the same file do not work well together. You really only want to use flock on a file which will always exist. Assuming that in this case you are simulating a pool of worker threads trying to grab a unit of work, you can fix things by checking to see if the file still exists after the flock succeeds.
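    A rough, untested sketch of that "lock file which always exists" idea (the file name queue.lock and the work placeholder are mine, not from the original post):

        use strict;
        use warnings;
        use Fcntl qw|:flock|;

        # One lock file that is never deleted; workers serialize on it while
        # claiming a work file, so no two workers can claim the same file.
        open my $LOCK, '>>', 'queue.lock' or die "can't open queue.lock: $!";

        while (my $file = glob "*.tmp") {
            flock($LOCK, LOCK_EX) or die "flock: $!";   # enter the critical section
            my @work;
            if (-e $file) {
                open my $IN, '<', $file or die $!;
                @work = <$IN>;                          # read the instructions
                close $IN;
                unlink $file;                           # claim it; nobody else will see it
            }
            flock($LOCK, LOCK_UN);                      # leave the critical section

            next unless @work;
            # ... perform the task described by @work ...
        }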

      Checking if the file exists after the flock is not good enough. Assume the check is done with a stat.
      child 1   child 2   child 3
      open
                open
      flock                         (works for obvious reasons)
      stat                          (works for obvious reasons)
      unlink
      close
                          open      (creates a new file)
                          flock     (works, no contention on new file)
                          stat      (works, new file exists)
                flock               (works, no contention on old file)
                stat                (works, new file exists)
      
        Well, if your open can create files, then obviously you'll have a problem. I was assuming that open would be used with mode <.

        In general, if file names can never reappear after being deleted you can use this method. That's why I say that flock generally doesn't work well with unlink.
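        Under that assumption, the original loop could be rewritten roughly like this (an untested sketch; the key point is the '<' mode, which never creates or truncates a file):

            while (my $file = glob "*.tmp") {
                open my $FH, '<', $file or next;       # already unlinked? skip it
                next unless flock($FH, LOCK_EX | LOCK_NB);
                next unless -e $file;                  # unlinked after we opened it? skip it
                print time(), " YES! I ($$) got a lock on $file!\n";
                # ... do the work ...
                unlink $file;                          # delete while still holding the lock
                close $FH;                             # releases the lock
            }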

Re: flock LOCK_EX not locking exclusively
by Anonymous Monk on Jan 03, 2008 at 00:57 UTC
    It should be "open my $FH, '>', $file or next". It still produces the same error, though. --The Author
      I meant it should be
      open my $FH, '<', $file or next;
Re: flock LOCK_EX not locking exclusively
by Anonymous Monk on Jan 03, 2008 at 02:13 UTC
    I think I figured it out. Sometimes two processes open the file with the same filehandle, and they therefore share the lock because it's tied to the filehandle.

    So the questions become:
    1) Why do two processes have the same filehandle in some instances, but not in others?
    2) How do I ensure that each process uses its own filehandle?

      I think I figured it out. Sometimes two processes open the file with the same filehandle, and they therefore share the lock because it's tied to the filehandle.

      No. open never returns "the same file handle" as one already in use by some other process, of course (both processes could certainly get file descriptor 3, but that doesn't make the file handles related).

      The biggest mistake in your code is using '>' in open my $FH, '>', $file or next;.

      What is happening is that several processes notice the existence of random.tmp. One of them, "X", gets there first and overwrites[1] the file, then locks it, but then gets suspended while the other processes take turns trying to get some work done. A short time later, one of the other processes that noticed the existence of this file, "Y", finally gets a chance to run and overwrites the file before getting suspended. At this point X and Y each have a file handle open to the same file and X has a lock on it.

      [1] I'm using "overwrite" here as shorthand for what opening with ">" does. But this doesn't create a new version of the existing file; it truncates the existing file. If the file doesn't exist, however, it will create a new version of it.

      A short while later X runs a bit more and deletes the file and unlocks it. Then a third process, "Z", tries to open the same file that was just deleted but since ">" is used a new random.tmp is created and Z now has a handle open to it. So Y has a handle open to the original random.tmp that has just been deleted by X while Z has a handle open to the new random.tmp (and nobody has a lock). Since Y and Z have handles open to different files, they both manage to lock their handles about the same time and report this success.

      Of course, during this same time, a whole bunch of other processes are overwriting either of those two versions of random.tmp and silently not reporting that they couldn't get a lock. In fact, it is probably one process that creates the second random.tmp then a fourth process manages to open it and get the lock, but that doesn't change any of your output.

      The other mistake is thinking that waiting .1 seconds is nearly long enough to ensure that 500 (!) processes all have time to finish dealing with the file that they noticed. If you replaced ">" with "+<", then you'd likely get fewer reports of success locking that file, and those reports would all be at least .1 seconds apart. Given that systems can get bogged down, it is usually best to have a 2-phase technique so that you can't be burned by a process taking much longer than you expected to get from step 1 to step 2. A third mistake is not waiting between deleting the file and unlocking it.

      Since you unlink the file before you unlock it, you can prevent some of your race conditions by having the process that obtains a lock on a file check that the file that it has open hasn't been deleted since it got the lock. So, after you get the lock, instead of uselessly trying to get the lock a second time, stat the file by name and stat the file handle that you have open and verify that they refer to the same file (same inode number returned). This presumes that you don't have a malicious extra process doing "ln random.tmp save.tmp; sleep 1; ln save.tmp random.tmp", of course. But a new random.tmp being created (not a new link to the old random.tmp) would not cause any problems.
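      Something like this untested sketch shows that check, keeping the shape of the original loop but opening with '<' and comparing device and inode numbers (fields 0 and 1 of stat):

          while (my $file = glob "*.tmp") {
              open my $FH, '<', $file or next;            # never create or truncate
              next unless flock($FH, LOCK_EX | LOCK_NB);

              my @by_name   = stat $file;                 # whatever the name points at now
              my @by_handle = stat $FH;                   # the file we actually locked
              next unless @by_name
                      and $by_name[0] == $by_handle[0]    # same device
                      and $by_name[1] == $by_handle[1];   # same inode

              print time(), " YES! I ($$) got a lock on $file!\n";
              # ... do the work ...
              unlink $file;                               # delete before releasing the lock
              close $FH;
          }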

      - tye