in reply to Re: NFS File Locking
in thread NFS File Locking

Thanks Dean. I think this is the problem. If a file on NFS is deleted while there are still open file handles, it is temporarily renamed to .nfsXXX and the open handles remain valid. This can create exactly the race I was observing. Machine A holds a lock. Machine B opens a file handle, but before it can request a lock, Machine A deletes the file and releases the lock. Now Machine B gets the lock on the renamed file. Explicitly renaming the file doesn't help here. I think the workaround is simply to check that the file still exists (under its original name) after acquiring the lock. If it doesn't, close the file and treat the lock attempt as failed.
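For what it's worth, here is a rough sketch of that workaround in Python. It is only an illustration under some assumptions, not tested on a real NFS mount: the names try_claim and lock_and_verify are made up, the file is opened read-write because some NFS clients emulate flock with byte-range locks that need a writable handle for an exclusive lock, and the device/inode comparison is an extra check beyond plain existence (it also catches a file that was deleted and recreated under the same name).

```python
import fcntl
import os


def lock_and_verify(fd, path):
    """Take a non-blocking exclusive flock on fd, then confirm that `path`
    still names the file fd refers to. Returns True only if both succeed.
    The caller keeps ownership of fd either way."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        return False  # another holder has the lock (or locking failed)
    try:
        on_disk = os.stat(path)  # look the file up by its original name
    except FileNotFoundError:
        return False  # deleted (or renamed away) between open and lock
    held = os.fstat(fd)
    # Extra check (assumption): same device and inode as the handle we locked.
    return (on_disk.st_dev, on_disk.st_ino) == (held.st_dev, held.st_ino)


def try_claim(path):
    """Return a locked file descriptor for `path`, or None if the claim failed."""
    try:
        fd = os.open(path, os.O_RDWR)
    except FileNotFoundError:
        return None  # the work was already completed and the file removed
    if lock_and_verify(fd, path):
        return fd
    os.close(fd)  # closing the handle also drops any lock we did get
    return None
```

A caller would keep the returned descriptor open for as long as it is working, and close it (or let the process die) to release the lock.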

I didn't mention the reason for doing this in the original question. The goal is to parallelize work across many machines. Each file represents a unit of work, and the locking makes sure that two machines are not working on the same thing. The problem is that the machines are also used for other things and are often shut down, or crash, in the middle of doing work. Using "flock" is nice because the NFS locks are automatically released if a machine is reset or crashes. This allows another machine to come along later, pick up the work and do it again. The file is not deleted or renamed until the work is confirmed complete. Using "rename" or "mkdir" for locking leaves the lock orphaned if the machine holding it crashes. Doing the same work twice is not a problem as long as it doesn't happen too often. Failing to complete work because the machine doing it crashed in the middle is the problem I want to avoid.
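To make the scheme concrete, here is a minimal Python sketch of what each machine would run. Treat it as an assumption-laden illustration rather than my actual code: run_worker, work_dir and do_work are hypothetical names, one file per unit of work is assumed to live in a shared directory, and skipping names that start with ".nfs" is my guess at how to ignore the temporary renamed files.

```python
import fcntl
import os


def run_worker(work_dir, do_work):
    """Make one pass over work_dir, processing every file this worker can
    claim. Returns the list of file names it completed."""
    done = []
    for name in sorted(os.listdir(work_dir)):
        if name.startswith(".nfs"):
            continue  # temporary name of a deleted-but-still-open file
        path = os.path.join(work_dir, name)
        try:
            fd = os.open(path, os.O_RDWR)
        except OSError:
            continue  # already finished and removed, or not a regular file
        try:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                continue  # another machine is working on this one
            if not os.path.exists(path):
                continue  # deleted between our open and our lock
            do_work(path)
            os.unlink(path)  # removed only after the work has completed
            done.append(name)
        finally:
            os.close(fd)  # closing the handle releases the lock
    return done
```

If do_work raises, or the process dies partway through, the file is left in place and the lock goes away with the handle, so a later pass by any machine can pick the work up again.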