Re: NFS File Locking
by hippo (Archbishop) on Apr 09, 2026 at 13:32 UTC
I am wondering if this is a known problem and if there is a preferred way to work around this.
The workaround (in the scenario where you are unlinking a worked-upon file) would be to lock a file other than the one on which you are operating. All your workers should be attempting to lock $lock_file_name and only once the lock is achieved should they even begin to look at $file_name and operate on it. This allows operations such as creation and deletion of the object file which could otherwise be troublesome.
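A minimal sketch of that idea, assuming `flock` works on your mount (the names $lock_file_name, $file_name and the helper with_lock are placeholders, not anything from the original post):

```perl
use strict;
use warnings;
use Fcntl qw(LOCK_EX LOCK_UN);

# Serialise access to the data file by locking a *separate* lock file.
# The data file can then be created, truncated, or unlinked freely,
# because the lock file itself is never removed.
sub with_lock {
    my ($lock_file_name, $code) = @_;
    open my $lock_fh, '>>', $lock_file_name
        or die "Cannot open $lock_file_name: $!";
    flock($lock_fh, LOCK_EX) or die "Cannot lock $lock_file_name: $!";
    my @result = eval { $code->() };
    my $err = $@;
    flock($lock_fh, LOCK_UN);
    close $lock_fh;
    die $err if $err;
    return @result;
}

# Every worker locks the same lock file before touching the data file:
# with_lock('work.lock', sub { unlink 'work.dat' or die $! });
```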
All that said, it does still sound a bit like an XY Problem perhaps along the lines of that to which brother InfiniteSilence alludes.
Possibly completely irrelevant, but: I want to say that the folk wisdom was to use a directory as a lock when attempting any sort of locking over NFS, as directory operations were a bit more robust pre-NFSv4. I haven't tried locking anything on NFS in anger in aeons, so YMMV, contents may have settled during shipping, yadda yadda.
The cake is a lie.
Re: NFS File Locking
by choroba (Cardinal) on Apr 08, 2026 at 13:15 UTC
What version of NFS are you running? 3 and 4 are quite different when it comes to locking.
How is the NFS configured? What are the values of lock and local_lock?
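On Linux you can read the effective options for each mount straight from /proc/self/mounts; a quick sketch (Linux-specific, and the sample mount point in the test is made up):

```perl
use strict;
use warnings;

# Return { mountpoint => options } for every NFS mount among the given
# /proc/self/mounts lines, so you can see whether e.g. local_lock=all
# or local_lock=none is in force.
sub nfs_mount_opts {
    my @lines = @_;
    my %opts;
    for my $line (@lines) {
        my ($dev, $point, $type, $opt) = split ' ', $line;
        next unless defined $type && $type =~ /^nfs/;
        $opts{$point} = $opt;
    }
    return %opts;
}

open my $mounts, '<', '/proc/self/mounts'
    or die "Cannot read /proc/self/mounts: $!";
my %nfs = nfs_mount_opts(<$mounts>);
close $mounts;
print "$_: $nfs{$_}\n" for sort keys %nfs;
```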
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
I just use "mount -t nfs4" to mount the file system. Locking parameters would be whatever the defaults are.
I am sure it is not using local locking. If it were using local locking, then things wouldn't work at all.
The locks appear exclusive between machines, except for rare cases right around the time when the file is deleted.
Re: NFS File Locking
by duelafn (Parson) on Apr 09, 2026 at 11:48 UTC
If you unlink after opening but before processing, that would reduce the race window. On a local filesystem that fully removes the file from the directory listing, but as I understand it, over NFS the client does a rename to ".nfsXXXX" rather than a true unlink. Other processes would ignore such file names. If it were me, I'd experiment with doing the rename manually just to make sure I was certain about the rename pattern. You may also need to periodically clean up or process old .nfsXXXX files in case something crashes during processing.
Update: ... or just flat-out rename before opening (to .processingXXXX-ORIGINALNAME) using the rename as a lock operation?
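A sketch of that rename-as-lock idea (the .processing naming pattern follows the update above; claim_file is a hypothetical helper, and it assumes the work file lives in the current directory):

```perl
use strict;
use warnings;
use Sys::Hostname;

# Claim a work file by atomically renaming it to a name that embeds
# our host and PID. rename() succeeds for exactly one claimant; every
# other process finds the original name gone and moves on.
# Assumes $file_name is a bare name in the current directory.
sub claim_file {
    my ($file_name) = @_;
    my $claimed = sprintf '.processing%s-%d-%s', hostname(), $$, $file_name;
    return rename($file_name, $claimed) ? $claimed : undef;
}
```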
Thanks Dean. I think this is the problem. If a file is deleted on NFS while there are still open file handles, it is temporarily renamed to .nfsXXXX and the file handles remain valid. This can create exactly the race I was observing: Machine A holds a lock; Machine B opens a file handle, but before it can request a lock, Machine A deletes the file and releases the lock; now Machine B gets a lock on the renamed file. Explicitly renaming the file doesn't help here. I think the workaround is just to check that the file still exists (under the original name) after acquiring the lock. If not, just close the file and treat the lock as failed.
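A minimal sketch of that check, assuming the device/inode comparison is meaningful through the NFS client's attribute cache (lock_work_file is a hypothetical helper name):

```perl
use strict;
use warnings;
use Fcntl qw(LOCK_EX);

# Open and lock $file_name, then verify that the name still refers to
# the same file we opened. If the file was unlinked (silly-renamed to
# .nfsXXXX) between open() and flock(), stat of the path fails or
# points at a different inode, and we treat the lock as failed.
sub lock_work_file {
    my ($file_name) = @_;
    open my $fh, '<', $file_name or return;
    flock($fh, LOCK_EX) or do { close $fh; return };
    my @by_handle = stat $fh;
    my @by_name   = stat $file_name;
    unless (@by_name
        && $by_name[0] == $by_handle[0]    # same device
        && $by_name[1] == $by_handle[1]) { # same inode
        close $fh;
        return;
    }
    return $fh;   # locked, and the name still points at this file
}
```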
I didn't mention the reason for doing this in the original question, but the goal is to parallelize work across many machines. Each file represents some work, and the locking makes sure that two machines are not working on the same thing. The problem is that the machines are also used for other things and are often shut down or crash in the middle of doing work. Using "flock" is nice because the NFS locks are automatically released if a machine is reset or crashes. This allows another machine to come along later, pick up the work, and do it again. The file is not deleted or renamed until the work is confirmed complete. Using "rename" or "mkdir" for locking leaves the lock orphaned if the machine holding the lock crashes. Doing the same work twice is not a problem as long as it doesn't happen too often; failing to complete work because the machine doing it crashed in the middle is the problem I want to avoid.
Re: NFS File Locking
by InfiniteSilence (Curate) on Apr 08, 2026 at 23:43 UTC
That code looks like some kind of import program where various clients put some information in a file share and it gets slurped and processed.
My immediate solution would be to identify the unique key information from the files to be processed and put them in a process log of some kind (or preferably a DB) and then run a quick check to see if the data has already been entered before proceeding with the import.
When you find an error (file already processed) you can store that in a junk folder for later evaluation and troubleshooting.
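A sketch of that dedup idea, with a flat file standing in for the process log or DB suggested above (already_processed is a hypothetical helper; a real setup would probably use a database):

```perl
use strict;
use warnings;
use Fcntl qw(LOCK_EX);

# Append-only log of processed keys. Returns true if $key was already
# recorded, otherwise records it and returns false, so a duplicate
# import can be diverted to the junk folder for later evaluation.
sub already_processed {
    my ($log_file, $key) = @_;
    open my $fh, '+>>', $log_file or die "Cannot open $log_file: $!";
    flock($fh, LOCK_EX) or die "Cannot lock $log_file: $!";
    seek $fh, 0, 0;                # rewind to scan existing keys
    while (my $line = <$fh>) {
        chomp $line;
        if ($line eq $key) { close $fh; return 1 }
    }
    print {$fh} "$key\n";          # append mode: writes go to the end
    close $fh;
    return 0;
}
```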
I would create a test server and a bunch of bot-type clients that try to import data by the thousands to see if and when you can recreate the problem. I would also explicitly unlock the file to see if that changes things (code snippet from the POD for flock):
flock($fh, LOCK_UN) or die "Cannot unlock mailbox - $!\n";
Celebrate Intellectual Diversity
Re: NFS File Locking
by Marshall (Canon) on Apr 09, 2026 at 23:28 UTC
I put this NFS race condition question to an AI and its opinion was interesting:
The race condition you are experiencing occurs because flock and NFS have historically poor compatibility.
In many environments, flock only manages local locks that are invisible to other NFS clients.
To resolve this race condition, implement one of the following proven strategies:
1. Switch from flock() to fcntl()
While flock() is often local-only, fcntl() (POSIX locking) is specifically designed to work across a network via the Network Lock Manager (NLM) for NFSv3 or natively in the NFSv4 protocol.
Interaction: In Linux kernels 2.6.12 and newer, flock() on an NFS mount is actually emulated using fcntl() byte-range locks, but this emulation can still
be prone to race conditions during lock upgrades or if the server does not support the specific emulation.
Direct implementation: Using fcntl() directly provides more reliable byte-range locking across multiple clients.
2. Use the "Link-to-Lockfile" Method (Most Reliable)
This is the most portable and robust method for NFS, as it relies on the atomicity of the link() system call, which is better supported across NFS versions than file locking.
Create a unique temporary file on the same NFS filesystem (include the hostname and PID in the name).
Attempt to create a hard link from this unique file to a standard "lock" filename (e.g., myfile.lock).
Check the success:
If link() returns 0, the lock is acquired.
If it fails, use stat() on your unique file to see if its link count is 2; if so, you have the lock.
Release: To unlock, simply unlink() the standard lock file.
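The steps above can be sketched in Perl like this (acquire_link_lock and release_link_lock are hypothetical helper names; note that, as discussed elsewhere in this thread, a crash while holding the lock leaves it orphaned):

```perl
use strict;
use warnings;
use Sys::Hostname;

# Link-to-lockfile: create a unique temp file, then hard-link it to the
# shared lock name. link() is atomic on the server; if the success reply
# was lost in transit, we can still confirm ownership by checking that
# the unique file's link count became 2.
sub acquire_link_lock {
    my ($lock_file) = @_;
    my $unique = sprintf '%s.%s.%d', $lock_file, hostname(), $$;
    open my $fh, '>', $unique or die "Cannot create $unique: $!";
    close $fh;
    my $got = link($unique, $lock_file) || ((stat $unique)[3] == 2);
    unlink $unique;   # the unique name is no longer needed either way
    return $got;
}

sub release_link_lock {
    my ($lock_file) = @_;
    unlink $lock_file or die "Cannot remove $lock_file: $!";
}
```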
3. Verify NFS Version and Mount Options
If you must use standard locking, ensure your configuration supports it:
NFSv4: Recommended, as locking is integrated into the protocol and does not rely on external daemons like lockd or statd.
Check Mount Options: Ensure you are not using the local_lock mount option, which forces all locks to stay local to the client, effectively breaking cross-machine synchronization.
Flush Data: After releasing a lock and before another machine acquires it, ensure the data is flushed to the server to prevent the next client from reading stale, cached data.
4. Alternative: Atomic Directory Creation
On many Linux-based NFS implementations, mkdir is atomic. You can attempt to create a directory as a lock. If the operation succeeds, you hold the lock; if it returns an "already exists" error, another process has it.
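A minimal sketch of directory-based locking (acquire_dir_lock and release_dir_lock are hypothetical names; as noted earlier in the thread, this lock is orphaned if the holder crashes):

```perl
use strict;
use warnings;

# Directory-as-lock: mkdir either creates the directory atomically or
# fails because it already exists, so success means we own the lock.
sub acquire_dir_lock {
    my ($lock_dir) = @_;
    return mkdir($lock_dir) ? 1 : 0;
}

sub release_dir_lock {
    my ($lock_dir) = @_;
    rmdir $lock_dir or die "Cannot remove $lock_dir: $!";
}
```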