MitchInOmaha has asked for the wisdom of the Perl Monks concerning the following question:

We have a mainframe process that drops off files into a shared directory. I have four Perl processes that attempt to read from the directory and process those files.

I'm looking for a way to synchronize the four Perl processes to keep them from each trying to process the same file.

My original plan was to File::Copy::move() the files to a local /tmp folder, and only the first guy to move the file would win (if your move failed, you just move on to the next file).

It turns out this fails because File::Copy::move() is actually implemented as a multi-step copy/set-attributes/delete sequence with no file locking protecting it. I ended up with multiple processes each copying the same file: one would delete the source file, and another, just finishing its copy, would try to stat() the original (to set the atime and mtime of the new copy) and find it gone.

How can I keep these four processes (each on different servers, incidentally) from clashing and trying to process the same file?

-- Mitch


Re: Synchronizing multiple processes retrieving files out of shared directory
by graff (Chancellor) on Mar 20, 2014 at 03:54 UTC
    Renaming a file within a directory should be a one-step (atomic) operation: rename invokes an OS-level system call, though as such there might be site-specific issues you need to watch out for.

    This approach relies on having some reliable constraint on the names of files being dropped in, so your processes can look for a file name that matches the constraint, and rename it to not match the constraint.

    Still, if you can rely on a renaming pattern that will surely not collide with or be confused with incoming file names, then it should be possible to rule out race conditions. If two processes both spot the same new file, only one of them can succeed in renaming it (naturally, each process should use its own distinctive pattern for renaming files).

    The process that fails when it tries to rename can just go back to looking for another new file.
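    A minimal sketch of that rename-based claim, assuming incoming files end in .dat (the pattern is an assumption; any reliable constraint works) and that each claimed file is later handed to the real processing code:

```perl
use strict;
use warnings;
use Sys::Hostname;

# Try to claim a file by renaming it with this worker's own tag.
# rename() within one directory is atomic, so if two workers race,
# exactly one rename succeeds. Returns the claimed path, or undef.
sub claim_file {
    my ($dir, $file) = @_;
    my $tag     = hostname() . ".$$";      # distinctive per worker
    my $claimed = "$dir/$file.$tag.claimed";
    return rename("$dir/$file", $claimed) ? $claimed : undef;
}

# Scan for unclaimed *.dat files and return whatever we win;
# claimed names no longer match /\.dat\z/, so rescans skip them.
sub scan_and_claim {
    my ($dir) = @_;
    opendir my $dh, $dir or die "opendir $dir: $!";
    my @won;
    for my $file (readdir $dh) {
        next unless $file =~ /\.dat\z/;
        my $claimed = claim_file($dir, $file);
        push @won, $claimed if defined $claimed;
    }
    closedir $dh;
    return @won;
}
```

    A worker that gets undef back from claim_file() simply moves on to the next candidate.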

      My original goal was to avoid placing filename constraints on the content dropped into the shared directory. (The original plan included having multiple customer departments dropping off work, but it's turning out that the work entering this directory all comes from the same location, so we have more control over naming than originally expected.)

      I'm going to start by trying lock files with exclusive locks on them. If a process can't create a new lock file with an exclusive lock, it moves on to the next file in the directory. If that doesn't work, I'll see if I can implement a rename-based process.

      Thanks.

        I think if you were to create sibling or sub-directories on the same disk volume as the one where the depository directory exists, then the Perl-internal rename function will still work (and still be atomic in the same way). Each process could create its own sub- or sibling directory, and the same logic should apply.
Re: Synchronizing multiple processes retrieving files out of shared directory
by thezip (Vicar) on Mar 19, 2014 at 21:50 UTC

    Perhaps you could create a temporary lockfile for a file that's currently in use. Other processes would check for (and ignore) any file that has a lockfile present.

    Simplistic, but it might work...


    *My* tenacity goes to eleven...
      The shared disk is a Hitachi SAN connected to our Linux boxes via NFS.
Re: Synchronizing multiple processes retrieving files out of shared directory
by Anonymous Monk on Mar 19, 2014 at 21:54 UTC

    I assume from what you wrote that this directory is shared over the network - what's the file sharing protocol? Does it support locking files (flock or one of several CPAN modules)? If so, you could control access by having the Perl processes require a lock on a separate control file to get permission to move a file out. Otherwise, one idea might be to have some kind of simple daemon that the different Perl processes connect to which acts as a semaphore.
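    If flock does work over the mount, serializing just the "pick a file and move it out" step through one shared control file might look like this sketch (the control-file path and calling convention are assumptions):

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Run $code while holding an exclusive lock on a shared control file.
# Every worker must lock the same file before moving anything out, so
# only one of them can be choosing a file at any moment. Note that
# flock over NFS requires working lock support on the mount.
sub with_control_lock {
    my ($control, $code) = @_;
    open my $fh, '>>', $control or die "open $control: $!";
    flock($fh, LOCK_EX)         or die "flock $control: $!";
    my @result = $code->();
    close $fh;                   # closing the handle releases the lock
    return @result;
}
```

    Each worker would wrap its scan-and-move step in with_control_lock('/shared/incoming/.control', sub { ... }), keeping the critical section as short as possible.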

Re: Synchronizing multiple processes retrieving files out of shared directory
by Lennotoecom (Pilgrim) on Mar 19, 2014 at 22:06 UTC
    Well, at the risk of reinventing the wheel:
    what about writing a fifth Perl process to manage file names,
    and having it hand those names out to your processes on request?
    I'm not sure.
    They could talk to each other via a simple socket.
    Maybe a database table would do the trick?
    Sorry if these suggestions offend anyone.
      Part of our goals is to have as much redundancy as possible and to avoid having any single process that could impact the operation of the others.

      Having a single process that each of these worker bees have to check in with creates the risk of a single point of failure.

      I'm going to do some testing with creating lock files that carry exclusive locks, so that a competing process can't create a lock file of the same name and is thereby kept from processing the same file.
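      A sketch of that lockfile test, using sysopen with O_EXCL so that creation itself is the atomic step (historically O_EXCL creation was not atomic over NFSv2; NFSv3 and later clients support it, so check your mount before relying on this):

```perl
use strict;
use warnings;
use Fcntl qw(O_CREAT O_EXCL O_WRONLY);

# Atomically create "$path.lock"; exactly one competing process can
# succeed. Returns the open handle on success, undef if the lock file
# already exists. Unlink "$path.lock" when processing is finished.
sub take_lock {
    my ($path) = @_;
    sysopen(my $fh, "$path.lock", O_CREAT | O_EXCL | O_WRONLY)
        or return undef;
    print {$fh} "$$\n";          # record the owner's pid for debugging
    return $fh;
}
```

      A process that gets undef back skips that file and tries the next one; remember that a crashed worker leaves a stale lock file behind, so some cleanup policy is needed.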

Re: Synchronizing multiple processes retrieving files out of shared directory
by Anonymous Monk on Mar 20, 2014 at 17:34 UTC
    The "single point of failure" is the entire application inclusive of all of its worker-bees, and the "failure" that can occur is that any or all of those "bees" misbehave. Therefore, focus simply on the strongest possible design. In this case, have one process or thread whose only purpose is to recognize that a new file has arrived. Let that process/thread add an entry to a "work-to-do" queue, which is read by all of the "bees." Now, those "bees" never have to worry about interference. Let all of the processes, including both the bees and the directory-watcher, be owned by a parent-process whose only purpose in life is to make sure that none of its children die.
      That doesn't seem as easily done when those processes live on four separate servers. That said, I'm all for some IPC mechanism that could manage those processes in a Linux environment.