pktrain has asked for the wisdom of the Perl Monks concerning the following question:

Dearest Monks,

I need to send files to a process whenever new files become available in a set of directories. The files can appear in the directories as quickly as every second to every minute. Each file should only be passed to the process (pqinsert) once and no files should be missed (both of these conditions must be satisfied). I wrote a Perl script to continuously poll the directories at a given interval, pass the file names to the external process, then move the processed files to an archive sub-directory.

Can you tell me whether the code below is an efficient way to solve this problem, and if it isn't, suggest improvements? Thank you so much!

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use File::Copy;

my $path       = "/mnt/ldmdata/";
my @site_array = ("karx", "kdlh", "kfsd", "kmpx", "kmvx", "kwbc");
my $poll_time  = 20;    # seconds between polls of all specified directories

for (;;) {
    foreach my $site (@site_array) {
        my $file_dir    = $path . $site;
        my $archive_dir = $file_dir . "/archive";
        mkdir $archive_dir, 0755 unless -d $archive_dir;

        opendir(my $dh, $file_dir) || die "Cannot open $file_dir: $!";
        # skip ".", "..", and the archive subdirectory -- plain files only
        my @files = grep { -f "$file_dir/$_" } readdir($dh);
        closedir($dh);

        foreach my $file (@files) {
            # pqinsert is an external program, so invoke it via system
            system("pqinsert", "$file_dir/$file") == 0
                or warn "pqinsert failed for $file_dir/$file";
            move("$file_dir/$file", "$archive_dir/$file")
                or warn "Cannot move $file to archive: $!";
        }
    }
    sleep $poll_time;
}

UPDATE: I wanted to share a package I found which is built on "inotify" to perform monitoring/action tasks like this. It looks pretty robust:

iWatch

Replies are listed 'Best First'.
Re: Continuously polling multiple directories for file transfer?
by almut (Canon) on Feb 10, 2009 at 12:38 UTC

    What's "most efficient" usually depends on the context, so I'll refrain from making absolute statements here...  But generally, event-based notification frameworks scale better than polling: as the number of files/directories grows, they require fewer resources to perform the same task.

    What's available in this respect depends on the OS you're using. For example, on Linux there's inotify, for which the Perl binding Linux::Inotify2 exists.

    The basic idea is to register handlers that are being called by the notification system (via hooks in the kernel) when certain events happen, like file creation, etc.
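
    To make the idea concrete, here is a hedged sketch using Linux::Inotify2 for one of the site directories from the original script; error handling is minimal and adapting it to all sites and the archive step is left out:

```perl
use strict;
use warnings;
use Linux::Inotify2;    # CPAN module; Linux only

# Watch one LDM site directory and hand each finished file to pqinsert.
# IN_CLOSE_WRITE fires when a writer closes the file and IN_MOVED_TO when
# a file is renamed into the directory -- both are better triggers than
# IN_CREATE, which can fire while the file is still being written.
my $inotify = Linux::Inotify2->new
    or die "Cannot create inotify object: $!";

$inotify->watch("/mnt/ldmdata/karx", IN_CLOSE_WRITE | IN_MOVED_TO, sub {
    my $event = shift;
    my $file  = $event->fullname;
    system("pqinsert", $file) == 0
        or warn "pqinsert failed for $file";
});

1 while $inotify->poll;    # block, dispatching events to the callback
```

    Unlike the polling loop, this wakes up only when the kernel reports a change, so it costs essentially nothing while the directories are idle.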

      For Windows, there is Win32::ChangeNotify, which can also tell you about many kinds of changes happening in the file system. In theory, if you're using WMI, you should be able to listen (for example) to the __InstanceCreationEvent of CIM_DataFile, but a cursory search of the intertubes only tells me of failures trying this.

      The script will be run in Ubuntu in a VMWare virtual machine environment, so I will look into "inotify". Thank you all for the helpful comments!
Re: Continuously polling multiple directories for file transfer?
by osunderdog (Deacon) on Feb 10, 2009 at 13:11 UTC

    Just sharing an issue I've run across in this domain.

    There is a difference between a new file appearing on disk and the file being completely written. For example, if a file is FTP'd to a directory, there is a period of time during which the new file exists but has only zero or partial size.

    There are various ways to get around this depending on your circumstances, but it's dangerous to assume that the file system performs atomic operations on disk.
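
    One common (if imperfect) workaround is to treat a file as complete only once its size has stopped changing. A minimal sketch, with a hypothetical helper name and a caller-chosen settle time:

```perl
use strict;
use warnings;

# Heuristic, not a guarantee: a file is "probably complete" if it is
# non-empty and its size has not changed for $wait seconds.
sub looks_complete {
    my ($file, $wait) = @_;
    my $size1 = -s $file // return 0;    # missing file: not complete
    sleep $wait;
    my $size2 = -s $file // return 0;
    return $size1 == $size2 && $size1 > 0;
}
```

    A slow or stalled upload can still fool this check, which is why the rename-based approaches discussed below it are preferable when you control the sender.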

    Still looking. Still searching.

      I am using a possibly outdated FTP RFC. See the commands STOU, RNFR, and RNTO.

      Along these lines, it might be better to STOU the file to a unique temporary name and then rename it once it has been completely uploaded. This is similar to the techniques used under *nix to ensure atomicity. If you create the file under its final name, and your ftpd does nothing behind the scenes to ensure that a file is complete when it appears in the file system, you will have this race condition.
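
      The same write-then-rename pattern is easy to apply on the receiving side too. A sketch (the helper name is made up): write to a temporary name in the destination directory, then rename(), which is atomic on POSIX filesystems as long as source and destination are on the same filesystem:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Write $data to $dest atomically: readers never see a partial file,
# because the temporary name is only renamed once the write is complete.
sub atomic_write {
    my ($dest, $data) = @_;
    # temp file in the SAME directory, so rename() stays on one filesystem
    my ($fh, $tmp) = tempfile(".upload_XXXXXX", DIR => dirname($dest));
    print {$fh} $data;
    close $fh or die "close $tmp: $!";
    rename $tmp, $dest or die "rename $tmp -> $dest: $!";
}
```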

      it's dangerous to assume that the file system performs atomic operations on disk.

      Under POSIX, I believe the atomic semantics of rename() are required, but under Windows this may not be the case. That is not to say, however, that RNFR and RNTO require POSIX rename() semantics.

      --MidLifeXis

      We found that for XML files you can check whether the file parses OK: if it's mid-download it will have unclosed tags and parsing will fail. This assumes you have a single root node per file.
        Or you can upload a second file once the first one has completed, e.g. filename.complete; when the second one appears you can assume the first is complete. I don't know how truly atomic it is, but we had no problems on a live system handling thousands of uploads per day over several years.
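
        A sketch of that marker-file convention on the polling side (directory layout and extension are made-up examples): only pick up a data file once its ".complete" companion has appeared, and skip the markers themselves.

```perl
use strict;
use warnings;

# Return the data files in $dir whose ".complete" marker exists,
# excluding the marker files themselves.
sub ready_files {
    my ($dir) = @_;
    return grep { -f $_ && $_ !~ /\.complete$/ && -e "$_.complete" }
           glob("$dir/*");
}
```

        The processing loop would then also move (or delete) the marker when archiving the data file, so neither is picked up twice.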