pktrain has asked for the wisdom of the Perl Monks concerning the following question:

Dearest Monks,

I need to send files to a process whenever new files become available in a set of directories. The files can appear in the directories as quickly as every second to every minute. Each file should only be passed to the process (pqinsert) once and no files should be missed (both of these conditions must be satisfied). I wrote a Perl script to continuously poll the directories at a given interval, pass the file names to the external process, then move the processed files to an archive sub-directory.

Can you tell me whether the code below is an efficient way to solve this problem, and if it isn't, suggest improvements? Thank you so much!

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use File::Copy;

my $path       = "/mnt/ldmdata/";
my @site_array = ("karx", "kdlh", "kfsd", "kmpx", "kmvx", "kwbc");
my $poll_time  = 20;    # seconds between polls of all specified directories

for (;;) {
    foreach my $site (@site_array) {
        my $file_dir    = $path . $site;
        my $archive_dir = $file_dir . "/archive";
        mkdir $archive_dir, 0755 unless -d $archive_dir;

        opendir(my $dh, $file_dir) || die "Cannot open $file_dir: $!";
        # skip ".", "..", and the archive subdirectory -- plain files only
        my @files = grep { -f "$file_dir/$_" } readdir($dh);
        closedir($dh);

        foreach my $file (@files) {
            # pqinsert is an external program, so invoke it via system
            system("pqinsert", "$file_dir/$file") == 0
                or warn "pqinsert failed for $file_dir/$file";
            move("$file_dir/$file", "$archive_dir/$file")
                or warn "Cannot move $file to archive: $!";
        }
    }
    sleep $poll_time;
}

UPDATE: I wanted to share a package I found which is built on "inotify" to perform monitoring/action tasks like this. It looks pretty robust:

iWatch

Replies are listed 'Best First'.
Re: Continuously polling multiple directories for file transfer?
by almut (Canon) on Feb 10, 2009 at 12:38 UTC

    What's "most efficient" usually depends on the context, so I'll refrain from making absolute statements here...  But generally, event-based notification frameworks scale better than polling: as the number of files/directories grows, they require fewer resources to perform the same task.

    What's available in this respect depends on the OS you're using. For example, on Linux there's inotify, for which the Perl binding Linux::Inotify2 exists.

    The basic idea is to register handlers that are being called by the notification system (via hooks in the kernel) when certain events happen, like file creation, etc.
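
    To make the idea concrete, here is a hedged sketch using Linux::Inotify2 for one of the site directories from the original script; error handling is minimal and adapting it to all sites and the archive step is left out:

```perl
use strict;
use warnings;
use Linux::Inotify2;    # CPAN module; Linux only

# Watch one LDM site directory and hand each finished file to pqinsert.
# IN_CLOSE_WRITE fires when a writer closes the file and IN_MOVED_TO when
# a file is renamed into the directory -- both are better triggers than
# IN_CREATE, which can fire while the file is still being written.
my $inotify = Linux::Inotify2->new
    or die "Cannot create inotify object: $!";

$inotify->watch("/mnt/ldmdata/karx", IN_CLOSE_WRITE | IN_MOVED_TO, sub {
    my $event = shift;
    my $file  = $event->fullname;
    system("pqinsert", $file) == 0
        or warn "pqinsert failed for $file";
});

1 while $inotify->poll;    # block, dispatching events to the callback
```

    Unlike the polling loop, this wakes up only when the kernel reports a change, so it costs essentially nothing while the directories are idle.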

      For Windows, there is Win32::ChangeNotify, which can also tell you about many kinds of changes happening in the file system. In theory, if you're using WMI, you should be able to listen (for example) to the __InstanceCreationEvent of CIM_DataFile, but a cursory search of the intertubes only tells me of failures trying this.

      The script will be run in Ubuntu in a VMWare virtual machine environment, so I will look into "inotify". Thank you all for the helpful comments!
Re: Continuously polling multiple directories for file transfer?
by osunderdog (Deacon) on Feb 10, 2009 at 13:11 UTC

    Just sharing an issue I've run across in this domain.

    There is a difference between a new file appearing on disk and the file being completely written. For example, if a file is FTP'd to a directory, there is a period of time during which the new file exists but has only zero or partial size.

    There are various ways to get around this depending on your circumstances, but it's dangerous to assume that the file system performs atomic operations on disk.
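
    One common (if imperfect) workaround is to treat a file as complete only once its size has stopped changing. A minimal sketch, with a hypothetical helper name and a caller-chosen settle time:

```perl
use strict;
use warnings;

# Heuristic, not a guarantee: a file is "probably complete" if it is
# non-empty and its size has not changed for $wait seconds.
sub looks_complete {
    my ($file, $wait) = @_;
    my $size1 = -s $file // return 0;    # missing file: not complete
    sleep $wait;
    my $size2 = -s $file // return 0;
    return $size1 == $size2 && $size1 > 0;
}
```

    A slow or stalled upload can still fool this check, which is why the rename-based approaches discussed below it are preferable when you control the sender.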

    Still looking. Still searching.

      I am using a possibly outdated FTP RFC. See the commands STOU, RNFR, and RNTO.

      Along these lines, it might be better to STOU the file to a unique temporary name and then rename it once it has been completely uploaded. This is similar to the techniques used under *nix to ensure atomicity. If you create the file under its final name, and your ftpd does nothing behind the scenes to ensure that a file is complete when it appears in the file system, you will have this race condition.
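
      The same write-then-rename pattern is easy to apply on the receiving side too. A sketch (the helper name is made up): write to a temporary name in the destination directory, then rename(), which is atomic on POSIX filesystems as long as source and destination are on the same filesystem:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Write $data to $dest atomically: readers never see a partial file,
# because the temporary name is only renamed once the write is complete.
sub atomic_write {
    my ($dest, $data) = @_;
    # temp file in the SAME directory, so rename() stays on one filesystem
    my ($fh, $tmp) = tempfile(".upload_XXXXXX", DIR => dirname($dest));
    print {$fh} $data;
    close $fh or die "close $tmp: $!";
    rename $tmp, $dest or die "rename $tmp -> $dest: $!";
}
```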

      it's dangerous to assume that the file system performs atomic operations on disk.

      Under POSIX, I believe the atomic semantics of rename() are required, but under Windows this may not be the case. That is not to say, however, that RNFR and RNTO require POSIX rename() semantics.

      --MidLifeXis

      We found that for XML files you can check whether the file parses OK: if it's mid-download it will have unclosed tags and parsing will fail. This assumes you have a single root node per file.
        Or you can upload a second file once the first one has completed, e.g. filename.complete; when the second one appears you can assume the first is complete. I don't know how truly atomic it is, but we had no problems on a live system handling thousands of uploads per day over several years.
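
        A sketch of that marker-file convention on the polling side (directory layout and extension are made-up examples): only pick up a data file once its ".complete" companion has appeared, and skip the markers themselves.

```perl
use strict;
use warnings;

# Return the data files in $dir whose ".complete" marker exists,
# excluding the marker files themselves.
sub ready_files {
    my ($dir) = @_;
    return grep { -f $_ && $_ !~ /\.complete$/ && -e "$_.complete" }
           glob("$dir/*");
}
```

        The processing loop would then also move (or delete) the marker when archiving the data file, so neither is picked up twice.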