KevinBr has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that runs as a daemon. It scans a directory for files, processes the files, moves them to a backup location, then repeats the cycle by scanning again for any new files that may have arrived while the others were being processed. Processing a file can take anywhere from a few seconds to several minutes.

To make my process more efficient, I use an afork routine to run a limited number of child processes in parallel:
sub afork (@$&) {
    # First field = array, second field = max number of processes to
    # run at the same time, third field = subroutine to run against
    # each array element
    my ($data, $max, $code) = @_;
    my $c = 0;
    foreach my $data (@$data) {
        wait unless ++$c <= $max;
        die "Fork failed: $!\n" unless defined(my $pid = fork);
        exit $code->($data) unless $pid;
    }
    1 until -1 == wait;
}

while (1) {
    @FILES = `ls -1 $DIRECTORY`;
    afork(\@FILES, 3, \&process_file);
}
The problem is, groups of files tend to arrive in this directory all at once. If 10 files arrive and I send them as an array to the afork routine, which limits them to running 3 at a time, I must wait until all 10 files complete before I can scan the directory again for new files. This is troubling when one file in that group of 10 takes 20 minutes to complete while the rest finish in a few seconds: I am forced to wait until the one large file completes before I can begin processing any new files.

Is there a way to update the array I pass to afork on the fly, or should I use something else altogether?

Re: read directory, fork processes
by ikegami (Patriarch) on Feb 24, 2010 at 01:24 UTC
    Why not create three worker threads? Whenever a new file arrives, add it to the queue; the workers pick items off the queue as they become idle.
    use strict;
    use warnings;

    use threads;
    use Thread::Queue qw( );

    my $num_workers = 3;

    sub process_file {
        my ($file) = @_;
        ...
    }

    my $q = Thread::Queue->new();

    for (1..$num_workers) {
        async {
            while (defined(my $file = $q->dequeue())) {
                process_file($file);
            }
        };
    }

    for (;;) {
        ... wait for new files ...
        $q->enqueue(@new_files);
    }

    $q->enqueue(undef) for 1..$num_workers;
    $_->join() for threads->list();

    Adding use forks; should make the above use processes instead of threads.
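    For example, the top of the script would become something like this (a sketch only: it assumes the forks module is installed and acts as the drop-in ithreads replacement its documentation describes, and my assumption that forks::shared is also wanted for Thread::Queue's shared internals):

    use strict;
    use warnings;

    use forks;              # load first so it stands in for threads
    use forks::shared;      # assumption: Thread::Queue shares data internally
    use Thread::Queue qw( );

    # ... the rest of the script is unchanged ...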

Re: read directory, fork processes
by jwkrahn (Abbot) on Feb 24, 2010 at 02:59 UTC
    sub afork (@$&) {

    Your prototype does not do what you seem to think it is doing. From Prototypes: "Any unbackslashed @ or % eats all remaining arguments, and forces list context." So it is the same as if you had not used a prototype at all (and you really shouldn't be using prototypes anyway).
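    A small demonstration of the difference (the subs slurpy and strict_args are hypothetical, written only to illustrate the point):

    use strict;
    use warnings;

    my @files = qw( a b c );

    # With the unbackslashed @, the (@$&) prototype slurps every argument
    # into one flat list, so the $ and & slots are never reached. Perl even
    # warns "Prototype after '@' ..." at compile time under warnings.
    sub slurpy (@$&) { print scalar(@_), " arguments\n"; }
    slurpy(@files, 3, sub { });        # prints "5 arguments"

    # Enforcing "array, scalar, coderef" needs a backslashed @, which also
    # makes Perl pass the array by reference automatically:
    sub strict_args (\@$&) {
        my ($aref, $max, $code) = @_;
        print 'got ', scalar(@$aref), " items, max $max\n";
    }
    strict_args(@files, 3, sub { });   # @files arrives as \@files in $aref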

Re: read directory, fork processes
by BrowserUk (Patriarch) on Feb 24, 2010 at 09:10 UTC

    One addition I would make to ikegami's post: when doing your file discovery in the main thread, rename the files into a work directory before queuing the new name. If the target of the rename is on the same disk, it will take very little time regardless of the size of the file, and it will keep the arrivals directory clear of known files, greatly simplifying the next pass of the discovery process.
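    A minimal sketch of that idea (the paths are assumed placeholders; queuing is left as a comment so the fragment stands alone):

    use strict;
    use warnings;
    use File::Basename qw( basename );

    my $DIRECTORY = '/var/spool/arrivals';   # assumed paths, for illustration
    my $WORKDIR   = '/var/spool/work';       # must be on the same filesystem

    for my $file ( glob("$DIRECTORY/*") ) {
        my $dest = "$WORKDIR/" . basename($file);
        if ( rename $file, $dest ) {    # cheap and atomic on one disk
            # hand $dest to a worker here, e.g. enqueue it on the
            # Thread::Queue from ikegami's post
        }
        else {
            warn "could not move $file: $!";
        }
    }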

    Also, if you can put the final (post-processing) destination (the backup location) on another drive, that will help with disk-head thrash. But don't rename the files there immediately, as that would require a copy operation and slow things down again.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      We use a variation of the technique BrowserUk suggests, and it works quite well. Any process dropping something into the work queue directory initially creates the file with a ".work" extension. When the file is complete, it's renamed to remove the extension. And (of course) the processor ignores all files with the extension.
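      A minimal sketch of the pattern (hypothetical file name and payload, not roboticus's production code; the path is an assumed placeholder):

      use strict;
      use warnings;

      my $DIRECTORY = '/var/spool/work';                  # assumed path
      my ($name, $payload) = ('report.csv', "a,b,c\n");   # made-up example

      # Writer side: create under a ".work" name, rename once fully written,
      # so the file never appears in the directory in a half-written state.
      open my $out, '>', "$DIRECTORY/$name.work" or die "open: $!";
      print {$out} $payload;
      close $out or die "close: $!";
      rename "$DIRECTORY/$name.work", "$DIRECTORY/$name" or die "rename: $!";

      # Reader side: the processor simply skips anything still marked ".work".
      my @ready = grep { !/\.work\z/ } glob("$DIRECTORY/*");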

      ...roboticus

        roboticus, do you mind sharing your code?
Re: read directory, fork processes
by cdarke (Prior) on Feb 24, 2010 at 11:50 UTC
    To make my process more efficient

    If you are interested in efficiency, then I suggest you don't use the ls(1) program to get your list of files:
    while (1) {
        #@FILES = `ls -1 $DIRECTORY`;
        @FILES = glob("$DIRECTORY/*");
        print "@FILES\n";
    }
    glob has a number of other advantages (and differences) apart from avoiding an extra child process each time you go around the loop. Using ls(1) results in a newline at the end of each filename, which (presumably) you have to remove; in addition, you (presumably) have to prefix $DIRECTORY/ to each filename before you process it. You don't get any newlines added with glob, and the directory name is included in the filename. The downside is that it does not handle directory names containing whitespace (use File::Glob qw(glob); if that is an issue).
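    For comparison, here is the bookkeeping the backtick approach forces on you (a sketch; the path and variable names are mine, chosen for illustration):

    my $DIRECTORY = '/var/spool/arrivals';        # assumed path

    my @via_ls = `ls -1 $DIRECTORY`;
    chomp @via_ls;                                # strip the trailing newlines
    @via_ls = map { "$DIRECTORY/$_" } @via_ls;    # re-attach the directory

    my @via_glob = glob("$DIRECTORY/*");          # full paths, no newlines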

    If you are running on Linux you might be interested in Linux::Inotify2 to monitor your directory.
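    A sketch of that approach ($DIRECTORY is again an assumed placeholder, and the callback just prints where you would hand the file to a worker):

    use strict;
    use warnings;
    use Linux::Inotify2;

    my $DIRECTORY = '/var/spool/arrivals';   # assumed path

    my $inotify = Linux::Inotify2->new
        or die "unable to create inotify object: $!";

    # Fire only when a file is fully written or moved into the directory.
    $inotify->watch( $DIRECTORY, IN_CLOSE_WRITE | IN_MOVED_TO, sub {
        my $event = shift;
        print $event->fullname, " is ready\n";   # enqueue for a worker here
    });

    1 while $inotify->poll;   # blocks; callbacks fire as events arrive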
Re: read directory, fork processes
by zentara (Cardinal) on Feb 24, 2010 at 13:16 UTC
      Thanks to all for the suggestions! I will post a response when I have implemented a solution.