Fendaria has asked for the wisdom of the Perl Monks concerning the following question:

We have a file-based process that operates on XML files on Windows (XML file in -> manipulated -> XML file out); performance of that step is fine.

Windows box, 2000 Server (or 2003, I forget which). Cygwin perl, not ActiveState (because of the libraries we depend on).

The files, after they are written, are then processed by a perl script. It runs via a cron job every 2 minutes, scans for new files, renames the new files (adds a timestamp), archives them, and pushes them to a remote machine (tar/scp).

The problem is that, due to some external factors, the cron job doesn't always run every two minutes, so occasionally the files back up in the directory waiting to be processed. Right now we have 40,000+ files sitting in that directory. Unfortunately the number of files in the directory has ground the system to a halt. The XML files are individually small, under 1k each; there just happen to be a lot of them.

The perl script is using File::Copy to copy/rename files, unlink to delete them, and tar/scp to move the files around. When the directory is small (under 1000 files) performance is great. When the directory is large, renames can take over 1sec per file, copies can take over 1sec per file, and unlinks are down to under 10 per second. (I can't unlink multiple files at once because I need to know which file fails).

We are working on a variety of solutions, but my biggest concern is the processing time of File::Copy and unlink in perl.

So the big question is: is there any faster way to do this work? I've looked around and can't find anything which looks like it might be faster than File::Copy (and unlink).

One item I am strongly considering is attempting to process the files in more of a 'batch' mode. I could shell out and do a system call to 'rm', but then I would have the overhead of the system call and would still need to test each individual file to make sure it really got deleted. I also know of no real way to batch a call to File::Copy (or the system copy) to handle multiple files at once and still tell me which of the batch failed, for logging/error handling.
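
For illustration, here is a rough, untested sketch of the kind of batch delete with per-file error reporting I have in mind (the directory path is just a placeholder):

    use strict;
    use warnings;

    my $dir = '/incoming';    # placeholder for the real spool directory

    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @files = grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;

    # unlink() accepts a list but only returns a count, so delete one
    # file at a time to know exactly which one failed and why.
    my @failed;
    for my $file (@files) {
        unlink "$dir/$file" or push @failed, [ $file, "$!" ];
    }

    warn "Could not delete $_->[0]: $_->[1]\n" for @failed;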

Any thoughts or ideas that I should pursue or investigate are greatly appreciated. As well, if this is an OS performance problem that can't really be solved, I'd appreciate knowing that too.

Note: the identical Perl script currently runs under Linux as well. Linux doesn't suffer from this problem as badly, but it does slow down when large numbers of files are in the same directory, just not as significantly. Thus cross-platform ideas would help too. However, if I have to put OS-specific code into my script for performance gains, I am willing to do that. And if I have to install ActiveState for a solution, I'm willing to do that too, but I would rather not.

Note: I could throw better hardware at this problem, but I really don't want to have to do this. I feel like there ought to be a better way to go about it via software (hopefully in Perl).

Apologies if this post isn't too clear; I'm busy trying to get this working better and don't have days to write up as clean a description as I would like.

Thanks,

Fendaria

Replies are listed 'Best First'.
Re: File::Copy and file manipulation performance
by Fletch (Bishop) on Dec 06, 2005 at 21:03 UTC

    Many filesystems don't deal well with large numbers of files in the same directory; exactly what counts as "large" varies by OS, version, filesystem type, etc., but you're obviously bumping up against it.

    One common scheme is to set up multiple directories (named after hex digits, for example, so you've got "0".."9", "a".."f") and use some sort of hashing function to assign each new file to a subdirectory. That'll reduce the number of files in any one directory to 1/16th of what it was. Add more subdirs (two hex digits, "00".."ff") or more levels ("0/0", "0/1", ...) to spread things out further if required.
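
    A minimal sketch of that idea (the hash choice and paths here are just placeholders; any reasonably uniform function over the filename will do):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);
        use File::Copy qw(move);
        use File::Path qw(mkpath);

        # Pick a bucket from the first hex digit of an MD5 of the filename,
        # giving 16 subdirectories "0".."f".
        sub bucket_for {
            my ($name) = @_;
            return substr( md5_hex($name), 0, 1 );
        }

        my ( $file, $base ) = ( 'foo.xml', '/spool' );    # placeholder paths
        my $dir = "$base/" . bucket_for($file);
        mkpath($dir) unless -d $dir;
        move( $file, "$dir/$file" ) or warn "move of $file failed: $!";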

Re: File::Copy and file manipulation performance
by psychotic (Beadle) on Dec 06, 2005 at 22:15 UTC
    Instead of scheduling via cron at predefined intervals, take a different route: a Perl script running in the background as a service, polling the directory for accumulated files every 15 seconds, say. If the number of files is lower than a given threshold, say 1000 files, and the time elapsed since the last processing run (moving files around) is less than two minutes, do nothing. If the count threshold is reached, perform the operation regardless of elapsed time. Else, if the elapsed time goes over two minutes and the files are fewer than the threshold, perform the operation anyway. (A sketch of this loop is at the end of this reply.)

    The operation will most likely consist of spawning a child process to do the task. If the files aren't actually moved out of the watched folder, the parent process could keep an internal list and pass the relevant information down to its children.

    The system could also be devised to be fail-safe, with a different Perl process monitoring the working parent. If the parent fails, it gets restarted automatically where it left off and the administrative team is alerted.
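
    An untested sketch of the polling loop described above (the path and thresholds are placeholders, and process_files() is just a stand-in for the existing rename/tar/scp step):

        use strict;
        use warnings;

        my $dir            = '/incoming';    # watched folder (placeholder)
        my $file_threshold = 1000;            # file-count threshold
        my $max_wait       = 120;             # seconds between forced runs
        my $last_run       = time;

        sub process_files {
            # stand-in for the real rename/archive/push step
            warn "would process ", scalar(@_), " files\n";
        }

        while (1) {
            opendir my $dh, $dir or die "Cannot open $dir: $!";
            my @files = grep { -f "$dir/$_" } readdir $dh;
            closedir $dh;

            my $elapsed = time - $last_run;
            if ( @files >= $file_threshold
                 or ( $elapsed >= $max_wait and @files ) ) {
                process_files(@files);
                $last_run = time;
            }
            sleep 15;
        }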

      I'm unfamiliar with running a Perl script on Windows as a service. Is there a link/page you can point me to that explains it?

      I'm also unsure about launching threads under Perl on Windows, but it is something I am considering tackling. My biggest hurdle is making sure the same work isn't done twice (two Perl programs checking the same directory and trying to move the same files).

      Fendaria
        When saying "service" I meant it in the broad sense of the word: a background, persistent process that optionally starts up when the machine boots. This can be achieved by utilizing the native Windows Services API, either via the GUI or with the instsrv.exe command-line utility. That step would most likely involve feeding that utility the full path to perl along with its command line, i.e. the script you want run as a service. If this doesn't work out, you can try the standard All Users > Startup folder, which launches its contents at login, or place it in a login script. Plenty of options aside from the obvious manual launching.
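
        If a CPAN module is an option, another route (an assumption on my part, not something required above; it generally wants a native Win32 perl such as ActiveState rather than Cygwin) is Win32::Daemon, whose usual control loop looks roughly like this:

            use strict;
            use warnings;
            use Win32::Daemon;    # assumes a native Win32 perl

            Win32::Daemon::StartService();

            my $state;
            while ( SERVICE_STOPPED != ( $state = Win32::Daemon::State() ) ) {
                if ( $state == SERVICE_START_PENDING ) {
                    Win32::Daemon::State(SERVICE_RUNNING);
                }
                elsif ( $state == SERVICE_STOP_PENDING ) {
                    Win32::Daemon::State(SERVICE_STOPPED);
                }
                elsif ( $state == SERVICE_RUNNING ) {
                    # poll the directory and do the real work here
                }
                sleep 2;
            }

            Win32::Daemon::StopService();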

        As for threads on Windows, things are pretty straightforward with recent versions of Perl. Just 'use threads' and then do something like $thread = threads->new($coderef, @data) after reading up on the documentation.

        Of course, certain data can be shared amongst threads, either by passing data back and forth between them, or by keeping data in the parent and handing it down to worker threads, maintaining an index of what has been taken care of and what is available for the next thread in line. I am not sure, but I believe this is the "Work Crew" threads model of operation.
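
        A minimal work-crew sketch along those lines (untested; the worker body is just a placeholder for the real per-file work, and the shared queue is what keeps two threads from grabbing the same file):

            use strict;
            use warnings;
            use threads;
            use Thread::Queue;

            my $queue   = Thread::Queue->new();
            my $workers = 4;

            # Each worker pulls file names off the shared queue, so no two
            # threads ever handle the same file.
            my @threads = map {
                threads->create( sub {
                    while ( defined( my $file = $queue->dequeue() ) ) {
                        print "worker handling $file\n";    # placeholder work
                    }
                } );
            } 1 .. $workers;

            $queue->enqueue($_) for glob '*.xml';
            $queue->enqueue(undef) for 1 .. $workers;    # tell workers to exit
            $_->join() for @threads;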

        The point of this approach is staying miles away from the wall, thus making it impossible to hit. What you were proposing seems to me like speeding at 200 MPH and pulling the brakes 50 meters from the wall. Excuse the not-so-amusing analogy. :)

Re: File::Copy and file manipulation performance
by tirwhan (Abbot) on Dec 06, 2005 at 21:52 UTC

    Are you using the Reiser3 filesystem for the Linux box? Reiser performs a lot better on directories with many small files than any of the other Linux filesystems. It also has a few gotchas (like you shouldn't store a Reiser filesystem image on a Reiser filesystem), but for this purpose it should be the FS of choice.


    Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
Re: File::Copy and file manipulation performance
by Perl Mouse (Chaplain) on Dec 07, 2005 at 00:17 UTC
    The perl script is using File::Copy to copy/rename files, unlink to delete them, and tar/scp to move the files around.
    I don't get it. What's the point of copying the files with File::Copy if all you want is to move the files from one system to another - which you are doing with scp already?
    Perl --((8:>*
Re: File::Copy and file manipulation performance
by diotalevi (Canon) on Dec 06, 2005 at 22:39 UTC
    File::Copy::move() is just File::Copy::copy() with an unlink() if it succeeded.

      I think not; move() first tries a rename, which is an atomic operation (if both files are on the same filesystem, not over NFS, etc.), at least on POSIX systems. I was under the impression this was the same on Windows systems? Only if the rename does not succeed does File::Copy::move() attempt a copy and unlink.
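
      Roughly, that behaviour amounts to something like this (a simplified illustration, not the actual File::Copy source; my_move() is just a stand-in name):

          use strict;
          use warnings;
          use File::Copy ();

          # Illustrative only: try an atomic rename first, then fall back
          # to copy + unlink (e.g. when crossing filesystems).
          sub my_move {
              my ( $from, $to ) = @_;
              return 1 if rename $from, $to;
              File::Copy::copy( $from, $to ) or return 0;
              return unlink($from) ? 1 : 0;
          }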


      Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan