Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am working on part of a publishing system that runs in a multiserver environment. One of the current pieces takes incoming data out of multiple FTP directories every half hour and copies it to another directory using a system call to cp. 99.9% of the time this works, but occasionally a file will be mid-upload when the process kicks off and I'll get zero-byte or partial files coming in. I've tried a few different things to no avail and was wondering if anyone knew of a way to be sure files weren't still being uploaded before I start running code against them.

Re: FTP and File Copying
by merlyn (Sage) on Nov 27, 2000 at 05:54 UTC
    Always upload to a "temp" file name, then rename to a "good" file name (via FTP rename operations) when the upload is finished. You'll never get a partial file that way.

    -- Randal L. Schwartz, Perl hacker
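
    For the uploading side, a minimal sketch of this pattern with the standard Net::FTP module -- the host, credentials, and filenames below are placeholders:

        use strict;
        use Net::FTP;

        my $ftp = Net::FTP->new('ftp.example.com')
            or die "connect failed: $@";
        $ftp->login('user', 'password') or die "login failed: " . $ftp->message;
        $ftp->binary;

        # Upload under a temporary name the collector ignores...
        $ftp->put('report.dat', 'report.dat.part')
            or die "put failed: " . $ftp->message;

        # ...then rename it in one step to mark it complete.
        $ftp->rename('report.dat.part', 'report.dat')
            or die "rename failed: " . $ftp->message;

        $ftp->quit;

    The collecting script then simply skips anything still ending in ".part".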

      That is not one of my options. The files need to be taken as-is once they are uploaded. I don't have the clout, nor do the submitters have the knowledge, to go in and rename files via FTP. Most of the people uploading these files are lucky to be able to FTP them at all. I cannot enforce temp-to-final naming conventions and expect them to follow them; I am lucky I get them to stick to the naming conventions at all. So the issue is that I need to be able to tell whether a file is currently open and being written to by any other process, and I haven't found anything that does this yet. I have thought of workarounds such as writing filenames and timestamps to a DB and watching the xferlog for new incoming files, but I would much rather work at the level of the files themselves than bring in extraneous resources. I've also tried flock, but I can get a full lock even while the file is being written by another source (for example, a file can be flock'd while tar is still writing it), which is exactly what I don't want if the file is still being written. Any process I have control over does go from temp to final filenames over FTP where I can arrange it.
        If you are running under a Unix OS, you can use a utility called "lsof" to determine which processes currently have the file open, the mode they opened it with, and so on. My lsof binary points to ftp://vic.cc.purdue.edu/pub/tools/unix/lsof if your system doesn't have it. The program exits with status 0 if any process has the file open.

        I really think this is kind of a crappy solution, though, because other programs may have that file open for other reasons that might not be easy to differentiate, and (as mentioned elsewhere), it doesn't help you if the FTP process legitimately closes the file in an incomplete form.
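
        A rough sketch of that check from Perl, assuming an lsof binary on the PATH (the -t flag limits output to bare PIDs, which go to stdout):

            use strict;

            # True if any process has $file open, judging by lsof's exit status
            # (0 = at least one process has it open, 1 = none).
            sub file_in_use {
                my ($file) = @_;
                my $status = system('lsof', '-t', $file);
                die "cannot run lsof: $!" if $status == -1;
                return ($status >> 8) == 0;
            }

            print "still being written?\n" if file_in_use('/ftp/incoming/report.dat');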

        There's no way. You cannot know if it's open. All you can do is quarantine any file touched in the last 15 minutes or so.

        Some problems cannot be solved within the requirements specified.

        -- Randal L. Schwartz, Perl hacker
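
        A sketch of that quarantine rule for the half-hourly pass -- the directory is a placeholder and the 15-minute window is the guess from above:

            use strict;

            my $quarantine = 15 * 60;          # seconds a file must sit untouched
            my $dir        = '/ftp/incoming';  # placeholder path

            opendir my $dh, $dir or die "opendir $dir: $!";
            for my $name (readdir $dh) {
                my $path = "$dir/$name";
                next unless -f $path;
                my $mtime = (stat _)[9];       # reuse the stat buffer from -f
                next if time() - $mtime < $quarantine;   # too fresh -- skip this pass
                # ... $path has been quiet long enough; copy it ...
            }
            closedir $dh;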

        If you really do have to get them as soon as they are finished, this sounds like a project for a secure web server. Of course, that won't work for dial-in folks, but it could be a good solution.

Re: FTP and File Copying
by lhoward (Vicar) on Nov 27, 2000 at 18:37 UTC
    Just to throw another log on the fire...

    Another possibility might be to watch the FTP server's log and copy files when the server logs them as complete. If you tail the log in real time, you could copy the files to their final location immediately instead of scheduling the replication every half hour like you are doing now.
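
    A rough sketch of that idea, assuming a wu-ftpd-style xferlog (where the last field is 'c' for a completed transfer and the filename is the ninth whitespace-separated field) and the CPAN File::Tail module -- verify the log path and field layout against your own server:

        use strict;
        use File::Basename qw(basename);
        use File::Copy     qw(copy);
        use File::Tail;

        my $tail = File::Tail->new(name => '/var/log/xferlog', maxinterval => 5);

        while (defined(my $line = $tail->read)) {
            my @f = split ' ', $line;
            next unless @f && $f[-1] eq 'c';   # only completed transfers
            my $path = $f[8];                  # filename field (no spaces assumed)
            copy($path, '/publish/spool/' . basename($path))
                or warn "copy $path: $!";
        }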

Re: FTP and File Copying
by a (Friar) on Nov 27, 2000 at 09:15 UTC
    Do you have to process the file right away? Can you adapt merlyn's suggestion: copy the file somewhere temporary and come back to it later. If it still matches the incoming file, it's done; handle and delete it. If not, try again. That puts you off by one 'check for incoming' loop at first, but you're already waiting half an hour, so it's not 'real time' critical. (A sketch of this follows below.)

    We use a checkfile process: if the file is the same size/untouched for some number of minutes, it's considered done. I guess part of the trade-off is the fallout from handling half-done files vs. the need for speed.

    a
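
    A sketch of that copy-and-compare pass with the standard File::Copy and File::Compare modules -- the staging path passed in is a placeholder:

        use strict;
        use File::Copy    qw(copy);
        use File::Compare qw(compare);

        # First pass: snapshot the incoming file to a staging copy.
        # Later pass: if the snapshot still matches, the upload is presumed done.
        sub ready_to_process {
            my ($incoming, $staged) = @_;
            unless (-e $staged) {
                copy($incoming, $staged) or die "copy $incoming: $!";
                return 0;                             # come back next loop
            }
            return compare($incoming, $staged) == 0;  # 0 => identical
        }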

Re: FTP and File Copying
by Albannach (Monsignor) on Nov 27, 2000 at 10:32 UTC
    Since there doesn't appear to be a good solution to this one yet, how about a bad one ;-)

    Your script could record the sizes (etc.) of all the files in the upload directory some time before the half-hour mark, then go to sleep. When the time comes to do the copying, the script works only on the files whose sizes are unchanged from the earlier snapshot. I'm guessing that 5 minutes ought to do, but if your senders are as iffy as you say, maybe you want to take several snapshots.
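
    A sketch of that snapshot-and-wait pass -- the directory and the 5-minute wait are placeholders:

        use strict;

        my $dir = '/ftp/incoming';   # placeholder

        # Snapshot the sizes, wait, then trust only the files that held steady.
        my %before = map { $_ => (stat $_)[7] } grep { -f } glob "$dir/*";
        sleep 5 * 60;   # the 5-minute guess from above

        for my $file (sort keys %before) {
            my $now = (stat $file)[7];   # current size (undef if file vanished)
            next unless defined $now && $now == $before{$file};
            # ... $file held steady for 5 minutes; copy it ...
        }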

    One other idea that you probably considered and discarded, but offered just in case: if there is any consistency to the file structure, can you detect from the content whether they are complete? I'd suggest having the submitters add a standard end-of-file line of some kind, but it sounds like that won't fly.

      On your last point, md5 sums seem like the right idea.

      I.e., make a rule that says you must upload an md5 signature along with any file -- that way you know a file is done when it matches its signature.

      Obviously this is a problem with less-than-skilled users.
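
      A sketch with the standard Digest::MD5 module, assuming the convention that each upload arrives with a companion "<file>.md5" holding the hex digest (that naming rule is an invention for this example):

          use strict;
          use Digest::MD5;

          # True once $file's MD5 matches the digest shipped in "$file.md5".
          sub upload_complete {
              my ($file) = @_;
              open my $sig, '<', "$file.md5" or return 0;  # no signature yet
              my ($want) = split ' ', scalar <$sig>;       # handles "digest filename" lines
              close $sig;

              open my $fh, '<', $file or return 0;
              binmode $fh;
              my $got = Digest::MD5->new->addfile($fh)->hexdigest;
              close $fh;

              return defined $want && lc($want) eq $got;
          }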

Re: FTP and File Copying
by 2501 (Pilgrim) on Nov 27, 2000 at 07:00 UTC
    I guess depending on why you are copying the file, could you instead create links to the files and put the links in the target dir?
    That would prevent partial copies, but it could be a problem if you are trying to protect the original uploads from corruption.
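
    A minimal sketch with Perl's built-in link(), which needs source and target on the same filesystem -- the paths are placeholders:

        use strict;

        my $src = '/ftp/incoming/report.dat';
        my $dst = '/publish/spool/report.dat';

        # A hard link shares the inode, so there is no partial copy to worry
        # about -- but an unfinished upload to $src is visible through $dst too.
        link $src, $dst or die "link $src -> $dst: $!";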