elsteve has asked for the wisdom of the Perl Monks concerning the following question:

I have a question about modules such as File::Find and File::Find::Rule. Are there any issues with running these modules across very large, active filesystems (on the order of 500-800 GB)? Page 30 of the O'Reilly book Perl for System Administration lists some cases where you should not use File::Find, and one of them is when "you need to change the names of the directories in the filesystem you are traversing while you are traversing it." My code itself won't be changing any directories, but I cannot assume that other users won't be changing things, especially since my code can take a couple of hours to complete. Any advice on this? Being able to use these modules would make my life a lot easier than having to reinvent the wheel. Thanks!

Replies are listed 'Best First'.
Re: File::Find on huge, dynamic filesystems?
by BrowserUk (Patriarch) on Dec 10, 2002 at 19:48 UTC

    There is always a risk when you take a snapshot of active data that, by the time you get around to using some of it, it will be out of date.

    A couple of possible ways to work around this:

    If you're not concerned with dealing with new stuff that has appeared since your snapshot was taken, then use -e to verify that files/dirs still exist before you take any action on, or as a result of the presence of, a file reported by File::Find.

    If you're running on a Windows system, you can register your interest in the (sub)tree that you have run File::Find on with Win32::ChangeNotify, and periodically check for and record any changes that have occurred since you ran File::Find. This complicates your code somewhat, as every time you process a file you would need to check your change record (a hash seems a possible choice) to see whether the file in question has had any changes recorded against it.

    There may be a similar mechanism in *nix, but I am not aware of it.

    Without knowing quite what use you are making of the information returned by File::Find, it's difficult to make any further suggestions.
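For what it's worth, the first suggestion (take the snapshot, then re-check with -e just before acting) might be sketched roughly like this. The sub names and the pattern argument are made up for illustration, and this assumes you only care about plain files:

```perl
use strict;
use warnings;
use File::Find;

# Collect candidate paths first (the "snapshot").
# snapshot_files and still_present are hypothetical helper names.
sub snapshot_files {
    my ( $top, $pattern ) = @_;
    my @found;
    find(
        {
            no_chdir => 1,    # do not chdir; $_ is then the full path
            wanted   => sub {
                push @found, $File::Find::name
                    if -f $_ && $File::Find::name =~ $pattern;
            },
        },
        $top
    );
    return @found;
}

# Later, right before acting, keep only the paths that still exist.
sub still_present {
    my @paths = @_;
    return grep { -e $_ } @paths;
}

# Example use (paths are placeholders):
#   my @snapshot = snapshot_files( '/some/tree', qr/\.log$/ );
#   ... time passes ...
#   for my $file ( still_present(@snapshot) ) { ... act on $file ... }
```

There is still a small window between the -e test and whatever you do next, so the action itself should also cope with a file that has just disappeared.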


    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
    Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Re: File::Find on huge, dynamic filesystems?
by graff (Chancellor) on Dec 10, 2002 at 20:27 UTC
    I second the earlier comments by BrowserUK about the inherent volatility of the data, and would add just this: if you have a compiled "find" utility available on your system (e.g. GNU find), try using that in a "system()" call, and benchmark it against whatever File::Find-based module approach you want. If your results come out looking like mine did, you could save yourself a lot of runtime by avoiding the module.
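A comparison along those lines could look something like the sketch below. It assumes a find that understands -type f is on your PATH, and the two counting subs are just stand-ins for whatever work you really do per file:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Find;

# Count regular files with File::Find (lstat, so symlinks are not followed).
sub count_with_module {
    my ($top) = @_;
    my $n = 0;
    find( sub { $n++ if lstat($_) && -f _ }, $top );
    return $n;
}

# Count regular files by reading the output of an external find(1).
# Assumption: no file names containing newlines.
sub count_with_find {
    my ($top) = @_;
    open( my $fh, '-|', 'find', $top, '-type', 'f' )
        or die "Cannot run find: $!";
    my $n = 0;
    $n++ while <$fh>;
    close $fh;
    return $n;
}

# Run each approach $runs times over $top and print a comparison table.
sub compare_on {
    my ( $top, $runs ) = @_;
    cmpthese(
        $runs,
        {
            'File::Find' => sub { count_with_module($top) },
            'find(1)'    => sub { count_with_find($top) },
        }
    );
    return;
}

# Only benchmarks when a directory is given on the command line.
compare_on( $ARGV[0], 5 ) if @ARGV;
```

Bear in mind that the first pass over a tree tends to warm the client's caches, so later passes may look faster than they would on a cold run.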
Re: File::Find on huge, dynamic filesystems?
by toma (Vicar) on Dec 11, 2002 at 08:23 UTC
    One thing to keep in mind is that someone may access the file just as it is being compressed. This can result in corruption of the read data. The amount of corruption can be quite small, such as the loss of a single character at the end of the uncompressed file.

    One possible way to fix this would be to change the permissions on the file, compress, and then change the permissions back.

    It is unlikely that someone will access an old file at the same time that your program compresses it, but better to prevent the possibility altogether.

    It should work perfectly the first time! - toma

      Changing the permissions would only work if you have your own user for the script -- say, you chown the file to yourself, chmod it to 600 and compress it, then change everything back. Why not just rename it? That would be atomic, plus if the renaming fails because the file or directory doesn't exist anymore, you can skip the compression step.
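A rough sketch of the rename-first idea, using IO::Compress::Gzip for the compression step. The ".claimed.$$" suffix and the sub name are invented for this example, and it assumes the temporary name stays in the same directory as the original:

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Returns the path of the compressed file, or nothing if the file could not
# be claimed (for example because it has vanished in the meantime).
# claim_and_compress and the ".claimed.$$" suffix are hypothetical.
sub claim_and_compress {
    my ($file) = @_;
    my $claimed = "$file.claimed.$$";

    # If the rename fails, the file or its directory is gone: skip it.
    rename( $file, $claimed ) or return;

    my $out = "$file.gz";
    if ( gzip( $claimed => $out ) ) {
        unlink $claimed or warn "Could not remove $claimed: $!";
        return $out;
    }

    # Compression failed: clean up and put the original back.
    warn "gzip failed for $file: $GzipError";
    unlink $out;
    rename( $claimed, $file ) or warn "Could not restore $file: $!";
    return;
}
```

Note that the rename only stops new opens by the old name. A process that already has the file open keeps its handle, so this narrows the window rather than closing it completely.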

Re: File::Find on huge, dynamic filesystems?
by elsteve (Acolyte) on Dec 10, 2002 at 21:49 UTC
    I'm running this script on a Solaris machine with large NFS (v3) mounted filesystems from NetApp filers. I just need to walk the filesystem searching for various file types, and then either compress or delete those files depending on their age. I'm sure this same script has been written a hundred times before, but I couldn't find one anywhere, so I'm doing it myself. It's going to be a time-consuming process no matter what, since I have to stat every file to get the size. Using GNU find is certainly an option, but I assumed (incorrectly perhaps) that it would be quicker to do all the work in Perl. Thanks.
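Since every file has to be stat'ed anyway, the age decision can come from that same call. A minimal sketch follows; the thresholds and the sub name are purely hypothetical, as the real policy isn't stated here:

```perl
use strict;
use warnings;

# Hypothetical thresholds, in days.
my $DELETE_AFTER_DAYS   = 90;
my $COMPRESS_AFTER_DAYS = 14;

# Decide what to do with one path using a single lstat call.
# Returns 'gone', 'skip', 'delete', 'compress' or 'keep'.
sub classify {
    my ( $file, $now ) = @_;
    $now = time unless defined $now;

    my @st = lstat($file) or return 'gone';    # vanished since it was listed
    return 'skip' unless -f _;                 # only plain files are handled

    my $mtime    = $st[9];    # $st[7] holds the size, from the same call
    my $age_days = ( $now - $mtime ) / 86_400;

    return 'delete'   if $age_days > $DELETE_AFTER_DAYS;
    return 'compress' if $age_days > $COMPRESS_AFTER_DAYS;
    return 'keep';
}
```

Passing $now in explicitly keeps a long run consistent, so a file doesn't change category just because the script took hours to reach it.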

      Hang about! If I have this correct, you want to traverse an NFS tree and action files that match your logic, but you are concerned about the FS changing under your feet. Could you nibble away at the output of whichever finder mechanism you choose, thus (hopefully) reducing the time between 'knowing' about a file and actioning it?


      This is off-the-cuff code, OK!
      open( my $find, '-|', '/usr/bin/find', @findargs ) || die $!;
      while ( my $file = <$find> ) {
          chomp $file;    # drop the newline so the file tests see the real name
          next unless $file =~ /$myfileMatch/;
          if ( -M $file > 14 ) {
              # file modified more than 14 days ago
              # delete file magic
          }
          else {
              # compress file magic
          }
      }
      close $find;

      This way open happily runs find in a forked process while the script does the dirty work. Of course, if you are compressing monster files across NFS from a filer, there is going to be a performance hit of some kind.

      Of course this may not be what you want, and doing it this way I strongly recommend the script keeps a record of exactly what the hell it's doing.

      Update: Since you are concerned about files that are ageing as opposed to new files, it perhaps does not matter too much that new data is added to this tree as the script runs. Reading back, methinks it's better to build a static list of matching files to operate on, then run through that list at the end of the search, checking with -e that each file still exists.



      I also hope you only have to deal with NFS to the filer, coz SMB op-locks might cause you pain
Re: File::Find on huge, dynamic filesystems?
by valdez (Monsignor) on Dec 10, 2002 at 19:53 UTC

    Could you please explain your situation in more detail? For example, what kind of file system are you using? Does it have quota support? How frequently would you like to update your thing? What kind of update do you need to perform?

    Thanks, Valerio

Re: File::Find on huge, dynamic filesystems?
by John M. Dlugosz (Monsignor) on Dec 11, 2002 at 16:28 UTC
    File::Find works by reading the whole directory, then processing the items inside it. So, if subdirs are added after the scan, they won't be seen. But, if you didn't close the DIR handle at the beginning but read one entry and processed it, what would that do when you added something?
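To make the "read the whole directory, then process" behaviour concrete, here is a small hand-rolled walker that works that way. It is only an illustration of the idea described above, not File::Find's actual code, and the walk sub is a made-up name:

```perl
use strict;
use warnings;
use File::Spec;

# Read every entry of a directory up front, close the handle, then process.
# Entries created in $dir after readdir returns are not seen on this pass.
sub walk {
    my ( $dir, $callback ) = @_;
    opendir( my $dh, $dir )
        or do { warn "Cannot open $dir: $!"; return };
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    for my $name ( sort @entries ) {
        my $path = File::Spec->catfile( $dir, $name );
        $callback->($path);

        # Recurse into real directories only; do not follow symlinks.
        walk( $path, $callback ) if !-l $path && -d _;
    }
    return;
}

# Example use: walk( '/some/tree', sub { print "$_[0]\n" } );
```

A directory that disappears between being listed and being opened just produces a warning here and the walk carries on with the next entry.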

    Unless you come up with a traversal method that you specifically want to use because it works better for you, go ahead and use the module.

Re: File::Find on huge, dynamic filesystems?
by elsteve (Acolyte) on Dec 11, 2002 at 23:07 UTC
    Thanks for the advice, everyone. Very helpful...I hadn't considered the issue with someone else accessing a file while it's being compressed. I'll do the rename thing to try to avoid that. The big question that I was initially concerned about, though, is the directory structure itself. So if I'm walking a tree named something like /one/two/three/four/five/six/seven...I'm down in there reading directory "seven", and then someone comes along and does a "mv /one/two /one/two_old". Is my whole process hung with File::Find? Thanks. Oh, and yes...I'm logging everything I do in the most paranoid way possible :-) And luckily, CIFS shouldn't be an issue in this case.