Preceptor has asked for the wisdom of the Perl Monks concerning the following question:

I'm doing something frighteningly ugly. I'm trying to count disk space across multiple file systems, so I can use it for reporting and charging.

So far, my code using 'File::Find' is pretty much doing what I want.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use threads;

my $size_threshold = 5 * 1024 * 1024;    # point at which to abort directory traversing

my @dirs = ( "/nas/fs001", "/nas/fs002", "/nas/fs003", "/nas/fs003",
             "/nas/fs004", "/nas/fs005", "/nas/fs006", "/nas/fs007",
             "/nas/fs008", "/nas/fs009", "/nas/fs010", "/nas/fs011",
             "/nas/fs012", "/nas/fs013", "/nas/fs014" );
@dirs = ( "/usr/local/apache" );
@dirs = ( "/nas/fs001", "/nas/fs002" );    # "/usr/local/apache";

my $customer_file = "/usr/local/apache/htdocs/dusage/disk_usage.conf";
my $max_depth     = 7;
my $debug         = 1;

sub dusage {
    my $dir = pop;    # 1 arg only, because that lets me thread.
    my $tsize;
    my %rtree;

    my $datafile = $dir;
    $datafile =~ s,/,,g;
    print "Opening $datafile for output";
    open( OUTPUT, ">$datafile.csv" ) or die $!;

    find(
        sub {
            if ( -f && !-l ) {
                my $filesize = -s $_;
                $tsize += $filesize;

                # chop up the path, populate rtree at each of traverse_depth levels
                my @directory_structure = split( '/', $File::Find::name );
                pop(@directory_structure);    # we'll never want the trailing filename
                for ( my $depth = 0; $depth <= $max_depth; $depth++ ) {
                    if ( $#directory_structure < $depth ) { next }
                    my $thispath = join( '/', @directory_structure[ 0 .. $depth ] );
                    $rtree{$thispath} += $filesize;
                }
            }
        },
        $dir
    );

    foreach my $key ( keys %rtree ) {
        my $indent = ( $key =~ tr,/,, );
        print OUTPUT $indent, ",", $key, ",", $rtree{$key}, "\n";
    }
    close(OUTPUT);
}

# main
foreach my $directory (@dirs) {
    dusage($directory);
}
Now, here's the problem.

I've been looking at doing a 'thready' version, so I can run across these filesystems all at once.

e.g. changing that 'last bit' to:

my %threads;
foreach my $directory (@dirs) {
    print "starting $directory search thread\n";
    $threads{$directory} = threads->new( \&dusage, $directory );
}
foreach my $directory (@dirs) {
    print "waiting for $directory collator thread to join...";
    $threads{$directory}->join;
    print "done.\n";
}
Now, this just doesn't work. As far as I can tell, the reason is that File::Find keeps its state in globals, so the work each thread does clobbers the others'. (The 'threaded' version works fine if I only start one thread, as far as I can tell.)

Is there an obvious/relatively painless way of doing what I want here, e.g. running multiple File::Find traversals at once?

I appreciate that I can quite easily run multiple instances of this program with different directory lists, and if there's no other solution I'll try that, but I was hoping to do the whole 'collect, collate, report' within a single piece of code.

Edit: Looks like what I'm after is the no_chdir option to File::Find. I've amended the code and will re-run it to see how that works out.
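
In other words, the find() call becomes something like this, using File::Find's documented options-hash form (the wanted sub stays the same, with the caveat in the comment):

use File::Find;

# no_chdir stops find() from changing the process-wide current
# directory, which is what the threads were trampling.  Note that
# with no_chdir set, $_ inside the wanted sub holds the full path
# (the same as $File::Find::name) rather than the bare filename.
find( { wanted => \&wanted, no_chdir => 1 }, $dir );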

Re: File::Find in a thread safe fashion
by Corion (Patriarch) on Jul 28, 2006 at 09:07 UTC

    I think that thread variables are unshared by default and thus your use of File::Find and $File::Find::name should still work. But I haven't used threads, so I can't really tell you.

    While searching for something else, I came across acme's journal entry where he mentions Proc::ParallelLoop, which forks subprocesses to work on loop iterations in parallel. As you do the output directly from your threads, and Proc::ParallelLoop even seems to buffer the output from its workers, you might be able to use that module directly. I haven't used it, though.
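
    If pareach works the way its synopsis suggests, the main loop might become something like the following untested sketch (Max_Workers and the calling convention are from that module's docs as best I recall, so treat them as assumptions):

    use Proc::ParallelLoop;

    # One forked worker per filesystem; since each dusage() call writes
    # its own CSV file, the children don't need to share Perl data back
    # to the parent.  Max_Workers caps the number of concurrent forks.
    pareach [ @dirs ], sub {
        my $dir = shift;
        dusage($dir);
    }, { "Max_Workers" => 4 };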

      If I do run more than one thread, I get a whole lot of errors like:

      Can't cd to (/nas/fs001/export/home/) .GROUP_CONFIG: No such file or directory at ./disk_report.pl line 56

      I'm assuming this is because File::Find uses shared variables for at least some of what it does.

Re: File::Find in a thread safe fashion
by tweetiepooh (Hermit) on Jul 28, 2006 at 11:34 UTC
    I tend to just run multiple copies of the program controlled from the parent along the lines of
    if ( !@ARGV ) {    # no runtime param, so act as the parent
        my @list = ( 'list', 'of', 'parameters' );
        foreach my $param (@list) {
            # launch a background copy of this program for one parameter,
            # with each copy logging to its own file
            system("$0 $param >$param.log 2>&1 &");
        }
        exit;
    }
    process( $ARGV[0] );    # child copy: handle one parameter

    sub process { ... }
    Since I work with databases, each copy can just stuff its data in and the database takes care of the rest.

    Maybe you could arrange for the parent to pause until all the kids have finished, then collate the data collected?
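
    For instance, a bare-bones parent that pauses and then collates might look like this (collate() is a stand-in for whatever merges the per-directory CSVs, and note tye's caveat below about fork not being portable):

    my @pids;
    foreach my $dir (@dirs) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {    # child: scan one tree, write its own CSV
            dusage($dir);
            exit 0;
        }
        push @pids, $pid;     # parent: remember each kid
    }
    waitpid( $_, 0 ) for @pids;    # pause until all the kids have finished
    collate();                     # then merge the per-directory output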

    I do this 'cos it works and scripting is a minor part of my job so I don't get the time to do it better/different.

      Well, that's my fallback plan - dump the data to files and collate later - but I was mostly just trying to be clever.

        Note that telling File::Find to not chdir will cause it to run slower (every opendir and every stat will have to parse longer and longer path strings and retraverse those directories in order to get to the items involved), though I'm not sure how much slower. You might be better off using multiple processes instead of threads (as demonstrated and without using the non-portable fork to boot; tweetiepooh++).

        - tye        

Re: File::Find in a thread safe fashion
by BrowserUk (Patriarch) on Jul 28, 2006 at 12:03 UTC

    You'll get the same problem with any code that relies upon global state if you try to use it in threads. You'd likely hit a similar problem trying to use File::Find in an event-driven environment like POE.

    There was a recent thread about what is wrong with File::Find. For me, this dependency upon global state is its biggest failing. It's not just the Perlish global vars that form the interface that are a problem--ithreads heroically attempts, and mostly succeeds, in defending you against those--it's also the way it changes the state of the process by changing the process' current working directory. I'm not sure of the case under pthreads, but on Win32 the CWD is process-wide state at the OS level, so running File::Find in threads whilst it has this behaviour will never succeed.

    You should be able to use opendir/readdir/closedir successfully in a threaded environment.
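
    For illustration, a minimal recursive walker along those lines (only core Perl; the $callback argument is this sketch's convention, not anything from File::Find):

    sub walk {
        my ( $dir, $callback ) = @_;
        opendir( my $dh, $dir ) or do { warn "opendir $dir: $!\n"; return };
        my @entries = readdir $dh;
        closedir $dh;    # close before recursing, to conserve handles
        for my $entry (@entries) {
            next if $entry eq '.' or $entry eq '..';
            my $path = "$dir/$entry";
            lstat $path or next;    # fills the _ cache without any chdir
            # after lstat, -f _ and -d _ are both false for symlinks,
            # so links are neither counted nor followed
            if    ( -f _ ) { $callback->( $path, -s _ ) }
            elsif ( -d _ ) { walk( $path, $callback ) }
        }
    }

    # e.g. walk( '/nas/fs001', sub { my ( $path, $size ) = @_; print "$path,$size\n" } );

    Each thread gets its own lexical directory handles and its own data, and the process-wide CWD is never touched.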


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I think I may just be stuffed.

      Although that said, I believe File::Find _does_ have a no_chdir option. I shall investigate and experiment.

        Ah, yes - if I change the code to pass no_chdir, it seems to be running smoothly. It'll take a while to verify, but this _looks_ a lot more promising.
Re: File::Find in a thread safe fashion (speed)
by tye (Sage) on Jul 28, 2006 at 16:41 UTC

    The fastest way to use File::Find is to turn off the 'count nlinks' "optimization" and then avoid operators like "-s $_" in favour of "-s _" (note that a bare "-f" is just short for "-f $_" and so should become "-f _" as well).

    Note that treatment of "-l _" can sometimes be a problem with this scheme. Unfortunately, Perl doesn't allow you to cache more than one stat/lstat result so even if File::Find does both lstat and stat, you'll only have access to the last one done.

    The code for File::Find has become quite convoluted and I'm not going to spend hours trying to track what it is doing. But, based on what needs to be done (and on what I do when I roll my own replacement for File::Find, which I often find easier than trying to figure out the subtle vagaries of File::Find): if you set $File::Find::dont_use_nlink = 1 and don't ask File::Find to follow symbolic links, then File::Find will have to lstat every file and won't need to stat any of them. Your "wanted" sub should then get called such that "-l _" tells you whether or not the found item is a symbolic link (you can't tell anything about what a symbolic link points to without issuing your own stat, bypassing the "_" stat cache). And this is usually exactly what you want.

    So my suggestions for changes to your code are:

    #...
    use File::Find;
    $File::Find::dont_use_nlink = 1;    # Avoid slowing "optimization"
    #...
    my @dirs = qw(
        /nas/fs001 /nas/fs002 /nas/fs003 /nas/fs003 /nas/fs004
        /nas/fs005 /nas/fs006 /nas/fs007 /nas/fs008 /nas/fs009
        /nas/fs010 /nas/fs011 /nas/fs012 /nas/fs013 /nas/fs014
    );
    #...
    my $dir = pop @_;    # 1 arg only, because that lets me thread.
    #...
    if ( -f _ && !-l _ ) {
        my $filesize = -s _;
    #...

    Note that I replace pop with pop @_. Making the use of @_ implicit is against my best practices: I've seen code where that made it difficult to figure out how the subroutine arguments were being used. Being explicit also prevents bareword problems and eliminates the risk of confusion with the implicit @ARGV (which is what a bare pop operates on outside of a subroutine).

    I suspect you can drop the && ! -l _ from your code, since you'll have the cached results from lstat so -f _ being true will mean that the found item isn't a symbolic link. But leaving it in doesn't hurt either.

    - tye        

Re: File::Find in a thread safe fashion
by Moron (Curate) on Jul 28, 2006 at 11:48 UTC
    If the reason for the threading is performance, then it seems to me that you are recalculating the statvfs structure by hand, when it is more readily available via the Filesys::Statvfs module (or one of its brothers).

    The *nix command df -k uses that same system-internal structure, reporting it back one line per file system as device, blocks, #used, #available, %capacity and mount_point.

    On a huge Sun Solaris system with hundreds of file systems, I just got the result back from df -k in only 0.02 seconds and would expect a Perl program using such a Filesys module to perform comparably well.

    Update: If your needs are indeed limited to what df does, Filesys::Df will be easier to use, or Filesys::DfPortable if the code also has to run on any of Mac OS X, Unix, Linux, Windows 95 or later, and so on.

    More update: In addition, to get per-user, per-filesystem stats, you could enable disk usage quotas, without actually limiting usage, so that such information can be retrieved via the Quota module.
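
    For example, a minimal Filesys::Df sketch (the df() return fields are from that module's documentation as I remember it, so double-check the names; the mount points are yours):

    use Filesys::Df;

    # One df() call per mounted filesystem -- no tree walk at all.
    foreach my $fs ( "/nas/fs001", "/nas/fs002" ) {
        my $usage = df( $fs, 1024 );    # report in 1K blocks
        unless ($usage) {               # df() returns undef on failure
            warn "df() failed for $fs\n";
            next;
        }
        printf "%s: %d of %d 1K blocks used (%d%%)\n",
            $fs, $usage->{used}, $usage->{blocks}, $usage->{per};
    }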

    -M

    Free your mind

      Thanks, I'll look into those.

      My needs are _loosely_ what a 'du' does, but a little more complicated - I need this sort of breakdown (assuming a tree like):

      /usr            10mb
          /local       5mb
              /apache  5mb
          /include     1mb
          /system      1mb
      followed by a little hackery to assign different 'structures' to cost centres.

      So yes, doing a du of /usr, then of /usr/local, then of /usr/local/apache, would be a solution, but then I'd end up reading the tree lots of times, which'd get very expensive.
        If each such 'structure' could be put on a separate device partition, either directly or (probably handier) by using symbolic links to isolate it from where it normally lives, then Filesys::Df would still do the job more efficiently than having to recalculate the (f)statvfs yourself.

        Update: The way my own hosting supplier does it is to have a separate partition per client, symbolically link the top directory of each website structure in as a subdirectory of where the Apache server is installed, and alias each website to that directory in httpd.conf. They keep all the webmail in a different location, though, just because that has a different tariff per MB.

        I imagine the downside is that they have to automate partition allocation, and that it is hard for customers to give up disk space they have requested: extending partitions is significantly easier than reclaiming part of an allocated one, even more so if it has to be automated.

        -M

        Free your mind