Recursive file sizes

Preceptor has asked for the wisdom of the Perl Monks concerning the following question:

I'm sure it's somewhere in the Perl lifecycle: sooner or later, one needs to do recursive stuff on a directory and print the results.

I've just hit that point. My situation is this: I need to track disk usage of 'shares' on one of our NAS boxes. To do this, at present, I'm NFS-mounting the filesystems (it's a NAS, so multiprotocol) and doing a recursive file size count.

Unfortunately, I could really do with printing this information out 'tree' style.

So in a typical filesystem (example):

/fs001
/fs001/ITC
/fs001/ITC/LSC/
/fs001/ITC/LSC/users
/fs001/ITC/LSC/users/my_user
/fs001/ITC/LSC/users/some_user
/fs001/ITC/LSC/users/other_user

I need to be able to print total usage for each of the top 4 or 5 levels: /fs001, /fs001/ITC, /fs001/ITC/LSC, and so on down to /fs001/ITC/LSC/users/specific_user.
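One way to get all of those per-level totals from a single walk is to credit each file's size to every ancestor directory as the file is found. What follows is a minimal sketch under stated assumptions, not a drop-in replacement: `credit_ancestors` and `scan_tree` are hypothetical helper names, paths are assumed to be absolute, depth is counted from the filesystem root (so /fs001 is level 1), and the sizes in the demo are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Add $size to the running total of every ancestor directory of $file,
# stopping at $max_depth path components (hypothetical helper).
sub credit_ancestors {
    my ( $totals, $file, $size, $max_depth ) = @_;
    my @parts = grep { length $_ } split m{/}, $file;
    pop @parts;                                   # drop the file name
    $#parts = $max_depth - 1 if @parts > $max_depth;
    my $dir = '';
    for my $part (@parts) {
        $dir .= "/$part";
        $totals->{$dir} += $size;
    }
    return;
}

# Walk $root once and return a hash ref of directory => total bytes.
sub scan_tree {
    my ( $root, $max_depth ) = @_;
    my %totals;
    find(
        sub {
            my @st = lstat $_ or return;          # one stat call per entry
            return unless -f _;                   # count plain files only
            credit_ancestors( \%totals, $File::Find::name, $st[7], $max_depth );
        },
        $root
    );
    return \%totals;
}

# Demo with made-up sizes for the example layout; a real run would be
# something like: my $totals = scan_tree( '/fs001', 5 );
my %demo;
credit_ancestors( \%demo, @$_, 5 ) for (
    [ '/fs001/ITC/LSC/users/my_user/a.dat',      100 ],
    [ '/fs001/ITC/LSC/users/some_user/b.dat',    200 ],
    [ '/fs001/ITC/LSC/users/other_user/c/d.dat', 50 ],
);
printf "%-35s %d\n", $_, $demo{$_} for sort keys %demo;
```

Each file is stat'ed once and the totals for every level fall out of the same pass, so nothing has to be rolled up afterwards. Note that this sketch counts plain files only, so its totals will not include the sizes of the directory entries themselves.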

I've hacked together a script to do this that _sort_ of uses File::Find. Unfortunately, it's gone through a few evolutions since the first concept of it, and it's starting to look hacked and ugly. Basically, what I'm trying to accomplish is charging information (per department, per GB, where _generally_ each department is in a filesystem of the form /fs<number>/<customer>/<department>).

I then need to produce output from this in a useful fashion, so that I can list how much each customer owes and provide them with a 'per share' breakdown.
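The per-customer part comes down to a longest-prefix lookup: walk up from a directory until some ancestor appears in a path-to-owner table. A minimal sketch, assuming the table maps a path to a "customer:department" string; the `find_owner` helper and the table entries below are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the "customer:dept" entry for the deepest ancestor of $path that
# appears in the table, or "Unknown:Unknown" if none does.
sub find_owner {
    my ( $customers, $path ) = @_;
    my @parts = split m{/}, $path;
    while (@parts) {
        my $prefix = join '/', @parts;
        return $customers->{$prefix} if exists $customers->{$prefix};
        pop @parts;
    }
    return 'Unknown:Unknown';
}

# Hypothetical table; real entries would come from wherever the
# path-to-department mapping is kept.
my %customers = (
    '/fs001/ITC'     => 'ITC:General',
    '/fs001/ITC/LSC' => 'ITC:LSC',
);

for my $dir ( '/fs001/ITC/LSC/users/my_user', '/fs001/ITC/other', '/fs002/x' ) {
    my ( $customer, $dept ) = split /:/, find_owner( \%customers, $dir );
    print "$dir => customer $customer, department $dept\n";
}
```

Because the deepest matching ancestor wins, a department-level entry overrides a customer-level one without any special casing.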

I'm currently looking into doing a rewrite, because as it stands this script takes about 20 hours to complete, partly because it's doing it over NFS, and partly because I'm 'counting' approx 2TB of data. (And, I'd also imagine, because it's not _amazingly_ efficient.)

I'm also currently splurging out data into an XML file that tends to reach about 20-40MB, which I then have to run another bit of code across to trim down to a sensible size. The XML output is something I put in for convenience and portability, but in retrospect it is just that much more convoluted and evil.

Any enlightenment on a 'better' way would be greatly appreciated.

Thanks,
Ed Rolison

#!/usr/bin/perl

use File::Find;
use FileHandle;
use POSIX;
use URI::Escape;
use Time::HiRes qw ( usleep );

use strict;
use warnings;

#proggie to process filesystems on celerra and work out which departments
#own what.

#5Mb, if it's smaller then it doesn't bother recursing.
my $size_threshold = 5242880;
my $config_file = "/usr/local/apache/htdocs/dusage/disk_usage.conf";
my @dirs = ( "/fs001", "/fs002", "/fs003", "/fs004", "/fs005", "/fs006", 
          "/fs007", "/fs008", "/fs009", "/fs010", "/fs011", "/fs012", );
#my @dirs = ( "/SiteWide/home/erolison" );
my $debug = 0;
my $global_recurse = 5; #default value but used for some interesting calcs.
my @excl_directories = ( '.', '..' ); #cos they're unusual
my %is_excluded;
for ( @excl_directories ) { $is_excluded{$_} = 1; }

my %sizes;
my %customers;
my %totals;

#file-count buckets; start at 0 so the report never prints undef
my ( $allfiles, $tenk, $fiftyk, $hundredk, $twohundredk, $fivehundredk, $onemeg )
  = (0) x 7;

sub data
{
  my $factor = 1024;
  my @sequence = ('b ', 'kb', 'Mb', 'Gb', 'Tb' );
  foreach my $input_number (@_)
  {
    my $seq_num = 0;
    my $output_number = $input_number;
    if ( ! $input_number ) { return sprintf("%3.2fb", 0 ); }
    while ( $output_number / $factor >= 1 )
    {
      $seq_num++;
      $output_number /= $factor;
    }
    return sprintf("%3.2f$sequence[$seq_num]", $output_number);
  }
}

sub getsize
{
#watch out for this. It's a _very_ expensive call.
#look out for optimisations lower down in the chain
  my @args = @_;
  my $sum = 0;
 
  if ( $debug ) { print "Getting size for @args\n"; }

  find sub {
    my $size = -s $_;   #stat once; each -s is a separate stat call, costly over NFS
    if ( $size ) {
      $sum += $size;
      $allfiles++;
      if ( $size > 10240 )   { $tenk++ }
      if ( $size > 51200 )   { $fiftyk++ }
      if ( $size > 102400 )  { $hundredk++ }
      if ( $size > 204800 )  { $twohundredk++ }
      if ( $size > 512000 )  { $fivehundredk++ }
      if ( $size > 1048576 ) { $onemeg++ }
    }
    if ( $debug ) { usleep(1000) }
  }, @args;

  return $sum;
}

sub get_size_of_files
{
  my @args= @_;
  my $sum = 0;
  if ( $debug ) { print "Sizing files in @args\n"; }
  foreach my $iggle ( @args )
  {
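    opendir(IDIR, $iggle) or do {
      print "WARNING: Couldn't open $iggle: $!\n";
      next;   #don't readdir a handle that never opened
    };
    while ( defined ( my $fname = readdir ( IDIR ) ) )
    {
      #print ( "$iggle/$fname" );
      my $size = -s "$iggle/$fname";   #stat once per entry
      if ( $size ) { $sum += $size }
    }
    closedir (IDIR);   #directory handles need closedir, not close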
  }
  return $sum;
}

sub dusage
{
  my $startpoint = shift(@_) || '.';
  my $recurse_depth = shift(@_) || 0;
  my @dusage_list;
  #function to show disk usage of all subdirectories.

  if ( $debug ) { print "Reading $startpoint\n"; }
  if ( -d $startpoint)
  {
    #if ( $recurse_depth-- > 0  && $stuff{$startpoint} > $size_threshold )
    if ( $recurse_depth-- > 0 )
    {
      $sizes{$startpoint} = get_size_of_files($startpoint);
      if ( $debug ) { print "adding $sizes{$startpoint} to $startpoint\n"; }

      #my $tmp = $startpoint;
      #$tmp =~ s,/[A-Za-z0-9_\.\,\- ]+$,,g;
      # $stuff{$tmp} += $stuff{$startpoint};
      # if ( $debug ) { print "in lo adding value of $startpoint ( $stuff{$startpoint} ) to $tmp = $stuff{$tmp}\n" };
    
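      #lexical handle: dusage() recurses, so a shared bareword DIR is fragile
      opendir ( my $dh, $startpoint ) or do {
        print "WARNING: Couldn't open $startpoint: $!\n";
        return;
      };
      while ( defined ( my $filename = readdir($dh) ) )
      {
        if ( -d "$startpoint/$filename" && !($is_excluded{$filename}) )
        {
          push @dusage_list, "$startpoint/$filename";
        }
      }
      closedir ( $dh );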
      for my $dir (@dusage_list)
      {
        dusage("$dir", $recurse_depth);
      }
    }
    else
    {
   #only process the expensive bit, if we're not going to recurse 'deeper'
      $sizes{$startpoint} = getsize($startpoint);
    }
  }
}


sub do_output
{
  my $target = pop(@_);
  my $base_indent = ( $target =~ tr,/,, );
  $target =~ s,/,,g;
  if ( $debug ) { print "$target\n"; }

  open ( FILESTAT, ">file_sizes${target}.html" )
    or die "Can't write file_sizes${target}.html: $!";
  print FILESTAT "Total number of files = ", $allfiles, "<BR/>\n";
  print FILESTAT "files over 10k in size = ", $tenk, "<BR/>\n";
  print FILESTAT "files over 50k in size = ", $fiftyk, "<BR/>\n";
  print FILESTAT "files over 100k in size = ", $hundredk, "<BR/>\n";
  print FILESTAT "files over 200k in size = ", $twohundredk, "<BR/>\n";
  print FILESTAT "files over 500k in size = ", $fivehundredk, "<BR/>\n";
  print FILESTAT "files over 1M in size = ", $onemeg, "<BR/>\n";
  close ( FILESTAT );

  open ( REPORT, ">disk_usage${target}.xml" )
    or die "Can't write disk_usage${target}.xml: $!";
  open ( CSV, ">disk_usage${target}.csv" )
    or die "Can't write disk_usage${target}.csv: $!";
  my %output = %sizes;
  my %basic_sizes = %sizes;
  
  if ( $debug ) 
  {
    foreach my $item ( sort ( keys ( %sizes ) ) ) 
    {
      print ("directory size: $sizes{$item} = $item \n");
    }
  }
  while ( keys(%sizes) )
  {
    foreach my $value ( sort ( keys ( %sizes ) ) ) 
    {
      my $upd = $value;
      $upd =~ s,/[A-Za-z0-9_\.\,\- ]+$,,g;
      if ( $debug) { print "upd = $upd value = $value\n"; }
      if ( ! ( "$upd" eq "$value" ) )
      {
        $output{$upd} += $sizes{$value};
        $sizes{$upd} += $sizes{$value};
        if ( $debug ) { print "adding $value ( $sizes{$value} ) to $upd\n" }
      }
      delete($sizes{$value});
    }
  }
  if ( $debug ) 
  {
    foreach my $item ( sort ( keys ( %output ) ) )
    {
      print ("$output{$item} = $item \n");
    }
  }
  print REPORT "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
  print REPORT "<?xml-stylesheet type=\"text/xsl\" href=\"/dusage/disk_usage.xsl\"?>\n";
  print REPORT "<ALL>\n";
  print REPORT "<TITLE>Disk usage report for $target </TITLE>\n";
  print REPORT "<DATE>",strftime("%d/%m/%y", localtime(time)), "</DATE>\n";
  print REPORT "<BYDIRECTORY>\n";
  print REPORT "<TITLE>Listing by Directory</TITLE>\n";

  my $current_indent = -1;

  foreach my $item ( sort ( keys ( %output ) ))
  {
    #if ( $output{$item} > $size_threshold )
    #{
      my $base_object = $item; 
      $base_object =~ s,.*/,/,g;
      my $indent_depth = ( $item =~ tr,/,, );
      #my $indent_html = join("", "<L DIR=\"",$base_object,"\" DEPTH=\"", $indent_depth,"\" SIZE=\"", data($output{$item}),"\">\n");
      my $indent_html = join("", 
                           "<L", $indent_depth,
                           " DIR=\"",
                           uri_escape($base_object,"^A-Za-z0-9\-_.!~+ *'()\/"),
                           "\" SIZE=\"", data($output{$item}),
                           "\" DEPTH=\"", $indent_depth, "\">"
                            );
      #now we work out who 'owns' that data by doing substring matches
      #with the config array.
      
      my @dir_list = split("/", $item);
      my $owner = "";
      while ( !$owner && @dir_list )
      {
        my $srch_string = join("/", @dir_list);
                         #the 'dir' to look for in the customers array.
                         #might or might not have a trailing '/'
        $srch_string =~ s{/$}{}; #strip trailing / (was a quoted string, so it never substituted)
        #if ( $debug ) { print "$item: checking for \"$srch_string\"\n"; }
        if ( $customers{$srch_string} ) 
        {
          $owner = $customers{$srch_string};
        }
        pop (@dir_list);
      }
      if ( $debug ) { print "$owner\n"; }
      if ( $debug ) { print "$indent_depth to $current_indent\n"; }
      if ( $indent_depth <= $current_indent ) 
      {  
        for ( my $i = $current_indent; $i >= $indent_depth; $i-- )
        {
           print REPORT "</L",$i,">\n";
        }
      }
      if ( $indent_depth > $current_indent + 1 )
      {
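        #was "$i > $indent_depth", which can never be true here, so the
        #placeholder levels were never opened
        for ( my $i = $current_indent + 1; $i < $indent_depth; $i++ )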
        {
          print REPORT "<L",$i," PATH=\"more...\" DEPTH=\"$i\">";
        }
      }

      $current_indent = $indent_depth;
      print REPORT "\n",$indent_html,"\n";
      print REPORT "<SIZE>", data($output{$item}), "</SIZE>\n";
      print REPORT "<DIR>", uri_escape($base_object,"^A-Za-z0-9\-_.!~+ *'()\/"), "</DIR>\n";
      print REPORT "<FULL_PATH>", uri_escape($item,"^A-Za-z0-9\-_.!~+ *'()\/"), "</FULL_PATH>\n";
      print REPORT "<BSIZE>", $output{$item}, "</BSIZE>\n";
      my ( $lcust, $ldept ) = split(":", $owner);
      if ( not $lcust ) { $lcust = "Unknown"; };
      if ( not $ldept ) { $ldept = "Unknown"; };
      print REPORT "<CUSTOMER>", $lcust, "</CUSTOMER>\n";
      print REPORT "<DEPT>", $ldept,"</DEPT>\n";

      if ( $debug ) 
      { 
        printf ("%${indent_depth}s", data($output{$item})); 
        print ("\t $base_object\n"); 
      }
      if ( !$owner ) { $owner = "unknown:unknown" };

      my ( $customer, $dept ) = split (":", $owner );
      unless ( $totals{$customer}{'Total'}{'du'} ) 
           { $totals{$customer}{'Total'}{'du'} = 0; }
      if ( $basic_sizes{$item} )
      {
        $totals{$customer}{'Total'}{'du'} += $basic_sizes{$item}; 
      }
      push ( @{$totals{$customer}{'Total'}{'dirs'}}, $item );
      if ( $debug ) { print "$customer $dept = $basic_sizes{$item}\n"; }
      if ( $basic_sizes{$item} )
      {
        $totals{$customer}{$dept}{'du'} += $basic_sizes{$item}; 
      }
      push ( @{$totals{$customer}{$dept}{'dirs'}}, $item ); 
      #push ( @customer_chain, join(" ", $owner, $item, data($output{$item}) ) );
    #} #if size
  }
  for ( my $i = $current_indent; $i >= 0; $i--)
  {
    print REPORT "</L",$i,">\n";
  }

  print REPORT "</BYDIRECTORY>\n\n<BYCUSTOMER>\n";
  if ( $debug ) { print "Listing by Customer and Department\n"; }
  print REPORT "<TITLE>Listing by Customer and Department</TITLE>\n";
  print CSV "Customer, Dept, total usage (bytes),\n";
  foreach my $customer ( sort ( keys ( %totals ) ) )
  {
    foreach my $dept ( sort ( keys ( %{$totals{$customer}} ) ) )
    {
      if ( $debug ) { print "$customer $dept ", data($totals{$customer}{$dept}{'du'}), "\n"; }
      #print REPORT "<TR><TD>$customer $dept ", data($totals{$customer}{$dept}{'du'}), "</TD></TR>\n";
      print CSV $customer,",",$dept,",",$totals{$customer}{$dept}{'du'},",\n";
      print REPORT "<CUSTOMER G=\"$customer $dept\"><NAME>$customer</NAME>";
      print REPORT "<DEPT>", $dept, "</DEPT>";
      print REPORT "<USAGE>", data($totals{$customer}{$dept}{'du'});
      print REPORT "</USAGE></CUSTOMER>\n";
    }
  }

print REPORT "</BYCUSTOMER></ALL>\n";
close REPORT;
close CSV;
}

#MAIN

#print get_size_of_files("test");

my ($idir) = @ARGV;

if ( -f $config_file ) 
{
  open ( CONF, "<", $config_file ) or die "Can't read $config_file: $!";
  while ( <CONF> )
  {
    chomp;
    my ( $fs, $cust, $dept ) = split(":");
    $customers{$fs} = join(":", $cust, $dept);
    if ( $debug ) { print "got $fs - $cust - $dept\n"; }
  }
  close ( CONF );
}

if ( $debug ) { print keys ( %customers ); } 

STDOUT -> autoflush(1);

if ( ! $idir )
{ 
  foreach my $dir ( @dirs )
  {
    if ( $debug ) { print "\nSTARTING $dir\n"; }
    dusage ( $dir, $global_recurse );
  }
  do_output ( "ALL" );
}
else
{
  dusage ( $idir, $global_recurse );
  do_output($idir);
}

Replies are listed 'Best First'.
Re: Recursive file sizes by merlyn (Sage) on Sep 03, 2003 at 15:32 UTC
See "Disk Usage summarized" for a start on getting the hierarchical disk space.
-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
Re: Recursive file sizes by broquaint (Abbot) on Sep 03, 2003 at 14:27 UTC
Might I suggest also posting this to the newly opened¹ Code Review Ladder, which is ripe for some code to dissect and review. Having opened just a few days ago, it doesn't have many hard and fast rules or subscribers yet, but from the responses that've been posted so far I think you'll do well from it.
HTH
_________
broquaint
¹ See Simon Cozens' blog for more info.
Re: Recursive file sizes by BUU (Prior) on Sep 03, 2003 at 15:15 UTC
While this isn't a Perl solution, what about using something that seems to be designed for this task, namely a Windows port of the Linux program `du`, and just using Perl to parse its output, perhaps combined with --max-depth= or something like that?
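A rough sketch of what parsing that output might look like, assuming a `du` that accepts `-k` (sizes in kilobytes) and prints one "size, whitespace, path" line per directory. The helper names `parse_du_lines`, `within_depth` and `du_kb` are made up here, and the depth limit is applied in Perl rather than relying on a --max-depth option that not every `du` has.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn "du -k" style lines ("<kilobytes> <path>") into a path => KB hash ref.
sub parse_du_lines {
    my @lines = @_;
    my %kb;
    for my $line (@lines) {
        chomp $line;
        my ( $size, $path ) = $line =~ /^(\d+)\s+(.+)$/ or next;
        $kb{$path} = $size;
    }
    return \%kb;
}

# True if $path is $root itself or at most $max_depth levels below it.
sub within_depth {
    my ( $root, $path, $max_depth ) = @_;
    return 0 unless index( $path, $root ) == 0;
    my $levels = () = substr( $path, length $root ) =~ m{/[^/]}g;
    return $levels <= $max_depth ? 1 : 0;
}

# Run du once over $root (list-form pipe open, so no shell is involved).
sub du_kb {
    my ($root) = @_;
    open my $du, '-|', 'du', '-k', $root or die "Cannot run du: $!";
    my $kb = parse_du_lines(<$du>);
    close $du or warn "du exited with status $?\n";
    return $kb;
}

# Demo on canned output; a real run would call du_kb('/fs001') instead.
my $kb = parse_du_lines( "4\t/fs001/ITC/LSC\n", "12\t/fs001/ITC\n", "20\t/fs001\n" );
for my $path ( sort keys %$kb ) {
    next unless within_depth( '/fs001', $path, 1 );
    print "$kb->{$path} KB\t$path\n";
}
```

Bear in mind that `du` reports disk space allocated rather than the sum of file lengths, so its figures will not match a `-s` based count exactly.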
Re: Re: Recursive file sizes by Preceptor (Deacon) on Sep 03, 2003 at 15:26 UTC
It doesn't explicitly need to be Windows. I'll try and get hold of a Linux box and take a look at du. (At the moment I have a choice of Win2000 or Solaris, and I'd never really considered using Perl on the former, since I have the latter.) I'd largely rejected du, primarily because in its POSIX form it just prints the current level (so in order to show sizes at each of the levels, you're effectively doing the job 5 times).
Re: Re: Re: Recursive file sizes by BUU (Prior) on Sep 03, 2003 at 18:24 UTC
Dunno about the POSIX form, but the one that comes with Cygwin at least recurses through directories, giving me something like:

3.0k ./NetStorm/w
10M ./NetStorm/d
0 ./NetStorm/import
4.2M ./NetStorm/sound
2.2M ./NetStorm/help
32M ./NetStorm
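Given a path-to-size listing like that, the 'tree' style printout is mostly a matter of sorting and indenting by depth. A small sketch under that assumption; `format_tree` is a made-up helper, and it simply echoes whatever size strings it is handed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Render a path => size hash as indented "tree" lines (hypothetical helper).
sub format_tree {
    my ($sizes) = @_;
    # Sort with "/" mapped to NUL so children always follow their parent,
    # even when a sibling's name contains characters that sort before "/".
    my @paths = map  { $_->[1] }
                sort { $a->[0] cmp $b->[0] }
                map  { ( my $key = $_ ) =~ tr{/}{\0}; [ $key, $_ ] }
                keys %$sizes;
    my @lines;
    for my $path (@paths) {
        my @parts  = grep { length($_) && $_ ne '.' } split m{/}, $path;
        my $indent = '  ' x ( @parts ? $#parts : 0 );
        my $name   = @parts ? $parts[-1] : $path;
        push @lines, sprintf '%s%s  %s', $indent, $name, $sizes->{$path};
    }
    return @lines;
}

# The listing from the post above, keyed by path.
my %sizes = (
    './NetStorm'        => '32M',
    './NetStorm/w'      => '3.0k',
    './NetStorm/d'      => '10M',
    './NetStorm/import' => '0',
    './NetStorm/sound'  => '4.2M',
    './NetStorm/help'   => '2.2M',
);
print "$_\n" for format_tree( \%sizes );
```

A plain `sort keys` gets the parent-before-child order right most of the time, but a sibling such as `NetStorm-x` would land between `NetStorm` and its children, which is why the sort key swaps "/" for NUL.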