in reply to Manage Directory Structure

Can we get a little more info? A more detailed example? Have you tried anything yet?

I can't figure out what you actually want... Do you want to create one file per directory, such that this one file contains the concatenation of all data found in all other data files in that directory? Sounds like maybe you want to create tar files? (If not, then what? And why?)

Update: what is it about your directory that's "too big"? File count? Total byte count? (Both?) How big is "too big" (what sort of numbers are you looking at)?

Re^2: Manage Directory Structure
by boardryder (Novice) on Jul 04, 2009 at 03:04 UTC
    My ultimate goal is to recursively scan all of a specified directory's contents on day x, and on day x+1 scan again. Then compare day x to day x+1 for differences in files (new/missing), file size, directory size, and directory content count, and report the differences.

    My solution below seems to be a good start, but my full directory listing creates 1 GB flat files that cannot be read into a hash (hash creation causes the program to die from running out of memory).

    "Do you want to create one file per directory, such that this one file contains the concatenation of all data found in all other data files in that directory?"
    I want each directory's file to contain a listing of that directory's contents.

    Any other suggestions on my method toward the end goal are welcome. :) Thank you!

    #!/usr/local/bin/perl -w
    use Date;
    use File::Find;
    use Data::Dumper;

    my $d = Date->new();
    my $today = $d->yyyymmdd;
    my $yesterday = $d->yesterday;
    my $progname = `basename $0`;
    chomp($progname);
    my $main = "/home/users/user";
    my $out = "/home/users/user/tmp/$progname_$today.dat";
    my $comp = "/home/users/user/tmp/$progname_$yesterday.dat";
    my $final = "/home/users/user/tmp/$progname_compare.dat";
    my $skip = qw/skip1|skip2|skip3/;
    my $max_size_diff = "25";
    my $max_file_diff = "0";

    ## Main ##
    open (OUT, ">$out");
    find ({wanted => \&data_for_path, follow=>1, follow_skip=>2},
        call_dir($p=1), call_dir($p=2), call_dir($p=3));
    close OUT;
    process();

    ## Builds the directory structure by using File::Find to recurse into each subdir ##
    sub data_for_path {
        $size_d = 0;
        if ($File::Find::name =~ /$skip/){return;}
        if (-d $File::Find::name){
            $directory = $File::Find::name;
            print OUT "$directory\tDirectory\tNULL\tNULL\tNULL\n";
        }
        if (-f $File::Find::name){
            my $file = $File::Find::name;
            my $size_f = -s $file;
            $size_d += -s $file;
            print OUT "$file\tFile\t$size_f\tNULL\t0\n";
        }
        if (-l $File::Find::name){
            my $name = $File::Find::name;
            my $link = readlink $name;
            my $size_l = -s $link;
            print OUT "$name\tLink\t$size_l\t$link\t0\n";
        }
    }

    ## Calls which directory path to use from the main ##
    sub call_dir{
        if ($p == 1){
            $sub_dir = "$main/tmp/";
        }
        elsif ($p == 2){
    #        $sub_dir = "$main/data";
        }
        elsif ($p == 3){
            $sub_dir = "$main/exe";
        }
    }

    ## Processes flat files ##
    sub process {
        open (TODAY_IN, $out);
        foreach my $line_t (<TODAY_IN>){
            ($path, $type, $size, $link, $nfiles) = split(/\s+/, $line_t);
            $today{$path}{'Type'}   = $type;
            $today{$path}{'Size'}   = $size;
            $today{$path}{'Link'}   = $link;
            $today{$path}{'NFILES'} = $nfiles;
        }
        close TODAY_IN;
        open (YESTERDAY_IN, $comp);
        foreach my $line_y (<YESTERDAY_IN>){
            ($path, $type, $size, $link, $nfiles) = split(/\s+/, $line_y);
            $yesterday{$path}{'Type'}   = $type;
            $yesterday{$path}{'Size'}   = $size;
            $yesterday{$path}{'Link'}   = $link;
            $yesterday{$path}{'NFILES'} = $nfiles;
        }
        close YESTERDAY_IN;
    #    print Dumper %today;
    #    print Dumper %yesterday;
        diff(%today, %yesterday);
        print Dumper %diffOut;
    }

    ## Diffs today's directory structure against yesterday's ##
    sub diff {
        open (COMP, ">$final");
        foreach $key (keys %today){
            if (exists $yesterday{$key}){
                $size_t   = $today{$key}{'Size'};
                $size_y   = $yesterday{$key}{'Size'};
                $nfiles_t = $today{$key}{'NFILES'};
                $nfiles_y = $yesterday{$key}{'NFILES'};
                if ($size_y > 0 && $size_t > 0){
                    if ($size_t > $size_y){
                        my $diff_t = (1-($size_y/$size_t))*100;
                        if ($diff_t >= $max_size_diff){
                            $diffOut{$key}{'SizeYest'}  = $size_y;
                            $diffOut{$key}{'SizeToday'} = $size_t;
                            $diffOut{$key}{'SizeDiff'}  = $diff_t;
                            print COMP "$key\tYEST:$diffOut{$key}{'SizeYest'}\tTOD:$diffOut{$key}{'SizeToday'}\tDIFF:$diffOut{$key}{'SizeDiff'}\n";
                        }
                    }
                    elsif ($size_y > $size_t){
                        my $diff_y = (1-($size_t/$size_y))*100;
                        if ($diff_y >= $max_size_diff){
                            $diffOut{$key}{'SizeToday'} = $size_t;
                            $diffOut{$key}{'SizeYest'}  = $size_y;
                            $diffOut{$key}{'SizeDiff'}  = $diff_y;
                            print COMP "$key\tYEST:$diffOut{$key}{'SizeYest'}\tTOD:$diffOut{$key}{'SizeToday'}\tDIFF:$diffOut{$key}{'SizeDiff'}\n";
                        }
                    }
                    if (-d $key){
                        if ($nfiles_y > 0 && $nfiles_t > 0){
                            $diffFiles = $nfiles_t-$nfiles_y;
                            if ($diffFiles > $max_file_diff){
                                $diffOut{$key}{'FileDiff'} = $diffFiles;
                                print COMP "$key\tFDIFF:$diffOut{$key}{'FileDiff'}\n";
                            }
                        }
                    }
                }
            }
            else {
                $diffOut{$key}{'SizeToday'} = $size_t;
                $diffOut{$key}{'SizeYest'}  = 0;
                $diffOut{$key}{'SizeDiff'}  = "New";
                print COMP "$key\tYEST:$diffOut{$key}{'SizeYest'}\tTOD:$diffOut{$key}{'SizeToday'}\tDIFF:$diffOut{$key}{'SizeDiff'}\n";
            }
        }
        close COMP;
        print "Done!\n";
    }

      First off: always use strictures (use strict; use warnings;).

      my $out = "/home/users/user/tmp/$progname_$today.dat";

      should be

      my $out = "/home/users/user/tmp/${progname}_$today.dat";

      or $progname_ will be seen as the variable name rather than $progname. Strictures should also force you to think a little more about lifetime of variables and how you pass information around to different parts of your program.

      call_dir($p=1) ... sub call_dir { if ($p == 1) ...

      would be better as:

      call_dir(1) ... sub call_dir { my ($p) = @_; if ($p == 1) ...

      although passing an array into the find call would be even better:

      my $main = "/home/users/user";
      my @rootDirs = ("$main/tmp/", "$main/exe");
      ...
      find ({wanted => \&data_for_path, follow=>1, follow_skip=>2}, @rootDirs);

      and avoids the interesting mismatch between the calls to call_dir and the implementation (3 calls, 2 valid return values).

      However, I'd be inclined to read and parse today's and yesterday's files in parallel to detect differences. That avoids having any more than a few lines of data in memory at any given time, but may make the parsing a little more interesting.
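
      A minimal sketch of that parallel read, assuming each day's dump is written sorted by path (the file names here are placeholders):

      use strict;
      use warnings;

      open my $tFh, '<', 'scan_today.dat'     or die "today: $!";
      open my $yFh, '<', 'scan_yesterday.dat' or die "yesterday: $!";

      my $t = <$tFh>;
      my $y = <$yFh>;
      while (defined $t or defined $y) {
          my ($tPath) = defined $t ? split /\t/, $t : '';
          my ($yPath) = defined $y ? split /\t/, $y : '';
          if (!defined $y or (defined $t and $tPath lt $yPath)) {
              print "NEW: $tPath\n";        # path seen today only
              $t = <$tFh>;
          }
          elsif (!defined $t or $yPath lt $tPath) {
              print "MISSING: $yPath\n";    # path seen yesterday only
              $y = <$yFh>;
          }
          else {
              print "CHANGED: $tPath\n" if $t ne $y;  # same path, fields differ
              $t = <$tFh>;
              $y = <$yFh>;
          }
      }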

      If instead you read the files in parallel as suggested above, but load a complete directory of files at a time into two arrays of lines (one for today's files and one for yesterday's), you could then use Algorithm::Diff to do the heavy lifting of differencing the two file sets. That limits the data in memory to one directory's worth of files (times two: one image for each day), but probably simplifies the diff parsing substantially.
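
      For example, a minimal sketch with made-up listing lines (the real ones would come from the per-directory chunks of each day's dump):

      use strict;
      use warnings;
      use Algorithm::Diff qw(diff);

      # Made-up listing lines for one directory (path\ttype\tsize).
      my @yesterday = ("a.dat\tFile\t100\n", "b.dat\tFile\t200\n");
      my @today     = ("a.dat\tFile\t150\n", "c.dat\tFile\t300\n");

      # diff() returns hunks of [op, position, line]: '-' lines appear
      # only in yesterday's listing, '+' lines only in today's.
      for my $hunk (diff(\@yesterday, \@today)) {
          for my $change (@$hunk) {
              my ($op, $pos, $line) = @$change;
              print $op eq '+' ? "added/changed:   $line"
                               : "removed/changed: $line";
          }
      }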


      True laziness is hard work
      You need to consider a different approach -- something that will make efficient use of existing tools for doing basic things, and that will reduce the comparison problem to a simple matter of string diffs between two plain-text listings (i.e. using the standard "diff" tool that comes with unix/linux). There's no need to have 1 GB data structures in memory.

      How about breaking the problem down to three separate procedures:

      1. Create a sorted list of all the directories of interest on each scan.
      2. For each directory, create a separate sorted list of the symlinks and data files in that directory.
      3. To find differences between yesterday and today, use the standard unix/linux "diff" tool on consecutive directory lists, and on consecutive file lists for each directory.

      File::Find will be good for the first step, though you might want to consider just using the available unix/linux tools:

      find /path/of_interest -type d | sort > /path/for_log_info/dirlist.yymmdd

      Using "diff" on two consecutive "dirlist" files will reveal the addition or removal of directories.

      For step 2, I would do something like:

      open( DLIST, "<", $dirlist );
      while ( my $dir = <DLIST> ) {
          chomp $dir;
          opendir D, $dir or do {
              warn "opendir failed on $dir: $!\n";
              next;
          };
          ( my $file_list_name = $dir ) =~ tr{/}{%};
          open( FLIST, ">", "$log_path/$file_list_name.$today" )
              or die "cannot write to $log_path/$file_list_name.$today: $!\n";
          for my $file ( sort grep { !-d "$dir/$_" } readdir( D )) {
              # check for symlink vs. datafile
              # gather other stat info as needed,
              # print a nicely formatted line to FLIST
          }
          close FLIST;
          closedir D;
      }
      close DLIST;
      With that done, running the basic "diff" command on two consecutive file listings for a given directory (assuming that the directory existed on both days) will tell you which files changed, which were added, and which were removed. Just figure out what you want to do with the output from "diff".
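
      If you'd rather post-process that inside perl than eyeball it, one possible sketch (the listing file names are hypothetical) reads the "diff" output from a pipe:

      use strict;
      use warnings;

      # Hypothetical per-directory listing files for two consecutive days.
      my ( $old, $new ) = ( '%home%user%data.090703', '%home%user%data.090704' );

      open( my $diff, "-|", "diff", $old, $new ) or die "cannot run diff: $!\n";
      while ( my $line = <$diff> ) {
          if    ( $line =~ /^< / ) { print "removed or changed: $line" }
          elsif ( $line =~ /^> / ) { print "added or changed:   $line" }
      }
      close $diff;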
        GrandFather, thanks for the advice on proper programming; I cleaned up a lot of that code. I've become too lazy, and that gave me a little kick. I'm interested in Algorithm::Diff but will need time to investigate.

        graff, that's just way too simple; I'm incredibly impressed, and I've already got a good start on it. I like it, and it seems incredibly fast; however, I've found a few caveats. I need to use find -follow due to my file structure (I need to follow links pointing to other partitions), but I receive loads of errors when it finds null links. How can I manage this and still output to a file, or is there a different approach? Also, I had to use egrep -v to filter out directories I don't want. Is there another solution, as I always try to stay away from using system commands within Perl?

        UPDATE: If I use find -type d -follow and it detects a change while scanning, it dies and the rest of my program continues on, leaving me with only a partial directory scan. I'm trying to switch to File::Find, but are there any other suggestions?
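
        For reference, a minimal sketch of the File::Find setup I'm experimenting with (the roots and skip pattern are placeholders). With follow in effect, File::Find sets $File::Find::fullname to undef for dangling ("null") links, so they can be skipped quietly instead of producing errors:

        use strict;
        use warnings;
        use File::Find;

        my @roots = ('/home/users/user/tmp', '/home/users/user/exe');
        my $skip  = qr/skip1|skip2|skip3/;

        find({
            wanted      => \&scan,
            follow      => 1,    # follow links onto other partitions
            follow_skip => 2,    # don't die on files reachable twice
        }, @roots);

        sub scan {
            return if $File::Find::name =~ $skip;        # replaces the egrep -v
            return unless defined $File::Find::fullname; # dangling ("null") link
            print "$File::Find::name\n";                 # record the entry here
        }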