in reply to Re^2: Manage Directory Structure
in thread Manage Directory Structure

You need to consider a different approach -- something that will make efficient use of existing tools for doing basic things, and that will reduce the comparison problem to a simple matter of string diffs between two plain-text listings (i.e. using the standard "diff" tool that comes with unix/linux). There's no need to have 1 GB data structures in memory.

How about breaking the problem down to three separate procedures:

  1. Create a sorted list of all the directories of interest on each scan.
  2. For each directory, create a separate sorted list of the symlinks and data files in that directory.
  3. To find differences between yesterday and today, use the standard unix/linux "diff" tool on consecutive directory lists, and on consecutive file lists for each directory.

File::Find will be good for the first step, though you might want to consider just using the available unix/linux tools:

find /path/of_interest -type d | sort > /path/for_log_info/dirlist.yymmdd

Using "diff" on two consecutive "dirlist" files will reveal the addition or removal of directories.
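For instance, the comparison might look like this (the paths, dates, and directory names below are made up for illustration, not taken from any real scan):

```shell
# Sketch of interpreting diff on two consecutive dirlist files.
work=$(mktemp -d)
printf '%s\n' /data/a /data/b /data/c > "$work/dirlist.090704"
printf '%s\n' /data/a /data/c /data/d > "$work/dirlist.090705"

# "<" lines are directories present only in the older list (removed);
# ">" lines are present only in the newer list (added).
diff "$work/dirlist.090704" "$work/dirlist.090705" || true
```

Since the lists are sorted, `diff` reduces the whole comparison to a handful of `<` and `>` lines, no matter how large the tree is.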

For step 2, I would do something like:

open( DLIST, "<", $dirlist ) or die "cannot read $dirlist: $!\n";
while ( my $dir = <DLIST> ) {
    chomp $dir;
    opendir D, $dir or do {
        warn "opendir failed on $dir: $!\n";
        next;
    };
    ( my $file_list_name = $dir ) =~ tr{/}{%};
    open( FLIST, ">", "$log_path/$file_list_name.$today" )
        or die "cannot write to $log_path/$file_list_name.$today: $!\n";
    for my $file ( sort grep { !-d "$dir/$_" } readdir( D ) ) {
        # check for symlink vs. datafile
        # gather other stat info as needed,
        # print a nicely formatted line to FLIST
    }
    close FLIST;
    closedir D;
}
close DLIST;
With that done, running the basic "diff" command on two consecutive file listings for a given directory (assuming that the directory existed on both days) will tell you which files changed, which were added, and which were removed. Just figure out what you want to do with the output from "diff".
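A driver for that step could be as simple as the following sketch; the %-encoded listing names and .yymmdd suffixes follow the scheme above, but the directory names and dates are invented for illustration:

```shell
# Sketch: run diff over yesterday's and today's listing for each directory.
log=$(mktemp -d)
printf 'a.txt\nb.txt\n' > "$log/%data%proj.090704"
printf 'a.txt\nc.txt\n' > "$log/%data%proj.090705"

for old in "$log"/*.090704; do
    new="${old%.090704}.090705"
    if [ ! -f "$new" ]; then
        echo "directory removed: ${old%.090704}"
    elif ! diff -q "$old" "$new" > /dev/null; then
        echo "changed: ${old%.090704}"
    fi
done
```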

Re^4: Manage Directory Structure
by boardryder (Novice) on Jul 05, 2009 at 04:34 UTC
    GrandFather, thanks for the advice on proper programming; I cleaned up a lot of that code. I've become too lazy, and that gave me a little kick. I'm interested in Algorithm::Diff but will need time to investigate.

    graff, that's just way too simple. I'm incredibly impressed, and I've already got a good start on it. I like it, and it seems incredibly fast; however, I've found a few caveats. I need to use find -follow due to my file structure (I need to follow links pointing to other partitions), but I receive loads of errors when it finds null links. How can I manage this and send the output to a file, or is there a different approach? Also, I had to use an egrep -v to filter out directories I don't want. Is there another solution, since I always try to stay away from using system commands within perl?

    UPDATE: If I use find -type d -follow and it detects a change while scanning, it dies and the rest of my program continues on, leaving me with only a partial directory scan. I'm trying to work in File::Find, but does anyone have other suggestions?
      If your "path of interest" contains symlinks to directories on other disk volumes or partitions, and if it turns out that some of the symlinks point to non-existent or unavailable paths (e.g. pointing to a volume that isn't currently mounted), one solution would be to stick with my 3-step procedure (and still don't use  -follow in the find command at step 1), but subdivide step 2 into a few separate steps:

      • 2. For each directory, scan its immediate contents with readdir()

        • 2.a. Accumulate data files and symlinks in separate lists

        • 2.b. For each element of the symlink list, determine its target, determine whether the target exists, and if so determine whether it is a data file, a directory, or another symlink. The output listing for the directory currently being scanned should contain just this information about the symlinks.

        • 2.c. For each element of the data file list, get the other stat info you need and report this in the listing for the current directory.
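The sub-steps above might be sketched like this; the listing format, the scan_dir name, and the tab-separated fields are all my own invention, not part of the original post:

```perl
use strict;
use warnings;
use File::Spec;

# scan_dir: scan one directory's immediate contents (no recursion),
# separating symlinks (step 2.b) from data files (step 2.c).
# Returns one formatted listing line per entry.
sub scan_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or do { warn "opendir failed on $dir: $!\n"; return };
    my @lines;
    for my $name ( sort grep { $_ ne '.' && $_ ne '..' } readdir $dh ) {
        my $path = File::Spec->catfile( $dir, $name );
        if ( -l $path ) {                     # true even for broken links (uses lstat)
            my $target = readlink $path;
            my $kind   = !-e $path ? 'BROKEN' # -e follows the link to its target
                       : -d $path  ? 'dir'
                       : -f $path  ? 'file'
                       :             'other';
            push @lines, "SYMLINK\t$name\t$target\t$kind";
        }
        elsif ( -f $path ) {                  # plain data file
            my @st = stat $path;
            push @lines, sprintf "FILE\t%s\t%d\t%d", $name, $st[7], $st[9];
        }
        # subdirectories are already covered by the step-1 dirlist, so skip them
    }
    closedir $dh;
    return @lines;
}

print "$_\n" for scan_dir( shift // '.' );
```

Because each line carries the symlink's target and its status, a plain `diff` of two days' listings shows added, removed, broken, and retargeted links directly.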

      Based on that treatment, you'll know whether symlinks have been added or removed from within the directory tree of interest, you'll know which ones are broken, and for the ones that work, you'll know what sort of thing they point to, and (comparing listings from consecutive days) whether there has been a change in their target path. That should be all you need to know about symlinks per se.

      For the ones that work, you won't know from the symlink listing whether the content of the target path has changed (i.e. change of a data file or change of directory contents). But if the target is a datafile or directory within the current tree of interest, that information will be available elsewhere in your overall output.

      And if the target is on some other disk volume/partition, running this same process on that volume (on the relevant directory tree of interest) will tell you what you want to know about that content.

      There will be some serious work in keeping all this information organized and managed properly, to make sure that everything gets covered with (ideally) no redundancy -- e.g. a process that scans the output listings for a given directory tree and launches this same process on other volumes as needed to cover all the cross-volume symlinks. I'll leave that as an exercise. ;)

      (Good luck with the case of "symlink points to symlink points to symlink...", and with cases of "relative" as opposed to "absolute" target paths. You may need to do the equivalent of bash's "pushd / popd" within your perl script to test for target existence of relative-path symlinks.)
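One way to sketch that pushd/popd idea in perl is below; resolve_link is a hypothetical helper of my own, not something from the thread, and it simply chdirs into the link's own directory so a relative target resolves the way the kernel would resolve it:

```perl
use strict;
use warnings;
use Cwd qw(getcwd abs_path);
use File::Basename qw(dirname);

# resolve_link: return the absolute, fully-resolved target of a symlink,
# or undef if the link (or any link in the chain) is broken.
sub resolve_link {
    my ($link) = @_;
    my $target = readlink $link;
    return undef unless defined $target;
    my $saved = getcwd();                    # "pushd"
    chdir dirname($link) or return undef;
    # abs_path also follows chains of symlinks ("symlink points to
    # symlink..."), so the result is the final real path
    my $resolved = -e $target ? abs_path($target) : undef;
    chdir $saved;                            # "popd"
    return $resolved;
}
```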

        I followed graff's advice, and all seems to be working well so far after breaking out a new file for each directory. I took one shortcut to speed up implementation: I use File::Find to do my traversal, and there are certain directories I want to skip, so I added if ($File::Find::name =~ /$skip/){return;} to my &wanted. This works, but although the matching paths are not output, File::Find still recurses into the undesired directories, eating precious processing time.

        I figure my only option is to not use File::Find, or is there a way around this?
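One way around this, as a sketch rather than anything from the thread: setting $File::Find::prune inside wanted() tells File::Find not to descend into the current directory at all, instead of merely suppressing its output. The $skip pattern, directory names, and list_dirs wrapper below are illustrative:

```perl
use strict;
use warnings;
use File::Find;

my $skip = qr{/skipme$};   # hypothetical pattern for directories to skip

my @found;
sub wanted {
    if ( -d $File::Find::name && $File::Find::name =~ $skip ) {
        $File::Find::prune = 1;   # skip this whole subtree, don't just hide it
        return;
    }
    push @found, $File::Find::name if -d $File::Find::name;
}

# list_dirs: collect all directories under $top, honoring the prune above.
# no_chdir keeps $File::Find::name usable as a full path inside wanted().
sub list_dirs {
    my ($top) = @_;
    @found = ();
    find( { wanted => \&wanted, no_chdir => 1 }, $top );
    return @found;
}
```

Note that $File::Find::prune is reset before each call to wanted(), so it only suppresses recursion for the directory currently being examined.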