Re: Comparing strings to reduce excess results

Regular expressions aren't necessary if you sort the hash keys first and take advantage of the hierarchical nature of Win32 pathnames. Sorting ensures that a parent directory always comes before its subdirectories, so the shortest path that has been created or deleted shows up first and gets added first to your final list of changes.

I'm not exactly sure what %current and %prior hold. Assuming that %current stores the results of comparing a past and current directory listing (created directories assigned the value 'created', deleted ones assigned the value 'deleted'), here is how you would show only the topmost directory that has been created or deleted.

# always use strict and warnings - debugging for free!
use strict;
use warnings;

my %current = ('C:\1\x\3\4' => 'created',
               'C:\1\x\3'   => 'created',
               'C:\1\x'     => 'created',
               'C:\1\2\3\4' => 'deleted',
               'C:\1\2\3'   => 'deleted',
               'C:\1\2'     => 'deleted'
              );
my %unique;
node:
foreach my $node (sort keys %current) {
  #look for parent directories in %unique
  #skip this one if we find a match
  my @aNames = split(/\\/, $node);
  my $k='';
  for my $i (0..$#aNames) {
    $k .= '\\' if $i;
    $k .= $aNames[$i];
    next node if exists $unique{$k};
  }

  #no match, so this is the actual directory that moved
  # $k now equals $node (the full path rebuilt from its components)
  $unique{$node} = $current{$node};
}

use Data::Dumper qw(Dumper);
print(Dumper(\%unique));

#outputs (key order may vary)
#$VAR1 = {
#          'C:\\1\\x' => 'created',
#          'C:\\1\\2' => 'deleted'
#        };

To understand the above code, you may find it helpful to look up the following links in the Perl documentation:

strict - explains what use strict does and why you should use it
warnings - explains what use warnings does and why you should use it
sort - Perl sorting - provides a default, but also allows you to roll your own sorting rules.
split - break up a string into substrings
next - explains what next node means.
perlop - this is where to look for information about operators - I assume you already know what '.' means, but just in case ...
Data::Dumper - dumps the contents of a variable

Best, beth

Re^2: Comparing strings to reduce excess results by grub_ (Initiate) on Sep 16, 2009 at 05:23 UTC
Beth,

That may all work but i need to modify my program to test it. Because I don't understand some of it the modification is going to be hard. Just a few questions:

what is the line "node:" for? Is that where the "next node" directs?

How does it test if something exists in $unique{$k} when there is nothing in there yet? Basically the whole for loop has me confused.

I've actually started with a prior hash that has all the structure listed in the keys and the value as deleted when compared with the current hash which has keys as the structure but the values as created if when compared to prior it is a new structure.

I fear my whole program is excessive long-winded babble. I've seen that i can use a library ChangeNotify to give live updates. But this script is only run every day or so to keep track of accidental changes. Any other ideas?

# FIND CURRENT DIRECTORY STRUCTURE AND PRINT TO "CATALOG_CURRENT.TXT"
sub myList {
  if ( -d ) {
    print CURRENT "$File::Find::name\n";
  }
}

# GIVE "DELETED" VALUE TO KEYS IN "PRIOR" HASH THAT NO LONGER EXIST
sub MarkNotExist {
  foreach $node (keys(%prior)) {
    ($prior{$node} = "deleted"), (++$change_count) if !exists $current{$node};
  }
}

# GIVE "CREATED" VALUE TO KEYS IN "CURRENT" HASH THAT NOW EXIST
sub MarkNewExist {
  foreach $node (keys(%current)) {
    ($current{$node} = "created"), (++$change_count) if !exists $prior{$node};
  }
}

Thanks.
Re^3: Comparing strings to reduce excess results by ELISHEVA (Prior) on Sep 16, 2009 at 08:33 UTC
"what is the line "node:" for?"

You understand this correctly. `node:` is a loop label. Normally, `next` skips to the end of the nearest enclosing loop and begins the next iteration of that loop, but if you use `next` with a label, it goes to the end of whichever loop carries that label and begins the next iteration of that loop. So `next node` says: go to the end of the loop that is labeled "node", i.e. the outermost loop, and begin working on the next node (file or directory name). For another example of using next with a label, see next.

You may also be confused because there is both a label `node:` and a variable `$node`. They are very different things. I could have (and probably should have) used `$fileOrDirName` instead of `$node` for the variable. Then they wouldn't look like they were the same. Maybe that would have been less confusing?

"How does it test if something exists in $unique{$k} when there is nothing in there yet?"

`exists $unique{$k}` is a bit confusing, I agree. Normally `$unique{$k}` means "get the value for key `$k`", but with exists, it has a special meaning: "look at the keys of `%unique` and see if any of them are `$k`". While `%unique` is still empty, exists simply returns false for every prefix, so the first path is never skipped and becomes the first entry. See exists.

"I've actually started with a prior hash that has all the structure listed in the keys and the value as deleted when compared with the current hash which has keys as the structure but the values as created"

Your current solution has two problems. The first is that it is awfully complicated to work with. Why not put both the notifications of creation and deletion into a single hash, e.g. `$changes{$filename} = $updatetype`? Then you can get rid of all of that ugly cross referencing between the two hashes.

You don't really need separate hashes even if later on in your program you just need a list of deletions or creations. If you want to get just the creation notifications from `%changes`, you can use the wonderful Perl function grep, i.e. `@aCreated=grep { $changes{$_} eq 'created' } keys %changes`.
If you want just the deletion notifications, you can get `@aDeleted=grep { $changes{$_} eq 'deleted' } keys %changes`.

The more important problem is that the two hash solution will break if you have multiple changes to the same file during the day or two between running your report. When that happens, you'll need to know the order of events, e.g. 'create', 'delete' would amount to no change at all. 'create', 'delete', 'create' would amount to 'create'. Using two hashes you'd end up with only 1 create and 1 delete even though in fact you had 'create', 'delete', 'create'. This would falsely make you think you had no net change since all you would see is one delete and one create.

Putting everything in one hash can solve that problem, but you will have to use a HoA structure (a hash of array references). This is all getting a bit complicated if you are first learning Perl, even more so if you are first learning programming at the same time! I would recommend you first fully understand how to solve this problem with a single hash created via `$changes{$filename}=$updatetype`. Then we can work on converting it to be more robust when there are multiple changes to the same file during your monitoring period.

When you are ready to make your program more robust, you'll need to convert your simple hash to a hash of arrays (see perldsc). You do this by changing the way you update the hash when you get your notifications. I'll give you a brief preview, but don't get too worried if it all sounds Greek. There are a lot of different concepts here. Some people like advance peeks so they can see where they are going. Some people like to understand an easier solution well and then read up on the more complicated solution. Only you know your style. So here is the preview.
Instead of setting `$changes{$filename}=$updatetype`, you will record the changes with the formula `push @{$changes{$filename}}, $updatetype`. This complicated mess tells Perl to check to see if `$changes{$filename}` has a defined value. If it doesn't, Perl has a special process called "autovivification" that says "I can tell from context you need an array reference here, so I'll assign one to `$changes{$filename}` for you". Now that you have an array reference assigned, `@{...}` says "convert the array reference into an array". push adds an element to the end of the array. The net result of all of this is that `$changes{$filename}` contains a list of all of the update actions in the order in which they occurred.

When you are ready to run your report, you will want to skip all of the changes that cancel themselves out. To do that, you would want to write a small function that scans an array reference and checks to see if there are an equal number of create and delete actions. If so, it returns undef. If not, it returns 'delete' or 'create'. Let's call that subroutine `getNetChange`. Once you have written this routine, you would place the statement `next node unless getNetChange($node);` at the top of your foreach loop right before you split up `$node` into pathname components.

One final note: you may want to rethink using notifications to track changes over time. Notifications are designed primarily to trigger actions that will happen immediately after the change. The usual way of tracking changes over an extended period of time is to do a full recursive directory listing on day 1 and then again on day 2. Then you write a script to compare the two listings. Things in the old listing but not the new would be added to a change list with 'delete'. Things in the new listing but not the old would be added with value 'create'. If you did this you could eliminate the need for a HoA (hash of arrays) entirely.

Also, why do you need to write this program?
If you can put these files and directories under version control, a lot of this tracking can be done for you automatically. You'd also have the ability to roll back changes as needed and generate reports. If you are trying to keep two disks in sync, then there is excellent mirroring software that can keep track of changes and only update things that have changed. These may not suit your situation, but if you need to monitor changes to a directory structure, you might want to consider these alternatives.

I don't know if this answers all of your questions (it probably doesn't), but if not, feel free to ask more.

Best, beth

Update: added some questions about the strategy of using notifications to track changes.
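For anyone who wants to see the hash-of-arrays preview in runnable form, here is a minimal sketch. It assumes a hypothetical `@events` list of path and action pairs standing in for real notifications, and it uses the action names 'create' and 'delete'. The `%net` hash and the way `getNetChange` picks the majority action are illustrative choices, not the only way to do it.

```perl
use strict;
use warnings;

# Hypothetical event log: [path, action] pairs in the order they occurred.
my @events = (
    [ 'C:\1\a', 'create' ],
    [ 'C:\1\a', 'delete' ],
    [ 'C:\1\a', 'create' ],
    [ 'C:\1\b', 'create' ],
    [ 'C:\1\b', 'delete' ],
    [ 'C:\1\c', 'delete' ],
);

# Hash of arrays: each path maps to the ordered list of its actions.
# push autovivifies the array reference the first time a path is seen.
my %changes;
push @{ $changes{ $_->[0] } }, $_->[1] for @events;

# Returns 'create' or 'delete' for the net effect of a list of actions,
# or undef (in scalar context) when creates and deletes cancel out.
sub getNetChange {
    my ($actions) = @_;
    my %count = ( create => 0, delete => 0 );
    $count{$_}++ for @$actions;
    return if $count{create} == $count{delete};
    return $count{create} > $count{delete} ? 'create' : 'delete';
}

# Keep only the paths whose changes do not cancel out.
my %net;
for my $path ( sort keys %changes ) {
    my $change = getNetChange( $changes{$path} );
    next unless defined $change;
    $net{$path} = $change;
}

# grep then pulls out one kind of change at a time.
my @created = grep { $net{$_} eq 'create' } sort keys %net;
my @deleted = grep { $net{$_} eq 'delete' } sort keys %net;

print "created: @created\n";
print "deleted: @deleted\n";
```

With this sample data, `C:\1\b` drops out because its create and delete cancel, `C:\1\a` counts as created, and `C:\1\c` counts as deleted.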
Re^4: Comparing strings to reduce excess results by grub_ (Initiate) on Sep 21, 2009 at 05:28 UTC
That is a lot of information, but I fear that I have not explained my problem correctly. I'll try again :). This is some example output: the last list of directories is just a dump of the hash that has all the changes listed, in case it makes it easier to view.

-----------------------------------------------------------
PRIOR RUNTIME WAS Thu Sep 17 16:09:05 2009
-----------------------------------------------------------
###################### THESE DIRECTORIES NO LONGER EXIST
-----------------------------------------------------------
C:\Temp/2/2
C:\Temp/4
C:\Temp/4/1
C:\Temp/4/2
C:\Temp/5/1
C:\Temp/5/1/1
C:\Temp/5/1/2
C:\Temp/5/1/2/1
C:\Temp/5/1/2/2
-----------------------------------------------------------
##################### THESE DIRECTORIES HAVE BEEN CREATED
-----------------------------------------------------------
C:\Temp/4-4
C:\Temp/4-4/1
C:\Temp/4-4/1/New
C:\Temp/4-4/2
C:\Temp/5/2/1
C:\Temp/5/2/1/1
C:\Temp/5/2/1/2
C:\Temp/5/2/1/2/1
C:\Temp/5/2/1/2/2
C:\Temp/New
-----------------------------------------------------------
############## THESE DIRECTORIES MAY HAVE BEEN RE-NAMED OR MOVED #
-----------------------------------------------------------
-----------------------------------------------------------
############################### HAVE A NICE DAY
C:\Temp/2/2
C:\Temp/4
C:\Temp/4-4
C:\Temp/4-4/1
C:\Temp/4-4/1/New
C:\Temp/4-4/2
C:\Temp/4/1
C:\Temp/4/2
C:\Temp/5/1
C:\Temp/5/1/1
C:\Temp/5/1/2
C:\Temp/5/1/2/1
C:\Temp/5/1/2/2
C:\Temp/5/2/1
C:\Temp/5/2/1/1
C:\Temp/5/2/1/2
C:\Temp/5/2/1/2/1
C:\Temp/5/2/1/2/2
C:\Temp/New

I'm pretty happy with the output except I just need to reduce the multiple listings for sub-directories when a parent directory is changed, moved, deleted or whatever. Do you think it is possible to just trim the results for multiple parents?

Thanks so much for your help, I've learnt a lot.
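One possible way to trim a listing like the one above is to adapt the sorted-keys loop from earlier in the thread. This is only a sketch under some assumptions: all changes sit in a single hypothetical `%changes` hash with the values 'created' and 'deleted', and the paths mix `\` and `/` as in the sample output, so the split accepts either separator. The `%top` hash and the `PATH` label are names made up for this example. Comparing whole path components, not raw string prefixes, keeps `C:\Temp/4-4` from being mistaken for a child of `C:\Temp/4`.

```perl
use strict;
use warnings;

# A subset of the sample listing; values mark the kind of change.
my %changes = (
    'C:\Temp/4'         => 'deleted',
    'C:\Temp/4/1'       => 'deleted',
    'C:\Temp/4/2'       => 'deleted',
    'C:\Temp/4-4'       => 'created',
    'C:\Temp/4-4/1'     => 'created',
    'C:\Temp/4-4/1/New' => 'created',
    'C:\Temp/5/2/1'     => 'created',
    'C:\Temp/5/2/1/1'   => 'created',
);

my %top;
PATH:
for my $path ( sort keys %changes ) {
    # Split on either separator, since the listing mixes "\" and "/".
    my @parts  = split m{[\\/]}, $path;
    my $prefix = '';
    for my $part (@parts) {
        # Rebuild each ancestor in a normalized form joined with "/".
        $prefix = length($prefix) ? "$prefix/$part" : $part;
        # An ancestor is already recorded, so this is a sub-directory.
        next PATH if exists $top{$prefix};
    }
    # No ancestor found: this is the topmost changed directory.
    $top{$prefix} = $changes{$path};
}

print "$_ => $top{$_}\n" for sort keys %top;
```

With this sample data only three entries survive: `C:/Temp/4` (deleted), `C:/Temp/4-4` (created) and `C:/Temp/5/2/1` (created). Note that the keys of `%top` come out in the normalized forward-slash form.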