
I tackled a new (to me) problem today and I thought I'd share my solution. I've never dealt with this particular problem before and I don't remember reading about it. I'd be interested to hear if anyone has solved it in a different way or can think of a better solution.

Problem: I have 20 directories each containing around 50 files. At one point in the past these directories all contained the same 50 files with the same contents. Since then changes have been made, files have been added and files have been deleted. I need to get an overview of which files have changed and which files are still the same. I don't have a copy of the original set of files, all I have is the current set of 20 directories.

My first thought was to use diff, but I quickly realized I'd be drowning in data. I don't need to look at the minutiae yet; I just need to get an idea of how much variation there is.

My next idea, which I implemented, was to generate a table like:

        dir1  dir2  dir3
file1   A     A     A
file2   A     B     C
file3   A

In the above table all three directories have identical copies of file1, and all have different copies of file2. Only dir1 has a file3.

I decided that instead of generating an HTML table I'd produce a CSV and load that in OpenOffice Calc (and when that crashed on me, Excel). Then I can produce pretty output for the managers to explain fully why making a "simple" change across all these directories will take so long.
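One caveat when writing CSV by hand: a file name containing a comma or a double quote will shift the columns once the spreadsheet loads it. Below is a minimal sketch of the usual quoting rule, assuming a hypothetical helper named csv_row that is not part of the original script. For anything more involved, the Text::CSV module on CPAN is probably the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: join fields into one CSV line, quoting any field
# that contains a comma, a double quote, or a line break.
sub csv_row {
    my @fields = @_;    # copy, so the caller's values are left alone
    foreach my $field (@fields) {
        $field = '' unless defined $field;
        if ($field =~ /[",\r\n]/) {
            $field =~ s/"/""/g;          # double any embedded quotes
            $field = qq("$field");       # then wrap the field in quotes
        }
    }
    return join(',', @fields);
}

# Header row with an empty first cell, then a row with an awkward name
print csv_row('', 'cms1', 'cms2'), "\n";
print csv_row('odd,name.tmpl', 'A', 'B'), "\n";
```

The second line comes out as "odd,name.tmpl",A,B, so the name stays in a single cell.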

To do the actual comparisons I used Digest::MD5 to compute MD5 sums for each file. Then I produced the letter designations by creating a hash of MD5s for each row. Here's the code I used:

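#!/usr/bin/perl -w
use strict;
use Digest::MD5 qw(md5_hex);

# target directories; skip anything matching cms* that isn't a directory
my @dirs = sort grep { -d } glob("cms*");
my %files;    # $files{$file}{$dir} = MD5 of that file in that directory

foreach my $dir (@dirs) {
    chdir($dir) or die "Can't chdir to $dir: $!";
    my @files = sort (glob("*.tmpl"), glob("*.pl"));
    foreach my $file (@files) {
        open(my $fh, '<', $file) or die "Can't open $dir/$file: $!";
        binmode($fh);
        my $text = join('', <$fh>);
        close($fh);
        my $md5 = md5_hex($text);
        $files{$file}{$dir} = $md5;
    }
    chdir('..') or die "Can't chdir back from $dir: $!";
}

# header row: empty first cell, then one column per directory
print ',', join(', ', @dirs), "\n";
foreach my $file (sort keys %files) {
    my %key;           # MD5 => letter, reset for every row
    my $next = 'A';    # next unused letter
    my @row = $file;
    foreach my $dir (@dirs) {
        my $md5 = $files{$file}{$dir};
        if ($md5) {
            # reuse the letter for a known MD5, otherwise hand out the next one
            push(@row, ((exists $key{$md5}) ? 
                        ($key{$md5}) : ($key{$md5} = $next++)));
        } else {
            # file is missing from this directory
            push(@row, '');
        }
    }
    print join(', ', @row), "\n";
}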

Note that the code assumes a couple things peculiar to my problem - the target directories start with "cms" and the files I'm interested in end in ".pl" and ".tmpl".
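If those assumptions don't fit, the collection step could be parameterized. The following is only a sketch: collect_digests is a made-up name, and passing the directory list and glob patterns as array references is an assumption, not how the script above does it. It also uses Digest::MD5's addfile method, which reads from the filehandle in chunks instead of loading each file into memory.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Basename qw(basename);

# Hypothetical helper: returns { file name => { dir => md5 hex } } for every
# plain file in @$dirs that matches one of the glob patterns in @$patterns.
# Caveat: glob() splits its argument on whitespace, so directory names
# containing spaces would need extra quoting.
sub collect_digests {
    my ($dirs, $patterns) = @_;
    my %files;
    foreach my $dir (@$dirs) {
        foreach my $pattern (@$patterns) {
            foreach my $path (glob("$dir/$pattern")) {
                next unless -f $path;
                open(my $fh, '<', $path) or die "Can't open $path: $!";
                binmode($fh);
                # addfile consumes the handle; hexdigest returns the MD5 as hex
                $files{ basename($path) }{$dir} =
                    Digest::MD5->new->addfile($fh)->hexdigest;
                close($fh);
            }
        }
    }
    return \%files;
}
```

Something like collect_digests([sort glob("cms*")], ['*.tmpl', '*.pl']) would then gather the same data as the loop above, without the chdir calls.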

The end result provided me with a number of useful insights into the overall variety of the data. I can now apply diff to examine particular changes and use the formatted output to justify my estimates.
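To decide which copies are worth diffing, it may help to group the directories by identical content, since one representative per group is enough. Here is a rough sketch, assuming a hypothetical groups_for helper that takes the directory-to-MD5 hash for a single file; the digests in the example are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: given { dir => md5 } for one file, return a list of
# groups (array refs of directory names) whose copies are identical.
# Call it in list context.
sub groups_for {
    my ($by_dir) = @_;
    my %group;
    foreach my $dir (sort keys %$by_dir) {
        push @{ $group{ $by_dir->{$dir} } }, $dir;
    }
    # order groups by their first directory so the output is stable
    return sort { $a->[0] cmp $b->[0] } values %group;
}

# Made-up digests, for illustration only
my %example = (cms1 => 'aaa', cms2 => 'bbb', cms3 => 'aaa');
my @groups = groups_for(\%example);

# One representative per group: these are the copies to feed to diff
print join(' ', map { $_->[0] } @groups), "\n";
```

With the example data this prints cms1 cms2, meaning cms3 holds the same copy as cms1 and only two versions need comparing.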

-sam

In reply to Multi-directory Change Reporter by samtregar
