in reply to file comparison using file open in binary mode.

Rather than making a single hash of all files and their MD5s for each directory, it would be more efficient to build a HoA for each directory with the keys being file size and the values being anonymous arrays of files of that size. Then, rather than comparing the MD5 of every file in one array against those in the other, you only need to compare sets of files of the same size; if the sizes differ the files can't be identical, so there's no need to compare them! This sharply reduces the number of comparisons you have to make. You could then pare down each hash, deleting keys that aren't common to both, thereby removing from consideration files that could not possibly be duplicated.

This gives another efficiency gain: an inode lookup to get a file's size is much cheaper than calculating an MD5 sum, so this method means you only have to do the expensive MD5s for sets of same-sized files that appear in both hashes and therefore must be compared.
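As a rough sketch of that approach (the directory layout and names are hypothetical, and `files_by_size`/`prune_uncommon_sizes` are just illustrative helpers, not anything from the original code), the size buckets for two directories might be built and pruned like this using only core Perl:

```perl
use strict;
use warnings;
use File::Spec;

# Build a HoA for one directory: key = file size in bytes,
# value = anonymous array of filenames of that size.
sub files_by_size {
    my ($dir) = @_;
    my %bucket;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    for my $file ( readdir $dh ) {
        my $path = File::Spec->catfile( $dir, $file );
        next unless -f $path;                  # skip '.', '..' and subdirs
        push @{ $bucket{ -s $path } }, $file;  # -s is a cheap inode lookup
    }
    closedir $dh;
    return \%bucket;
}

# Delete sizes that do not occur in both buckets: a file whose size
# appears on only one side cannot have a duplicate on the other, so
# it never needs an MD5 calculated for it.
sub prune_uncommon_sizes {
    my ($bucketA, $bucketB) = @_;
    delete @{ $bucketA }{ grep { !exists $bucketB->{$_} } keys %$bucketA };
    delete @{ $bucketB }{ grep { !exists $bucketA->{$_} } keys %$bucketB };
}
```

After pruning, only the sizes left in both hashes need MD5 comparisons.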

Once you reach the comparison stage you could process the hashes a size at a time, following zwon's idea of creating hashes keyed by MD5 with anonymous arrays of filenames as the value. Or perhaps a HoHoA structure, something like

%fileset = (
    '976ed3393d1e967b2d8b4432c92b1397' => {
        'dirA' => [ 'fileA', 'fileC', ],
        'dirB' => [ 'fileX', ],
    },
    'dc92b13976ed67b1e98b44322d339397' => {
        'dirA' => [ 'fileG', ],
    },
);
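One way to populate such a structure is with the core Digest::MD5 module, reading each file in binary mode as in the original problem. The directory and file names below are hypothetical, and the loop only visits directories that actually exist, so this is a sketch rather than a drop-in solution:

```perl
use strict;
use warnings;
use Digest::MD5;

# Compute the MD5 hex digest of a file opened in binary (raw) mode.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $md5 = Digest::MD5->new->addfile( $fh )->hexdigest;
    close $fh;
    return $md5;
}

# Build the HoHoA: MD5 => directory => [ filenames ].
# %filesOfSize stands in for one same-size bucket from the earlier
# pruning step; the names here are made up for illustration.
my %filesOfSize = (
    'dirA' => [ 'fileA', 'fileC', 'fileG' ],
    'dirB' => [ 'fileX' ],
);

my %fileset;
for my $dir ( grep { -d } keys %filesOfSize ) {  # skip dirs that don't exist
    for my $file ( @{ $filesOfSize{ $dir } } ) {
        my $md5 = md5_of_file( "$dir/$file" );
        push @{ $fileset{ $md5 }{ $dir } }, $file;
    }
}
```

Any MD5 key that ends up with entries under more than one directory, or more than one filename, identifies a set of duplicate files.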

I hope these ideas are helpful.

Cheers,

JohnGG

Update: Augmented the language in the first paragraph to make it clearer that it is the file size that determines whether a comparison is necessary or not.

Re^2: file comparison using file open in binary mode.
by Karger78 (Beadle) on Nov 30, 2009 at 18:13 UTC
    JohnGG, thanks for the idea. However, this has to work on a file-by-file basis because a file could have been renamed while its size stays the same. This is just a small application; no, it won't be doing a massive number of files. This is what I have come up with thus far. I build the hashes, which works great, but I am still stumped on the compare. First I tried to figure out which hash is bigger, so I could use it in the foreach loop and go through all the files. However, this is still not working. There must be an easy way that I am missing to compare two hashes (specifically the values) and add the differences to another hash/array that I could use.
    my %hash1;
    my %hash2;
    my $hash1Count = 0;
    my $hash2Count = 0;

    foreach my $FL ( @remoteFilelist ) {
        my $md5 = md5sum( $FL );    # compute once, reuse
        push @md51, $md5;
        $hash1{ $FL } = $md5;
        $hash1Count++;
    }
    foreach my $FL2 ( @return ) {
        my $md5 = md5sum( $logSite . $FL2 );
        push @md52, $md5;
        $hash2{ $logSite . $FL2 } = $md5;
        $hash2Count++;
    }

    if ( $hash1Count >= $hash2Count ) {
        foreach my $key ( keys %hash1 ) {
            if ( !exists $hash2{ $key } ) {
                # $key is absent from %hash2, so its value must
                # come from %hash1 (the original looked it up in
                # %hash2, which always gave undef)
                push @finalCompareArray, $hash1{ $key };
            }
        }
    }
    else {
        foreach my $key ( keys %hash2 ) {
            if ( !exists $hash1{ $key } ) {
                push @finalCompareArray, $hash2{ $key };
            }
        }
    }