In reply to file comparison using file open in binary mode.
Rather than making a single hash of all files and their MD5s for each directory, it would be more efficient to build a HoA for each directory, with the keys being file sizes and the values being anonymous arrays of the files of that size. Then, rather than comparing MD5s of every file in one array against those in the other, you need only compare sets of files of the same size; if the sizes differ the files can't be identical, so there's no need to compare them at all. This sharply reduces the number of comparisons you have to make. You could then pare down each hash, removing keys that aren't common to both, so that files which could not possibly be duplicated are dropped from consideration; a sketch of this follows.
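A minimal sketch of that approach, assuming two directory names ('dirA' and 'dirB') and a buildSizeHash() helper invented here for illustration:

#!/usr/bin/perl
use strict;
use warnings;

# Build a HoA keyed by file size: each key is a size in bytes,
# each value an anonymous array of the plain files of that size.
sub buildSizeHash {
    my $dir = shift;
    my %sizeOf;

    opendir my $dh, $dir or die "Can't open $dir: $!";
    for my $file ( readdir $dh ) {
        my $path = "$dir/$file";
        next unless -f $path;                  # plain files only
        push @{ $sizeOf{ -s $path } }, $file;  # -s is just a stat
    }
    closedir $dh;

    return \%sizeOf;
}

my $dirA = buildSizeHash( 'dirA' );
my $dirB = buildSizeHash( 'dirB' );

# Pare down: a size present in only one directory can't yield
# duplicates across the pair, so drop those keys from both hashes.
for my $size ( keys %$dirA ) {
    delete $dirA->{ $size } unless exists $dirB->{ $size };
}
for my $size ( keys %$dirB ) {
    delete $dirB->{ $size } unless exists $dirA->{ $size };
}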
This gives a further efficiency gain because an inode lookup to get a file's size is much cheaper than calculating an MD5 sum, so this method means you only have to do the expensive MD5s on sets of same-sized files that genuinely need comparing.
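If you want to see the cost gap for yourself, a Benchmark comparison along these lines should show it (the file name is a stand-in; any reasonably large file will do):

use strict;
use warnings;
use Benchmark qw( cmpthese );
use Digest::MD5;

my $path = 'somefile.dat';    # hypothetical test file

cmpthese( -2, {
    size => sub {
        my $bytes = -s $path;    # just one stat call
    },
    md5  => sub {
        open my $fh, '<:raw', $path or die "Can't read $path: $!";
        my $sum = Digest::MD5->new->addfile( $fh )->hexdigest;
        close $fh;
    },
} );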
Once you reach the comparison stage you could process the hashes one size at a time, following zwon's idea of creating hashes keyed by MD5 with anonymous arrays of filenames as the values. Or perhaps a HoHoA structure, something like
%fileset = (
    '976ed3393d1e967b2d8b4432c92b1397' => {
        'dirA' => [ 'fileA', 'fileC', ],
        'dirB' => [ 'fileX', ],
    },
    'dc92b13976ed67b1e98b44322d339397' => {
        'dirA' => [ 'fileG', ],
    },
);
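Something along these lines might populate that structure from the pruned size-keyed hashes in the earlier sketch ($dirA and $dirB carry over from there; md5Of() is another hypothetical helper):

use Digest::MD5;

# Hex MD5 of a file's contents, read in binary mode.
sub md5Of {
    my $path = shift;
    open my $fh, '<:raw', $path or die "Can't read $path: $!";
    my $sum = Digest::MD5->new->addfile( $fh )->hexdigest;
    close $fh;
    return $sum;
}

my %fileset;
for my $pair ( [ dirA => $dirA ], [ dirB => $dirB ] ) {
    my( $dirName, $sizeOf ) = @$pair;
    for my $size ( keys %$sizeOf ) {
        for my $file ( @{ $sizeOf->{ $size } } ) {
            push @{ $fileset{ md5Of( "$dirName/$file" ) }{ $dirName } },
                $file;
        }
    }
}

# Any MD5 whose inner hash has entries under both directory keys
# points at files duplicated across the pair.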
I hope these ideas are helpful.
Cheers,
JohnGG
Update: Augmented the language in the first paragraph to make it clearer that it is the file size that determines whether a comparison is necessary or not.
Re^2: file comparison using file open in binary mode.
by Karger78 (Beadle) on Nov 30, 2009 at 18:13 UTC