Rather than building a simple hash of file names and their MD5s for each directory, it would be more efficient to build a HoA (hash of arrays) for each directory, with file sizes as the keys and anonymous arrays of the files of that size as the values. Then, rather than comparing the MD5 of every file in one directory against every file in the other, you only need to compare sets of files of the same size; if the sizes differ the files can't be identical, so there's no need to compare them! This sharply reduces the number of comparisons you have to make. You could also pare down each hash by deleting the keys that are not common to both, which removes from consideration any file that could not possibly be duplicated.

This gives another efficiency gain: an inode lookup to get the size of a file is much cheaper than calculating an MD5 sum, so with this method you only do the expensive MD5s for files whose size occurs in both hashes.
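The size-keyed HoA and the paring-down step could be sketched roughly as follows. This is only a minimal sketch: the sub names files_by_size and prune_to_common_sizes are made up for illustration, and it assumes you only want plain files directly inside each directory, with no recursion.

```perl
use strict;
use warnings;

# Build a hash of arrays for one directory: file size => [ file names ].
# Assumption: only plain files directly inside $dir are considered.
sub files_by_size {
    my ($dir) = @_;
    my %by_size;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    while ( defined( my $name = readdir $dh ) ) {
        my $path = "$dir/$name";
        next unless -f $path;          # skips '.', '..' and subdirectories
        my $size = ( stat _ )[7];      # reuse the stat done by -f
        push @{ $by_size{$size} }, $name;
    }
    closedir $dh;
    return \%by_size;
}

# Delete sizes that occur in only one of the two hashes; files of those
# sizes cannot have a duplicate in the other directory.
sub prune_to_common_sizes {
    my ( $sizes_a, $sizes_b ) = @_;
    for my $size ( keys %$sizes_a ) {
        delete $sizes_a->{$size} unless exists $sizes_b->{$size};
    }
    for my $size ( keys %$sizes_b ) {
        delete $sizes_b->{$size} unless exists $sizes_a->{$size};
    }
    return;
}
```

After calling files_by_size on each directory and then prune_to_common_sizes on the two results, both hashes have the same set of keys, and only the files left in them need an MD5.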

Once you reach the comparison stage you could process the hashes one size at a time, following zwon's idea of creating hashes keyed by MD5 with anonymous arrays of filenames as the values. Or perhaps use a HoHoA structure, something like:

%fileset = (
    '976ed3393d1e967b2d8b4432c92b1397' => {
        'dirA' => [ 'fileA', 'fileC', ],
        'dirB' => [ 'fileX', ],
    },
    'dc92b13976ed67b1e98b44322d339397' => {
        'dirA' => [ 'fileG', ],
    },
);
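Filling a structure like that for one set of same-size files might look something like this. Again it is just a sketch: the sub names md5_of_file, add_to_fileset and duplicated_across_dirs are invented for the example, and it assumes the directory keys are usable as path prefixes. Digest::MD5 is a core module.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hex MD5 of a file's contents, read in binary mode.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# $candidates is { dir => [ file names ] } for files of one size.
# Each file is added to %$fileset as MD5 => { dir => [ file names ] }.
sub add_to_fileset {
    my ( $fileset, $candidates ) = @_;
    for my $dir ( sort keys %$candidates ) {
        for my $name ( @{ $candidates->{$dir} } ) {
            my $md5 = md5_of_file("$dir/$name");
            push @{ $fileset->{$md5}{$dir} }, $name;
        }
    }
    return;
}

# Digests that turn up in more than one directory.
sub duplicated_across_dirs {
    my ($fileset) = @_;
    return grep { scalar( keys %{ $fileset->{$_} } ) > 1 }
        sort keys %$fileset;
}
```

You would call add_to_fileset once per common size and then look at what duplicated_across_dirs returns. A matching MD5 only tells you the files are very probably identical, so a byte-for-byte comparison of those few files would settle it if you need to be certain.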

I hope these ideas are helpful.

Cheers,

JohnGG

Update: Augmented the language in the first paragraph to make it clearer that it is the file size that determines whether a comparison is necessary or not.


In reply to Re: file comparison using file open in binary mode. by johngg
in thread file comparison using file open in binary mode. by Karger78
