in reply to Finding Redundant Files

Compute a checksum for each file, and use the checksum as a key into a hash. The value of the hash will be an array (reference) storing the filenames that match that checksum. Then you can compare the contents of all the files with the same checksum.
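
Something along these lines might do it. This is only a rough sketch, assuming MD5 (via the core Digest::MD5 module) as the checksum and File::Find to walk the directories; the sub names `group_by_checksum` and `duplicate_groups` are made up for illustration.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Find;

# Group every plain file under the given directories by the MD5 digest
# of its contents. Returns a hash ref: digest => [ paths with that digest ].
sub group_by_checksum {
    my @dirs = @_;
    my %by_md5;
    find(
        {
            no_chdir => 1,    # keep paths usable from the starting directory
            wanted   => sub {
                my $path = $File::Find::name;
                return unless -f $path;
                open my $fh, '<', $path or do {
                    warn "Cannot open $path: $!\n";
                    return;
                };
                binmode $fh;
                my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
                close $fh;
                push @{ $by_md5{$digest} }, $path;
            },
        },
        @dirs
    );
    return \%by_md5;
}

# Keep only the groups in which more than one file shares a digest.
sub duplicate_groups {
    my ($by_md5) = @_;
    return grep { @$_ > 1 } values %$by_md5;
}

# Report candidates when directories are given on the command line.
if (@ARGV) {
    for my $group ( duplicate_groups( group_by_checksum(@ARGV) ) ) {
        print "Same checksum:\n", map { "  $_\n" } sort @$group;
    }
}
```

Each group it prints is only a set of candidates. To confirm that two files really are identical, you could run the core File::Compare module's `compare` on each pair within a group.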

The PerlMonk tr/// Advocate

Re: Re: Finding Redundant Files
by Limbic~Region (Chancellor) on Feb 06, 2004 at 19:19 UTC
    Roy Johnson,
    Great idea. I would just expand it a little more. I would have a structure that looked like this:
    my %mp3 = ( byname => {}, bymd5 => {} );
    Again, as you stated, each key in the second-level hashes would point to an array reference holding the list of matching files. The difference here is that you will also get a list of files that share a name across directories but may not be the same song. Those can cause problems when you try to merge the directories. I would suggest the following modules:

    Cheers - L~R
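
    For what it's worth, the two-index structure above could be filled roughly like this. It is a sketch only: `index_files` and `name_clashes` are hypothetical names, MD5 comes from the core Digest::MD5 module, and the list of paths is assumed to have been collected already.

    ```perl
    use strict;
    use warnings;
    use Digest::MD5;
    use File::Basename qw(basename);

    # Index the given paths twice: by bare file name and by MD5 of the contents.
    # Every second-level value is an array ref of matching paths.
    sub index_files {
        my @paths = @_;
        my %mp3 = ( byname => {}, bymd5 => {} );
        for my $path (@paths) {
            push @{ $mp3{byname}{ basename($path) } }, $path;
            open my $fh, '<', $path or do {
                warn "Cannot open $path: $!\n";
                next;
            };
            binmode $fh;
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            push @{ $mp3{bymd5}{$digest} }, $path;
        }
        return \%mp3;
    }

    # File names that occur more than once but with differing contents,
    # i.e. the ones likely to cause trouble when merging directories.
    sub name_clashes {
        my ($mp3) = @_;
        my %md5_of;    # path => digest
        for my $digest ( keys %{ $mp3->{bymd5} } ) {
            $md5_of{$_} = $digest for @{ $mp3->{bymd5}{$digest} };
        }
        my @clashes;
        for my $name ( sort keys %{ $mp3->{byname} } ) {
            my %seen = map { $_ => 1 }
                       grep { defined }
                       map  { $md5_of{$_} } @{ $mp3->{byname}{$name} };
            push @clashes, $name if scalar( keys %seen ) > 1;
        }
        return @clashes;
    }
    ```

    Groups under `bymd5` with more than one path are the likely true duplicates, while the names returned by `name_clashes` need a decision before any merge.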

      Because tags are stored inside the mp3 files, md5 checksums of the whole file will not help when the tags differ. For example, if one copy of a song has the title tag "Yellow Sub" and another has "Yellow submarine", an md5 hash will show the two files as different even if the audio data portion of the mp3 is exactly the same. I would suggest using tag matching for exact duplicates, and maybe a hash table keyed on soundex (or some variant) of each tag to get a list of possible dups that you can weed through by hand.
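
      A rough sketch of the soundex idea, for illustration. It assumes the title tags have already been read into a hash of filename => title (reading the tags themselves is left out), and it uses a small hand-rolled American Soundex so that nothing beyond core Perl is needed. `soundex_key` and `group_by_soundex` are made-up names.

      ```perl
      use strict;
      use warnings;

      # Classic American Soundex over the letters of a whole string:
      # first letter plus three digits, e.g. "Robert" gives "R163".
      sub soundex_key {
          my ($text) = @_;
          ( my $letters = uc $text ) =~ tr/A-Z//cd;    # keep ASCII letters only
          return '' unless length $letters;
          my $first = substr $letters, 0, 1;
          # Digit per letter A..Z; 0 marks vowels and Y, 9 marks H and W.
          ( my $codes = $letters ) =~ tr/A-Z/01230129022455012623019202/;
          my $head = substr $codes, 0, 1;
          my $rest = substr $codes, 1;
          $rest =~ tr/9//d;              # H and W do not separate equal codes
          my $all = $head . $rest;
          $all =~ tr/0-9//s;             # collapse runs of the same code
          $all = substr $all, 1;         # the first letter is kept as a letter
          $all =~ tr/0//d;               # vowels only act as separators
          return $first . substr( $all . '000', 0, 3 );
      }

      # Group filenames by the soundex key of their title tag.
      # Takes filename => title pairs; returns key => [ filenames ].
      sub group_by_soundex {
          my (%title_of) = @_;
          my %candidates;
          for my $file ( sort keys %title_of ) {
              push @{ $candidates{ soundex_key( $title_of{$file} ) } }, $file;
          }
          return \%candidates;
      }
      ```

      With this, "Yellow Sub" and "Yellow submarine" both come out as Y421 and so land in the same bucket. Soundex over a whole title is crude, though, since only the first few consonant sounds count. Expect false positives, which is why the list still needs checking by hand.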


      -Waswas