Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

So I have a mass of mp3 files, in an unorganized collection of sub-directories. I know that I have multiple copies of the same songs in different places. Usually with the same file name, but not always. My goal then is to find duplicate copies of the same file in a directory hierarchy. The easy version would just compare file names, whereas the harder version would do a bit-wise comparison. I'm not even sure that a bit-wise comparison would work with MP3 (given the potential for tags and what not) but hey, why not try? My first thought then was that I would read the file structure into a hash and look for duplicates... but I'm not sure how to go about doing this intelligently. Can anyone help?

Re: Finding Redundant Files
by Roy Johnson (Monsignor) on Feb 06, 2004 at 18:59 UTC
    Compute a checksum for each file, and use the checksum as a key into a hash. The value of the hash will be an array (reference) storing the filenames that match that checksum. Then you can compare the contents of all the files with the same checksum.
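
    A minimal sketch of that approach, assuming the core modules File::Find and Digest::MD5 (the starting directory is just a placeholder):

        use strict;
        use warnings;
        use File::Find;
        use Digest::MD5;

        my %by_md5;    # checksum => array ref of matching file names

        # Walk the tree, computing an MD5 checksum for each file
        find(sub {
            return unless -f;
            open my $fh, '<', $_ or return;
            binmode $fh;
            push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } },
                 $File::Find::name;
        }, '/path/to/mp3s');    # placeholder directory

        # Only checksums shared by two or more files are candidates
        for my $list (grep { @$_ > 1 } values %by_md5) {
            print join("\n  ", 'Possible duplicates:', @$list), "\n";
        }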

    The PerlMonk tr/// Advocate
      Roy Johnson,
      Great idea. I would just expand it a little more. I would have a structure that looked like this:
      my %mp3 = ( byname => {}, bymd5 => {} );
      Again, as you stated, each key in the second-level hash would be an array reference holding the list of matching files. The difference here is that you will also get a list of duplicate file names in different directories that may not be the same song, which can cause problems when you try to merge the directories. For the traversal and hashing I would suggest File::Find and Digest::MD5.
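
      A sketch of filling that structure in one pass (same assumptions as above: the two core modules plus a placeholder directory):

          use strict;
          use warnings;
          use File::Find;
          use Digest::MD5;

          my %mp3 = ( byname => {}, bymd5 => {} );

          find(sub {
              return unless -f && /\.mp3\z/i;
              open my $fh, '<', $_ or return;
              binmode $fh;
              my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
              push @{ $mp3{byname}{ lc $_ } }, $File::Find::name;
              push @{ $mp3{bymd5}{$sum}    }, $File::Find::name;
          }, '/path/to/mp3s');    # placeholder directory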

      Cheers - L~R

        Because tags are stored inside the MP3 files themselves, MD5 checksums of the whole file will not help unless you can check for duplicates via the tags. For example, if one copy has the title tag "Yellow Sub" and another has "Yellow Submarine", an MD5 hash will report the two files as different even if the actual audio data is byte-for-byte identical. I would suggest tag matching for exact duplicates, and maybe a hash table keyed on Soundex (or some variant) of each tag to get a list of possible dups that you can weed through by hand.
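
        A rough sketch of the Soundex idea, assuming MP3::Info (CPAN) for reading the title tag and Text::Soundex for the phonetic code; conveniently, "Yellow Sub" and "Yellow Submarine" should both reduce to the same code (Y421), so they would land in the same bucket:

            use strict;
            use warnings;
            use MP3::Info;        # get_mp3tag() - CPAN
            use Text::Soundex;    # soundex()

            my %maybe_dup;
            for my $file (@ARGV) {
                my $tag = get_mp3tag($file) or next;    # skip untagged files
                my $title = $tag->{TITLE};
                next unless defined $title && $title =~ /[A-Za-z]/;
                my $code = soundex($title);
                next unless defined $code;
                push @{ $maybe_dup{$code} }, $file;
            }

            # Buckets with more than one file are candidates to weed by hand
            for my $code (sort keys %maybe_dup) {
                my @files = @{ $maybe_dup{$code} };
                print "$code:\n  ", join("\n  ", @files), "\n" if @files > 1;
            }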


        -Waswas
Re: Finding Redundant Files
by Willard B. Trophy (Hermit) on Feb 06, 2004 at 20:19 UTC
    You could, for each file, make a copy with all tags stripped off, then build a hash keyed on the md5sum of the cleaned copy, mapping back to the real file name. Any files with the same md5sum would have an identical MP3 audio stream.

    'id3v2 --delete-all' might help with this.
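
    Here is a rough sketch of doing the stripping in memory instead of via an external copy. It handles only a leading ID3v2 block and a trailing ID3v1 block (no ID3v2 footers or APE tags), and audio_md5 is just an illustrative helper name, so treat it as an approximation:

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        # MD5 of the audio stream only, ignoring the common ID3 tags
        sub audio_md5 {
            my ($file) = @_;
            open my $fh, '<', $file or return;
            binmode $fh;
            my $data = do { local $/; <$fh> };

            # ID3v2: "ID3", 2 version bytes, 1 flag byte, 4 synchsafe size bytes
            if ($data =~ /\AID3/) {
                my @s   = unpack 'x6 C4', $data;
                my $len = ($s[0] << 21) | ($s[1] << 14) | ($s[2] << 7) | $s[3];
                substr($data, 0, 10 + $len) = '';
            }

            # ID3v1: trailing 128-byte block starting with "TAG"
            substr($data, -128) = ''
                if length($data) > 128 && substr($data, -128, 3) eq 'TAG';

            return md5_hex($data);
        }

    Any two files for which this returns the same digest should share an identical audio stream, however they are tagged.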

    --
    bowling trophy thieves, die!

Re: Finding Redundant Files
by arden (Curate) on Feb 06, 2004 at 19:23 UTC
    I think to be as accurate as possible, you're going to have to go through a few passes with this program. First, as Roy Johnson put it, compute a checksum on every file and compare them: any files with duplicate checksums are truly identical. Next, compare the file names to find potential duplicates that may be untagged or inaccurately tagged. Finally, compare the ID3 tags to find duplicate copies of songs that have different file names and are slightly different (different remixes, missing the last few seconds, different bit rates, etc.).

    Most importantly, I think you need to have your code output a file for a human to review, not do the deleting itself. If/when you complete it, I'd like to suggest that you post it on PM. I'm sure there are hundreds of others who could benefit from that!

Re: Finding Redundant Files
by Zero_Flop (Pilgrim) on Feb 07, 2004 at 06:55 UTC
    Do a search for MP3::Tag and do a tag comparison.

    Comparing MD5 hashes will identify bit-wise identical files, but if two files are bit-wise identical their tags are identical too, so the MD5 check would be redundant. If the tags were hand-entered there may be some spelling errors, but if they were pulled from the Net they should be pretty consistent.

    Pull the tag fields, then normalize them by converting everything to CAPS. Also capture the size of each file in your hash; the larger the file, the higher the bit rate (probably).

    Now you can get rid of all of the dups while keeping the highest-quality copy, and then rename the files to a consistent nomenclature; a sketch of the comparison pass follows below.
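
    A sketch of that pass, assuming the CPAN module MP3::Tag (its autoinfo() method returns title, track, artist, and album in list context). The duplicate handling here only prints, leaving any deleting and renaming to a human:

        use strict;
        use warnings;
        use MP3::Tag;    # CPAN

        my %best;    # "ARTIST|TITLE" => [ size, file ] of the largest copy seen
        for my $file (@ARGV) {
            my $mp3 = MP3::Tag->new($file) or next;
            my ($title, undef, $artist) = $mp3->autoinfo();
            next unless defined $title && length $title;

            my $key  = uc join '|', ($artist // ''), $title;    # normalize to CAPS
            my $size = -s $file;    # bigger file => probably higher bit rate

            if (!$best{$key} or $size > $best{$key}[0]) {
                print "dup (smaller): $best{$key}[1]\n" if $best{$key};
                $best{$key} = [ $size, $file ];
            }
            else {
                print "dup (smaller): $file\n";
            }
        }
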
Re: Finding Redundant Files
by zentara (Cardinal) on Feb 06, 2004 at 23:59 UTC
    I've been using this dupfinder script to clean my MIDI file collection. It works great, but it doesn't recurse into subdirectories; you could modify it. Merlyn has some code on the following page too, which would give you a good start.

    dupfinder and dupseek.

      Being the original author, I am glad of it :-)

      You can download the latest version from my web site. It only finds exact duplicates, so you could use something along the lines of the trick suggested by Willard B. Trophy to avoid comparing tag info.

      As a final note, dupseek does indeed recurse subdirectories, using File::Find.

      Cheers

      Antonio

      The stupider the astronaut, the easier it is to win the trip to Vega - A. Tucket