Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

So I have a mass of mp3 files, in an unorganized collection of sub-directories. I know that I have multiple copies of the same songs in different places. Usually with the same file name, but not always. My goal then is to find duplicate copies of the same file in a directory hierarchy. The easy version would just compare file names, whereas the harder version would do a bit-wise comparison. I'm not even sure that a bit-wise comparison would work with MP3 (given the potential for tags and what not) but hey, why not try? My first thought then was that I would read the file structure into a hash and look for duplicates... but I'm not sure how to go about doing this intelligently. Can anyone help?

Re: Finding Redundant Files
by Roy Johnson (Monsignor) on Feb 06, 2004 at 18:59 UTC
    Compute a checksum for each file, and use the checksum as a key into a hash. The value of the hash will be an array (reference) storing the filenames that match that checksum. Then you can compare the contents of all the files with the same checksum.
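
    A minimal sketch of that approach, assuming the core modules File::Find and Digest::MD5 (the starting directory is just a placeholder):

        use strict;
        use warnings;
        use File::Find;
        use Digest::MD5;

        my %by_md5;    # checksum => array ref of matching file names

        # Walk the tree, computing an MD5 checksum for each file
        find(sub {
            return unless -f;
            open my $fh, '<', $_ or return;
            binmode $fh;
            push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } },
                 $File::Find::name;
        }, '/path/to/mp3s');    # placeholder directory

        # Only checksums shared by two or more files are candidates
        for my $list (grep { @$_ > 1 } values %by_md5) {
            print join("\n  ", 'Possible duplicates:', @$list), "\n";
        }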

    The PerlMonk tr/// Advocate
      Roy Johnson,
      Great idea. I would just expand it a little more. I would have a structure that looked like this:
      my %mp3 = ( byname => {}, bymd5 => {} );
      Again, as you stated, each key in the second-level hash would be an array reference holding the list of matching files. The difference here is that you will also get a list of duplicate file names in different directories that may not be the same song, which can cause problems when you try to merge the directories. For the traversal and hashing I would suggest File::Find and Digest::MD5.
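
      A sketch of filling that structure in one pass (same assumptions as above: the two core modules plus a placeholder directory):

          use strict;
          use warnings;
          use File::Find;
          use Digest::MD5;

          my %mp3 = ( byname => {}, bymd5 => {} );

          find(sub {
              return unless -f && /\.mp3\z/i;
              open my $fh, '<', $_ or return;
              binmode $fh;
              my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
              push @{ $mp3{byname}{ lc $_ } }, $File::Find::name;
              push @{ $mp3{bymd5}{$sum}    }, $File::Find::name;
          }, '/path/to/mp3s');    # placeholder directory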

      Cheers - L~R

        Because tags are stored inside the MP3 files themselves, MD5 checksums of the whole file will not help unless you can check for duplicates via the tags. For example, if one copy has the title tag "Yellow Sub" and another has "Yellow Submarine", an MD5 hash will report the two files as different even if the actual audio data is byte-for-byte identical. I would suggest tag matching for exact duplicates, and maybe a hash table keyed on Soundex (or some variant) of each tag to get a list of possible dups that you can weed through by hand.
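
        A rough sketch of the Soundex idea, assuming MP3::Info (CPAN) for reading the title tag and Text::Soundex for the phonetic code; conveniently, "Yellow Sub" and "Yellow Submarine" should both reduce to the same code (Y421), so they would land in the same bucket:

            use strict;
            use warnings;
            use MP3::Info;        # get_mp3tag() - CPAN
            use Text::Soundex;    # soundex()

            my %maybe_dup;
            for my $file (@ARGV) {
                my $tag = get_mp3tag($file) or next;    # skip untagged files
                my $title = $tag->{TITLE};
                next unless defined $title && $title =~ /[A-Za-z]/;
                my $code = soundex($title);
                next unless defined $code;
                push @{ $maybe_dup{$code} }, $file;
            }

            # Buckets with more than one file are candidates to weed by hand
            for my $code (sort keys %maybe_dup) {
                my @files = @{ $maybe_dup{$code} };
                print "$code:\n  ", join("\n  ", @files), "\n" if @files > 1;
            }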


        -Waswas
Re: Finding Redundant Files
by Willard B. Trophy (Hermit) on Feb 06, 2004 at 20:19 UTC
    You could, for each file, make a copy with all tags stripped off, then build a hash keyed on the md5sum of the cleaned copy, mapping back to the real file name. Any files with the same md5sum would have an identical MP3 audio stream.

    'id3v2 --delete-all' might help with this.
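
    Here is a rough sketch of doing the stripping in memory instead of via an external copy. It handles only a leading ID3v2 block and a trailing ID3v1 block (no ID3v2 footers or APE tags), and audio_md5 is just an illustrative helper name, so treat it as an approximation:

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        # MD5 of the audio stream only, ignoring the common ID3 tags
        sub audio_md5 {
            my ($file) = @_;
            open my $fh, '<', $file or return;
            binmode $fh;
            my $data = do { local $/; <$fh> };

            # ID3v2: "ID3", 2 version bytes, 1 flag byte, 4 synchsafe size bytes
            if ($data =~ /\AID3/) {
                my @s   = unpack 'x6 C4', $data;
                my $len = ($s[0] << 21) | ($s[1] << 14) | ($s[2] << 7) | $s[3];
                substr($data, 0, 10 + $len) = '';
            }

            # ID3v1: trailing 128-byte block starting with "TAG"
            substr($data, -128) = ''
                if length($data) > 128 && substr($data, -128, 3) eq 'TAG';

            return md5_hex($data);
        }

    Any two files for which this returns the same digest should share an identical audio stream, however they are tagged.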

    --
    bowling trophy thieves, die!

Re: Finding Redundant Files
by arden (Curate) on Feb 06, 2004 at 19:23 UTC
    I think to be as accurate as possible, you're going to have to go through a few passes with this program. First, as Roy Johnson put it, compute a checksum on every file and compare them: any files with duplicate checksums are truly identical. Next, compare the file names to find potential duplicates that may be untagged or inaccurately tagged. Finally, compare the ID3 tags to find duplicate copies of songs that have different file names and are slightly different (different remixes, missing the last few seconds, different bit rates, etc.).

    Most importantly, I think you need to have your code output a file for a human to review, not do the deleting itself. If/when you complete it, I'd like to suggest that you post it on PM. I'm sure there are hundreds of others who could benefit from that!

Re: Finding Redundant Files
by Zero_Flop (Pilgrim) on Feb 07, 2004 at 06:55 UTC
    Do a search for MP3::Tag and do a tag comparison.

    Comparing MD5 hashes will identify bit-wise identical files, but if two files are bit-wise identical their tags are identical too, so the MD5 check would be redundant. If the tags were hand-entered there may be some spelling errors, but if they were pulled from the Net they should be pretty consistent.

    Pull the tag fields, then normalize them by converting everything to CAPS. Also capture the size of each file in your hash; the larger the file, the higher the bit rate (probably).

    Now you can get rid of all of the dups while keeping the highest-quality copy, and then rename the files to a consistent nomenclature; a sketch of the comparison pass follows below.
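
    A sketch of that pass, assuming the CPAN module MP3::Tag (its autoinfo() method returns title, track, artist, and album in list context). The duplicate handling here only prints, leaving any deleting and renaming to a human:

        use strict;
        use warnings;
        use MP3::Tag;    # CPAN

        my %best;    # "ARTIST|TITLE" => [ size, file ] of the largest copy seen
        for my $file (@ARGV) {
            my $mp3 = MP3::Tag->new($file) or next;
            my ($title, undef, $artist) = $mp3->autoinfo();
            next unless defined $title && length $title;

            my $key  = uc join '|', ($artist // ''), $title;    # normalize to CAPS
            my $size = -s $file;    # bigger file => probably higher bit rate

            if (!$best{$key} or $size > $best{$key}[0]) {
                print "dup (smaller): $best{$key}[1]\n" if $best{$key};
                $best{$key} = [ $size, $file ];
            }
            else {
                print "dup (smaller): $file\n";
            }
        }
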
Re: Finding Redundant Files
by zentara (Cardinal) on Feb 06, 2004 at 23:59 UTC
    I've been using this dupfinder script to clean my MIDI file collection. It works great, but it doesn't recurse into subdirectories; you could modify it. Merlyn has some code on the following page too, which would give you a good start.

    dupfinder and dupseek.

      Being the original author, I am glad of it :-)

      You can download the latest version from my web site. It only finds exact duplicates, so you could use something along the lines of the trick suggested by Willard B. Trophy to avoid comparing tag info.

      As a final note, dupseek does indeed recurse subdirectories, using File::Find.

      Cheers

      Antonio

      The stupider the astronaut, the easier it is to win the trip to Vega - A. Tucket