Re: Finding Redundant Files
by Roy Johnson (Monsignor) on Feb 06, 2004 at 18:59 UTC
Roy Johnson,
Great idea. I would just expand it a little more. I would have a structure that looked like this:
my %mp3 = (
    byname => {},
    bymd5  => {},
);
Again, as you stated, each key in the second-level hash would be an array reference to a list of matching files. The difference here is that you will also get a list of duplicate file names in different directories that may not be the same song, which can cause problems when you try to merge the directories. I would suggest the following modules:
Cheers - L~R
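A minimal sketch of the two-key structure described above, using the core Digest::MD5 module. The sub name and file paths are illustrative, not from the original post:

```perl
use strict;
use warnings;
use Digest::MD5;

# Index a list of files both by basename and by MD5 digest.
# Each second-level value is an array ref of matching paths,
# as described in the post above.
sub index_files {
    my @files = @_;
    my %mp3 = ( byname => {}, bymd5 => {} );
    for my $path (@files) {
        (my $name = $path) =~ s{.*/}{};          # strip leading directories
        push @{ $mp3{byname}{$name} }, $path;

        open my $fh, '<:raw', $path or next;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        push @{ $mp3{bymd5}{$digest} }, $path;
    }
    return \%mp3;
}
```

Any `byname` entry holding more than one path is a name collision across directories; any `bymd5` entry holding more than one path is a bitwise-identical duplicate.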
Because tags are stored inside the MP3 files themselves, MD5 checksums will not help if you can't check for duplicates via the tags. I.e., if one file's title tag is "Yellow Sub" and another's is "Yellow Submarine", an MD5 hash will report the two files as different even if the actual audio data portion of the MP3 is exactly the same. I would suggest using tag matching for exact duplicates, and maybe a hash table keyed on soundex (or some variant) of each tag to get a list of possible dups that you can winnow through by hand.
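One sketch of the "soundex or some variant" idea: a crude homemade normalization key (lowercase, strip punctuation and articles) so that near-duplicate titles collide in the hash. This is an assumption-laden stand-in for real soundex; Text::Soundex on CPAN implements the classic algorithm:

```perl
use strict;
use warnings;

# Reduce a title tag to a crude normalized key so near-duplicates
# collide in a hash.  A simple variant, NOT real soundex.
sub fuzzy_key {
    my ($title) = @_;
    my $key = lc $title;
    $key =~ s/[^a-z0-9 ]//g;         # drop punctuation
    $key =~ s/\b(?:the|a|an)\b//g;   # drop common articles
    $key =~ s/\s+//g;                # drop whitespace
    return $key;
}
```

Note this key will not catch "Yellow Sub" vs. "Yellow Submarine" either; it only flags *candidates* for the hand review the post recommends.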
Re: Finding Redundant Files
by Willard B. Trophy (Hermit) on Feb 06, 2004 at 20:19 UTC
Re: Finding Redundant Files
by arden (Curate) on Feb 06, 2004 at 19:23 UTC
I think to be as accurate as possible, you're going to have to go through a few cycles with this program. First, as Roy Johnson put it, compute a checksum on every file and compare them: any duplicate checksums are truly identical files. Next, compare the file names to find potential duplicates that may be untagged or inaccurately tagged. Finally, compare the ID tags to find duplicate copies of songs that have different filenames and may be slightly different (different remixes, missing the last few seconds, different bit rates, etc.).
Most importantly, I think you need to have your code output a file for a human to review, not do the deleting itself. If/when you complete it, I'd like to suggest that you post it on PM. I'm sure there are hundreds of others who could benefit from that!
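The first pass above (checksum groups written out for a human, no deleting) might look like this sketch; the sub name is hypothetical and Digest::MD5 is core Perl:

```perl
use strict;
use warnings;
use Digest::MD5;

# Group files by MD5 checksum and build a plain-text report of the
# identical groups for a human to review -- nothing is deleted.
sub report_identical {
    my @files = @_;
    my %by_md5;
    for my $path (@files) {
        open my $fh, '<:raw', $path or next;
        push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
    }
    my @lines;
    for my $sum (sort keys %by_md5) {
        next unless @{ $by_md5{$sum} } > 1;    # only true duplicates
        push @lines, "$sum:", map { "  $_" } @{ $by_md5{$sum} };
    }
    return join "\n", @lines;
}
```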
Re: Finding Redundant Files
by Zero_Flop (Pilgrim) on Feb 07, 2004 at 06:55 UTC
Do a search for MP3::Tag and do a tag comparison.
Comparing MD5 hashes will identify bitwise-identical files. If the files are identical, the tags will be identical, so the MD5 would be redundant. If the tags were hand-entered there may be some spelling errors, but if they were pulled from the Net they should be pretty consistent.
Pull the tag fields, then normalize the tag names by uppercasing them. Also capture the size of each file in your hash: the larger the file, the higher the bit rate (probably).
Now you can get rid of all of the dups but keep the highest-quality copy.
You can now rename the files to a consistent nomenclature.
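The "keep the biggest file per normalized tag" step above could be sketched like this; the sub name and the uppercased keys are illustrative, and "bigger file = higher bit rate" is the post's own heuristic, not a guarantee:

```perl
use strict;
use warnings;

# Given a map of normalized (uppercased) tag keys to candidate paths,
# keep only the largest file per key, on the assumption that a bigger
# file usually means a higher bit rate.
sub keep_largest {
    my (%candidates) = @_;   # 'ARTIST - TITLE' => [ paths... ]
    my %keep;
    for my $key (keys %candidates) {
        my ($best) = sort { -s $b <=> -s $a } @{ $candidates{$key} };
        $keep{$key} = $best;
    }
    return \%keep;
}
```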
Re: Finding Redundant Files
by zentara (Cardinal) on Feb 06, 2004 at 23:59 UTC
I've been using this dupfinder script to clean my MIDI file collection. It works great, but it doesn't recurse dirs. You could modify it. Merlyn has some code on the following page too, which would give you a good start:
dupfinder and dupseek.
Being the original author, I am glad of it :-)
You can download the latest version from my web site. It only finds exact duplicates, so you could use something along the lines of the trick suggested by Willard B. Trophy to avoid comparing tag info.
As a final note, dupseek does indeed recurse subdirectories, using File::Find.
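Recursing subdirectories with the core File::Find module, as mentioned above, can be as simple as this sketch (the sub name is illustrative, not dupseek's actual code):

```perl
use strict;
use warnings;
use File::Find;

# Collect every .mp3 (case-insensitive) under a root directory,
# descending into subdirectories via the core File::Find module.
sub find_mp3s {
    my ($root) = @_;
    my @found;
    find(
        sub { push @found, $File::Find::name if /\.mp3$/i && -f },
        $root,
    );
    return sort @found;
}
```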
Cheers
Antonio
The stupider the astronaut, the easier it is to win the trip to Vega - A. Tucket