Removing Duplicates from Hash of Hash of id3 information

smgfc has asked for the wisdom of the Perl Monks concerning the following question:

Update: This replaces the old scheme of storing mp3s in a hash with their paths as keys. It now uses a hash of a hash of a hash of a hash, where the keys are, respectively, artist, album, and title, and the final values are information on the track (path, size, time, track, and bitrate), as per arunhorne's suggestion. The original post is at my scratchpad.

Well, here is my solution to the original problem, so now everything works well. If you have any other suggestions I would love to hear them, even if this is now semi-misplaced (SoPW vs. Code).

The only problem with this code is that you can't include duplicates (id3 wise) anymore, even if you want to. I wouldn't be able to do that now without concatenating a number onto the title key in the big hash of hash of hash of mp3 info, and that is just ugly. If there is another way I would love to hear it.
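One possible way to keep duplicates without mangling the title key, sketched here under assumptions: let each title map to an array of track records instead of a single record. The sample records below are hypothetical stand-ins for what MP3::Info would supply; only core Perl is used.

```perl
use strict;
use warnings;

# Hypothetical sample records standing in for real tag/info lookups.
my @tracks = (
    { ARTIST => 'Artist A', ALBUM => 'Album X', TITLE => 'Song 1',
      PATH => '/music/a/song1.mp3',      SIZE => 4_000_000 },
    { ARTIST => 'Artist A', ALBUM => 'Album X', TITLE => 'Song 1',
      PATH => '/music/b/song1_copy.mp3', SIZE => 5_000_000 },
    { ARTIST => 'Artist A', ALBUM => 'Album X', TITLE => 'Song 2',
      PATH => '/music/a/song2.mp3',      SIZE => 3_500_000 },
);

my %mp3s;
for my $t (@tracks) {
    # Autovivification creates the intermediate hashes and the array.
    push @{ $mp3s{ $t->{ARTIST} }{ $t->{ALBUM} }{ $t->{TITLE} } },
        { PATH => $t->{PATH}, SIZE => $t->{SIZE} };
}

# Within each title, sort largest first: element 0 is the copy to keep,
# the remaining elements are the duplicates.
my @duplicates;
for my $artist (keys %mp3s) {
    for my $album (keys %{ $mp3s{$artist} }) {
        for my $title (keys %{ $mp3s{$artist}{$album} }) {
            my @copies = sort { $b->{SIZE} <=> $a->{SIZE} }
                         @{ $mp3s{$artist}{$album}{$title} };
            $mp3s{$artist}{$album}{$title} = \@copies;
            push @duplicates, map { $_->{PATH} } @copies[ 1 .. $#copies ];
        }
    }
}

print "duplicate: $_\n" for @duplicates;
```

With this layout every copy stays in the structure, and whether the extra copies are listed, ignored, or deleted becomes a decision made after the scan rather than during it.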

Another little issue would be sorting the tracks by track number when writing to a file, but I don't have a clue how to do that. Any help is more than welcome.
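A rough sketch of one way the sorting might be done: sort the title keys of a single album by the numeric part of each track field. The `%album` hash and the `track_number` helper are hypothetical names for illustration, and the assumption that a track field may look like `2/12` or be missing is mine, not something the script above guarantees.

```perl
use strict;
use warnings;

# Hypothetical album: title => track info, like the innermost levels of %mp3s.
my %album = (
    'Closing Song' => { TRACK => '10',   PATH => '/music/10.mp3' },
    'Opening Song' => { TRACK => '1',    PATH => '/music/01.mp3' },
    'Middle Song'  => { TRACK => '2/12', PATH => '/music/02.mp3' },
    'Untagged'     => { TRACK => undef,  PATH => '/music/xx.mp3' },
);

# Extract the leading digits of a track field; untagged tracks sort last.
sub track_number {
    my ($track) = @_;
    return 9999 unless defined $track && $track =~ /^(\d+)/;
    return $1;
}

# Numeric comparison on track number, falling back to title for ties.
my @sorted_titles = sort {
    track_number($album{$a}{TRACK}) <=> track_number($album{$b}{TRACK})
        or $a cmp $b
} keys %album;

print "$_\n" for @sorted_titles;
```

The numeric `<=>` comparison matters here: a plain string sort would put track 10 before track 2.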

#!/usr/bin/perl -w

use strict;
use Getopt::Std;
use MP3::Info;
use File::Find;

my ($file, @dirs, $tmp, %mp3s, @duplicates, @lines, %options, $artist,
    $album, $title);

getopts("f:m:rdh", \%options);

if ($options{h}) {
    print <<'eof';
    -f file: where to save the mp3 list
    -m dir: mp3 directory
    -r: rename the mp3 files according to their id3 tags
    -d: delete duplicates from the list of mp3s
    -h: this help document
eof
exit;
}

$file = $options{f} || "/Library/Webserver/Documents/mp3.info";
@dirs = $options{m} || ("/Volumes/Storage Drive/Music/");

find(\&id3info, @dirs);

sub id3info {
    return unless $File::Find::name =~ /\.mp3$/i;
    # get_mp3tag supplies the ID3 fields (the track number is TRACKNUM);
    # SIZE, TIME and BITRATE come from get_mp3info. Skip unreadable files.
    my $tag  = get_mp3tag($File::Find::name)  or return;
    my $info = get_mp3info($File::Find::name) or return;
    my ($artist, $album, $title) = @$tag{qw(ARTIST ALBUM TITLE)};
    my %track = (
        'PATH'    => $File::Find::name,
        'TRACK'   => $$tag{'TRACKNUM'},
        'TIME'    => $$info{'TIME'},
        'SIZE'    => $$info{'SIZE'},
        'BITRATE' => $$info{'BITRATE'}
    );
    my $old = $mp3s{$artist}{$album}{$title};
    if (!$old) {
        # First time this artist/album/title has been seen.
        $mp3s{$artist}{$album}{$title} = \%track;
    } elsif ($$old{'SIZE'} < $track{'SIZE'}) {
        # The new file is larger: keep it and mark the stored one as the duplicate.
        push @duplicates, $$old{'PATH'};
        $mp3s{$artist}{$album}{$title} = \%track;
    } else {
        push @duplicates, $File::Find::name;
    }
}

if ($options{d}) {
    delete_duplicates(\@duplicates);
}

if ($options{r}) {
    rename_mp3s(\%mp3s);
}

foreach $artist (keys %mp3s) {
    foreach $album (keys %{ $mp3s{$artist} }) {
        foreach $title (keys %{ $mp3s{$artist}{$album} }) {
            # Store the line itself (not the return value of print).
            push @lines, join('::', $artist, $album, $title,
                $mp3s{$artist}{$album}{$title}{'PATH'}) . "\n";
        }
    }
}

open (my $fh, '>', $file) or die ("Can't open $file: $!\n");
foreach (@lines) {
    print $fh $_;
}
close ($fh);

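sub rename_mp3s {
    my ($mp3s) = @_;
    foreach my $artist (keys %$mp3s) {
        foreach my $album (keys %{ $mp3s->{$artist} }) {
            foreach my $title (keys %{ $mp3s->{$artist}{$album} }) {
                my $track = $mp3s->{$artist}{$album}{$title};
                # Skip paths with no directory part to reuse.
                next unless $track->{'PATH'} =~ /^(.+\/)/;
                my $new = $1 . $artist . ' - ' . $title . '.mp3';
                if (rename $track->{'PATH'}, $new) {
                    # Keep the stored path in step with the file on disk.
                    $track->{'PATH'} = $new;
                } else {
                    warn "Couldn't rename $track->{'PATH'}: $!\n";
                }
            }
        }
    }
}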

sub delete_duplicates {
    my ($duplicates) = @_;
    foreach (@$duplicates) {
        unlink($_) or warn "Couldn't delete $_: $!\n";
    }
}

Replies are listed 'Best First'.
Re: Removing Duplicates from Hash of Hash of id3 information by arunhorne (Pilgrim) on May 27, 2002 at 11:03 UTC
Hi smgfc,

In answer to your suggestion to use the full path to an mp3 as the hash key: this is not a good idea. Think about a file system. Each file at a given location occurs once and only once, so if the path is equal, so too is the file. You won't be able to find duplicate songs residing in different locations like this.

My feeling is that you would be best to use a far purer key for the hashes. I suggest you create a first hash keyed on just the artist name, which maps to a second hash keyed on song title, which maps to a third hash of track info (which can be used to store detail about the individual MP3, such as its location).

What you are trying to build is a complex data structure, so I would urge you to look at perlman:perldsc and perlman:perllol. I might even go so far as to say that creating an object to store in your hash might be a good idea to encapsulate the data: see perlman:perltoot and perlman:perlobj.

Of course, any solution relies on good ID3 tags for the files, so some sort of handling routine for when these don't contain enough info will be necessary. That is, extract and check the ID3 tags before placing anything into the hash; don't do it on the fly. But then that's always the fun of trying to organise an MP3 collection.

As an aside, the best tool I know for refactoring MP3 tags is Tag&Rename. Check it out... very handy. Anyway, hope this helps.

____________
Arun
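The "extract and check the ID3 before placing into hash" step might look something like the sketch below. The `has_usable_tags` helper is a hypothetical name, and the tag hashes are hand-written samples in the shape `get_mp3tag` is expected to return, so treat this as an illustration rather than a drop-in routine.

```perl
use strict;
use warnings;

# Return true only if every field used as a hash key is present and non-blank.
sub has_usable_tags {
    my ($tag) = @_;
    return 0 unless ref($tag) eq 'HASH';
    for my $field (qw(ARTIST ALBUM TITLE)) {
        return 0 unless defined $tag->{$field} && $tag->{$field} =~ /\S/;
    }
    return 1;
}

# Hypothetical tag hashes; undef stands for a file with no ID3 tag at all.
my @candidates = (
    { ARTIST => 'Artist A', ALBUM => 'Album X', TITLE => 'Song 1' },
    { ARTIST => 'Artist A', ALBUM => '',        TITLE => 'Song 2' },
    undef,
);

# Sort files into those safe to index and those needing manual attention.
my (@good, @needs_attention);
for my $tag (@candidates) {
    if (has_usable_tags($tag)) { push @good, $tag }
    else                       { push @needs_attention, $tag }
}

printf "%d usable, %d need manual tagging\n",
    scalar @good, scalar @needs_attention;
```

Files that fail the check could be written to a separate list for fixing with a tag editor, instead of ending up under an empty or undefined key.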
Re: Removing Duplicates from Hash of Hash of id3 information by jonknee (Monk) on May 27, 2002 at 13:45 UTC
My suggestion is that once you find out how to weed out the duplicates in Perl (I can't help you there), you write a quick script to print out all the dupes for your friend. That way he can delete them (if they are incomplete, why keep them?). You should keep the duplicate detection in the main script too, but weeding out the dupes every few weeks would be worthwhile.

-Jonathan Gales