Remove Duplicate Files

by jfroebe (Parson)
on Oct 29, 2004 at 02:38 UTC ( [id://403580] )
Category: Utility Scripts
Author/Contact Info

Jason L. Froebe
jason@froebe.net

Description:

Searches the directories given on the command line and removes duplicate files. It remembers previous runs in a compressed, delimited cache file and can prune cache entries that point to nonexistent files.

A summary is printed at the end:

Loaded 93031 entries

TOTAL files: 93030
Added files: 0
Deleted files: 0
Files not found: 0
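
Assuming the script below is saved as dups.pl, a typical invocation would be (the script name and paths here are only illustrative):

  perl dups.pl /home/jason/photos /mnt/backup/photos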

#!/usr/bin/perl

use warnings;
use strict;

use Digest::MD5;
use File::Find;
use PerlIO::gzip;

our ( $md5_file_ref, $filename_md5_ref,
      $file_count, $file_add, $file_del,
      $file_lost );

sub find_found;
sub load_md5 ($);
sub save_md5 ($);

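# Write the MD5 => filename cache out as a gzip-compressed, '|||'-delimited file.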
sub save_md5 ($) {
 my $FILENAME = shift;

 my $FILE;
 open $FILE, ">:gzip", $FILENAME
   or die "Unable to write $FILENAME: $!\n";

 foreach (keys %$md5_file_ref) {
   printf $FILE "%s|||%s\n", $_, $md5_file_ref->{$_};
 }
 close $FILE;
}

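# Read the cache back in, rebuilding both the md5 => file and file => md5 lookups.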
sub load_md5 ($) {
 my $FILENAME = shift;

 my $FILE;
 my $count = 0;

 open $FILE, "<:gzip", $FILENAME
   or die "Unable to read $FILENAME: $!\n";

 while (<$FILE>) {
   chomp;
   my ($tmp_col1, $tmp_col2) = split '\|\|\|', $_;
   $md5_file_ref->{$tmp_col1} = $tmp_col2;
   $filename_md5_ref->{$tmp_col2} = $tmp_col1;
   $count++;
 }
 close $FILE;
 print "Loaded $count entries\n";
}

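# Drop cache entries whose files no longer exist on disk.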
sub verify_files {
 foreach my $FILE (keys %$filename_md5_ref) {
   unless (-f $FILE) {
    my $md5 = $filename_md5_ref->{$FILE};
    delete $md5_file_ref->{$md5};
    delete $filename_md5_ref->{$FILE};
    print "   *** Not found: $FILE\n";
    $file_lost++;
   }
 }
}

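# File::Find callback: MD5 each readable plain file not already in the cache;
# delete it if the digest has been seen before, otherwise record it.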
sub find_found {
 my $FILE = $_;

 my $file_md5;

 if ( -r $FILE && -f $FILE) {
   unless ($filename_md5_ref->{$FILE}) {
      open( my $fh, '<', $FILE )
        or return;
      binmode($fh);
      $file_md5 = Digest::MD5->new->addfile($fh)->hexdigest;
      close($fh);
     if ( $md5_file_ref->{$file_md5}) {
       chmod(0666, $FILE);
       if (unlink $FILE) {
         print "\n",
           "     *** DELETING ***\n",
           "   Duplicate file: $FILE\n",
           "     *** DELETING ***\n\n";
         $file_del++;
       } else {
         warn "Unable to delete $FILE\n\n";
       }
      } else {
       print "Added $file_md5 $FILE\n";
       $md5_file_ref->{$file_md5} = $FILE;
       $filename_md5_ref->{$FILE} = $file_md5;
       $file_add++;
     }
  }
  $file_count++;
 }
}

#####################
#####################
#####################

$file_count = 0;
$file_del = 0;
$file_add = 0;
$file_lost = 0;

my $FILE = "dups.csv.gz";

if ( -r $FILE && -f $FILE) {
 load_md5($FILE);
 verify_files;
}

find {
 bydepth         => 1,
 no_chdir        => 1,
 wanted          => \&find_found
} => @ARGV;

print "\nTOTAL files: $file_count\n";
print "  Added files: $file_add\n";
print "  Deleted files: $file_del\n";
print "  Files not found: $file_lost\n\n";
save_md5($FILE);
Replies are listed 'Best First'.
Re: Remove Duplicate Files
by gaal (Parson) on Oct 29, 2004 at 07:07 UTC
    MD5 collisions are rare, but they can happen. If you want to be really safe, your storage should not just keep track of seen hashes; it should make them the key of a list of files that have those hashes. Then when you detect a seen hash, you should byte-compare the new file with all the existing files on that list.

    This, of course, is slower, adds complexity, and will rarely be useful; but personally, I want code that deletes files automatically to be correct!
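
    A minimal sketch of that idea (the hash and sub names here are illustrative, not part of the script above): key the cache by digest, keep a list of files per digest, and confirm any digest match with File::Compare before deleting anything.

    use strict;
    use warnings;
    use Digest::MD5;
    use File::Compare;

    my %files_by_md5;   # hex digest => array ref of file names seen so far

    sub is_duplicate {
        my $file = shift;

        open my $fh, '<', $file or return 0;
        binmode $fh;
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;

        # A matching digest is only a candidate; confirm with a byte compare.
        for my $seen (@{ $files_by_md5{$md5} || [] }) {
            return 1 if compare($file, $seen) == 0;   # 0 means identical
        }

        push @{ $files_by_md5{$md5} }, $file;
        return 0;
    }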

      Agreed. You can make it a lot more efficient by stat()ing all the files and only bothering to compare the contents of those which are the same size. Another small improvement can come from noting that those with the same device number and inode number are guaranteed to be the same so no need to compare their contents, although this may not be portable to non-Unixy platforms.

      You should also be careful about how you compare symlinks and device files.
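
      One way to sketch the size-and-inode idea (again, the names are only illustrative): group files by size, and skip the content comparison for anything that shares a device and inode number with a file already seen.

      use strict;
      use warnings;

      my %files_by_size;   # size in bytes => array ref of file names seen so far

      sub comparison_candidates {
          my $file = shift;
          my @st   = stat $file or return;
          my ($dev, $ino, $size) = @st[0, 1, 7];

          my @same_size = @{ $files_by_size{$size} || [] };
          push @{ $files_by_size{$size} }, $file;

          my @candidates;
          for my $other (@same_size) {
              my ($odev, $oino) = (stat $other)[0, 1];
              # Same device and inode means a hard link to the same data.
              next if defined $odev && $odev == $dev && $oino == $ino;
              push @candidates, $other;
          }
          return @candidates;   # only these need their contents compared
      }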

        A further improvement can be made by reading in just the first 1024 bytes or so and calculating the MD5 from that. Only if those match do you do a full comparison (a rough sketch follows below).
        Then again, hardlinks are less of a concern for cleanup, because they don't waste disk space.
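
        A rough sketch of that prefilter (the 1024-byte read size and the sub name are arbitrary choices):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        # Cheap prefilter: digest only the first 1024 bytes. Files whose prefix
        # digests differ cannot be identical; only matching prefixes need a
        # full-content comparison afterwards.
        sub prefix_md5 {
            my $file = shift;
            open my $fh, '<', $file or return;
            binmode $fh;
            read $fh, my $buf, 1024;
            close $fh;
            return md5_hex(defined $buf ? $buf : '');
        }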
Re: Remove Duplicate Files
by ihb (Deacon) on Oct 30, 2004 at 20:25 UTC
Re: Remove Duplicate Files
by jfroebe (Parson) on Oct 29, 2004 at 02:47 UTC

    I guess you could run "dups.pl /home/jason /home/jason" in lieu of "rm -rf /home/jason"... ;-)

    Jason L. Froebe

    Team Sybase member

    No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1
