Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have this code (from here, which was done for me; I'd log in, but the system won't send me a new password). My question: does the code look correct for finding duplicate files? Based on the output I'm getting, it looks like it is checking size first instead of name. And, if I may, a second question: how do I output paths with the slashes leaning the Windows way? The output is e.g. c:/Temp\testfolder. I'm using the code through a C# GUI, and the output needs to use \ throughout the whole path.
#!/usr/bin/perl -w
use strict;
use File::Find;
# No warnings so the output doesn't state which folders it
# can't access (Windows issue)
no warnings 'File::Find';
use Digest::MD5;

local $| = 1;
#**************** File Scope Variables *********************
my $path = $ARGV[0];
my $testpath = 'C:/Temp/';
#***********************************************************

my %files;
my $wasted = 0;
print "Searching for duplicate files on $ARGV[0]\n";
find(\&check_file, $path);
#find(\&check_file, $ARGV[0] || $usbstick);

local $" = ",";
foreach my $size (sort {$b <=> $a} keys %files) {
    next unless @{$files{$size}} > 1;
    my %md5;
    foreach my $file (@{$files{$size}}) {
        open(FILE, $file) or next;
        binmode(FILE);
        push @{$md5{Digest::MD5->new->addfile(*FILE)->hexdigest}},$file."\n";
    }
    foreach my $hash (keys %md5) {
        next unless @{$md5{$hash}} > 1;
        print "\n@{$md5{$hash}}";
        print "File size $size";
        $wasted += $size * (@{$md5{$hash}} - 1);
    }
}

1 while $wasted =~ s/^([-+]?\d+)(\d{3})/$1,$2/;
print "\n$wasted bytes in duplicated files\n";

sub check_file {
    -f && push @{$files{(stat(_))[7]}}, $File::Find::name;
}

Re: File::Find duplicate question
by Loops (Curate) on Oct 24, 2014 at 23:06 UTC

    Hi there. Two files of different sizes cannot be duplicates, so it makes sense to compare only files of the same size; that is why they are grouped and output in that order. Within each size, the output order comes from traversing a Perl hash, which is essentially random. To force all '/' characters to '\' in the output names, you could change the check_file sub to:

    sub check_file {
        (my $fn = $File::Find::name) =~ tr#/#\\#;
        -f && push @{$files{(stat(_))[7]}}, $fn;
    }
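The copy-then-translate idiom used above can be tried on its own. This is only a minimal sketch: the helper name to_backslashes is hypothetical and not part of the original script, and the sample path is the mixed one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return a copy of
# $path with every '/' replaced by '\', leaving the argument untouched.
sub to_backslashes {
    my ($path) = @_;
    (my $copy = $path) =~ tr{/}{\\};   # assign first, then translate the copy
    return $copy;
}

# The mixed-separator path from the question; single quotes keep the
# backslash literal.
my $mixed = 'c:/Temp\testfolder';
print to_backslashes($mixed), "\n";
```

Because the translation is applied to a copy, $File::Find::name itself is never modified, which seems the safer choice inside a File::Find callback.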
      Thanks for replying. It must just be my logic: I really thought you'd check the file names first, store them in an array, find the duplicate names, and then MD5 those to see if they are indeed the same. Thanks for the little piece of code.
      (my $fn = $File::Find::name) =~ tr#/#\\#;
      I have a few scripts using File::Find, so this is handy.
Re: File::Find duplicate question
by crashtest (Curate) on Oct 24, 2014 at 23:09 UTC

    "Does it look correct?" Sure! But that's easy for me to say. Can't you test it on some examples and see if it gives you your expected output? If it doesn't, then you have a specific example where it goes wrong, and it would be easier to help you.

    The program first groups files with identical sizes, then compares their MD5 hashes to identify duplicates, starting with the largest files. So yes, it checks file sizes first, which lets it report the worst space hogs quickly. Seems like a nice feature.
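The size-then-MD5 strategy described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the original script: the name duplicate_groups is hypothetical, it takes an explicit list of paths instead of walking a directory with File::Find, and it uses lexical filehandles with three-argument open.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: given a list of file paths, return a list of
# array refs, each holding the paths of files with identical contents.
sub duplicate_groups {
    my @paths = @_;

    # Step 1: bucket by size. Files of different sizes cannot match,
    # so this avoids hashing most files at all.
    my %by_size;
    for my $path (@paths) {
        next unless -f $path;
        my $size = (stat _)[7];          # reuse the stat done by -f
        push @{ $by_size{$size} }, $path;
    }

    # Step 2: within each bucket of two or more files, bucket by MD5.
    my @groups;
    for my $size (sort { $b <=> $a } keys %by_size) {   # largest first
        my @same_size = @{ $by_size{$size} };
        next unless @same_size > 1;

        my %by_md5;
        for my $path (@same_size) {
            open my $fh, '<', $path or next;   # skip unreadable files
            binmode $fh;
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            push @{ $by_md5{$digest} }, $path;
        }

        # Sorted keys make the order within one size repeatable.
        for my $digest (sort keys %by_md5) {
            push @groups, $by_md5{$digest} if @{ $by_md5{$digest} } > 1;
        }
    }
    return @groups;
}

# Usage: pass candidate files on the command line.
for my $group (duplicate_groups(@ARGV)) {
    print join(", ", @$group), "\n";
}
```

Comparing names first, as the question suggests, would miss duplicates stored under different names; size is a cheap property that identical files are guaranteed to share.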

    As for getting Windows paths, a simple approach would be to do a substitution on the file name in line 28:

    push @{$md5{Digest::MD5->new->addfile(*FILE)->hexdigest}},
        $file =~ tr!/!\\!r . "\n";   # <-- instead of: $file . "\n" (tr///r needs Perl 5.14+)

      Thanks for replying
Re: File::Find duplicate question
by 2teez (Vicar) on Oct 25, 2014 at 02:48 UTC

    ..And if I may a second question, how do I output paths with the slash leaning the Windows way?
    Instead of doing it yourself, use either "canonpath" from File::Spec

    ...
    use File::Spec;
    ...
    sub check_file {
        -f && push ... , File::Spec->canonpath($File::Find::name);
    }
    or Path::Tiny
    ...
    use Path::Tiny;
    ...
    sub check_file {
        -f && push ... , path($File::Find::name)->canonpath;
    }
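One thing worth knowing about the File::Spec route: File::Spec->canonpath follows the rules of the platform Perl is running on, so it only produces backslashes when the script runs on Windows. The sketch below calls File::Spec::Win32 (also a core module) directly so the effect can be seen on any OS. The helper name win_path and the sample path are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use File::Spec::Win32;

# Hypothetical helper: normalise a path using Windows rules, regardless
# of the platform this script happens to run on.
sub win_path {
    my ($path) = @_;
    return File::Spec::Win32->canonpath($path);
}

# A relative path with mixed separators, as File::Find can produce when
# the start directory was typed with a backslash.
print win_path('Temp/testfolder\file.txt'), "\n";
```

In the actual script, which runs on Windows anyway, plain File::Spec->canonpath as shown above should be sufficient.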
    Moreover, with -w on your shebang line, no warnings as you have used it won't work. You probably want use warnings; instead.

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
      Thanks for replying. I didn't know about File::Spec or Path::Tiny. I'm not sure I have access to those modules at work, but I'll check them out.

        You will probably have to install Path::Tiny, but you definitely have File::Spec, which has been in the Perl core since perl 5.00405; I'm sure you have something newer.
        That said, I do recommend taking a look at Path::Tiny.
