delete duplicates in windows subdirectories

casimo has asked for the wisdom of the Perl Monks concerning the following question:

The following code, taken directly from http://www.perlmonks.org/index.pl?node_id=111982, works fine:

use Digest::MD5;
use File::Find;

$|=1; #Autoflush ON!

my @list;
my %dupes;
my @delete;
my %digests;
my $ctx = Digest::MD5->new;

sub check_file {
    my $file=shift;

    $ctx->reset;
    open FILE,$file || die "Cant open $file!\n";
    binmode FILE;
    $ctx->addfile(*FILE);
    close FILE;
    my $digest = $ctx->hexdigest;

    if (exists($digests{$digest})) {
        print "\t$file is a dupe!\n";
        $dupes{$digest}->{$file}=1;
        push @delete,$file;
    } else {
        $digests{$digest}=$file;
    }
}

#CHANGE ME!!!
my $path='C:\Documents and Settings\user\Desktop\dir\subdir1\2010\01\01';

print "I am going to look for duplicates starting at ".$path."\n";
find({wanted=>sub{if (-f $_) {check_file($_)} 
                  else {print "Searching $_\n"}},
      no_chdir=>1},$path);

print "There are ".@delete." duplicate files to delete.\n";

# Uncomment the below line to lose the duplicates!
print "Deleted ".unlink(@delete)." files!";


What I am trying to do is modify the code to loop through each subdirectory, and I am having a problem:

@directories = grep -d,<*>;
foreach $directory (@directories) {

#CHANGE ME!!!
my $path='C:\Documents and Settings\user\Desktop\dir\\' . $directory . '\2010\01\01';

}

I get the duplicate message but the files are not being added to the @delete array. I feel like I'm missing something with the backslashes but can't seem to figure out the fix.

any ideas?


Re: delete duplicates in windows subdirectories by graff (Chancellor) on Jan 02, 2010 at 18:29 UTC
Why do you have two back-slashes next to each other after the "\dir"? What is the current working directory for the shell that runs your perl script?

You are using the glob operator in the script's current working directory in order to get the list of directories to search through, so (if windows works like any other normal OS) you shouldn't need to include all that initial stuff (`'C:\Documents and Settings\user\Desktop\dir\'`) when assigning a value to $path. Just put the value of $directory at the beginning of $path.

Apart from that, are you certain that every directory (or any directory) contained in the script's current working directory actually contains a sub-path called '2010\01\01'?

When you say "I get the duplicate message but the files are not being added to the @delete array.", do you mean the output is "There are 0 duplicate files to delete"? If so, the previous questions are relevant. Do you see the outputs that say "Searching (path_string)"? If so, have you checked and made sure that the path strings in those messages are paths that actually exist? If so, do those paths actually contain any duplicate files? If the answer is "no" to any of that, then the script would seem to be working, in the sense that it can't find any duplicate files to delete.

Update: BTW, your use of grep on the glob is not a good idea, I think, because it allows both "." and ".." to pass through into your @directories array. A better usage would be:

`@directories = grep { /[^.]/ && -d } <*>;  # list items must be directories whose names contain something besides '.'`

(updated again to put the regex match in front of the more expensive "-d" function call)
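The existence checks asked about above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original poster's script: `dated_subdirs` is a hypothetical helper name, and the `2010/01/01` layout is taken from the question. It lists the directories directly under a base directory, builds each dated path with File::Spec, and keeps only the paths that actually exist.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the dated sub-path of every directory
# directly under $base, skipping directories that do not contain it.
sub dated_subdirs {
    my ($base, @date_parts) = @_;

    opendir(my $dh, $base) or die "Cannot open directory $base: $!";
    # readdir returns '.' and '..' as well, so filter them out explicitly.
    my @names = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    my @paths;
    for my $name (sort @names) {
        my $dir = File::Spec->catdir($base, $name);
        next unless -d $dir;                 # plain files are ignored
        my $path = File::Spec->catdir($dir, @date_parts);
        next unless -d $path;                # no dated sub-path here
        push @paths, $path;
    }
    return @paths;
}

# Example: print the paths that would be searched from the current directory.
print "$_\n" for dated_subdirs('.', '2010', '01', '01');
```

Each returned path could then be handed to `find` in turn. Printing the list first makes it easy to see whether the paths being searched are the ones intended.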
Re^2: delete duplicates in windows subdirectories by casimo (Sexton) on Jan 02, 2010 at 22:07 UTC
> Why do you have two back-slashes next to each other after the "\dir"?

I believe that without it, the backslash would have just been protecting the quote instead of being a literal backslash.

> What is the current working directory for the shell that runs your perl script?

It is "dir", so it doesn't need to use the absolute path.

> Apart from that, are you certain that every directory (or any directory) contained in the script's current working directory actually contains a sub-path called '2010\01\01'?

The directories are based on the current date and are always there.

> When you say "I get the duplicate message but the files are not being added to the @delete array.", do you mean the output is "There are 0 duplicate files to delete"? If so, the previous questions are relevant.

Yes.

> Do you see the outputs that say "Searching (path_string)"?

Yes.
Re: delete duplicates in windows subdirectories by Anonymous Monk on Jan 02, 2010 at 18:21 UTC
> I feel like I'm missing something with the backslashes

`#CHANGED ME!!!`
`my $path="C:\\Documents and Settings\\user\\Desktop\\dir\\$directory\\2010\\01\\01";`

(untested)
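To compare the quoting styles discussed in this thread, here is a small sketch showing that the single-quoted and double-quoted spellings produce the same string. The value `subdir1` for `$directory` is only an illustrative assumption borrowed from the path in the question.

```perl
use strict;
use warnings;

my $directory = 'subdir1';   # assumed example value, not from a real listing

# Single quotes: a backslash is literal, except that \\ yields one backslash
# and \' yields a quote. The doubling is needed only before the closing quote.
my $single = 'C:\Documents and Settings\user\Desktop\dir\\' . $directory . '\2010\01\01';

# Double quotes: every literal backslash is doubled, and $directory interpolates.
my $double = "C:\\Documents and Settings\\user\\Desktop\\dir\\$directory\\2010\\01\\01";

# Forward slashes need no escaping at all.
my $forward = "C:/Documents and Settings/user/Desktop/dir/$directory/2010/01/01";

print "single and double quoted forms are identical\n" if $single eq $double;

# Flipping the slashes turns the forward-slash form into the same string.
(my $flipped = $forward) =~ tr{/}{\\};
print "forward-slash form matches after flipping\n" if $flipped eq $single;
```

Since all three spellings describe the same path, a mistake in the backslashes would change the string itself, which is easy to rule out by printing `$path` before calling `find`.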
Re^2: delete duplicates in windows subdirectories by casimo (Sexton) on Jan 02, 2010 at 21:58 UTC
I tried this and the files still don't get pushed to the array. I'm not sure whether paths with spaces (" ") cause the push to fail. There was also a place where the file path is added and contains a forward slash instead of a backslash, but I made a regex to flip it and it still didn't add to the array.
Re^3: delete duplicates in windows subdirectories by cdarke (Prior) on Jan 03, 2010 at 17:22 UTC
You need to start tracing the check_file subroutine. There is at least one questionable line in it:

`open FILE,$file || die "Cant open $file!\n";`

has a precedence problem. `||` binds more tightly than the list operator `open`, so the `die` is attached to `$file` rather than to the result of `open`. If the open fails then it will not execute the die (and we are not saying why the open failed). A better way is:

`open (FILE,$file) || die "Cant open $file: $!\n";`

or

`open FILE,$file or die "Cant open $file: $!\n";`

or even better

`open (my $fh, '<', $file) || die "Cannot open $file: $!";`

Then change the occurrences of `FILE` to `$fh`.

By the way, the extra '\' are required since '\' is a special character in perl strings (as in many languages). In most cases you can use '\\' or '/' in a Windows filename path, and even mix them in the same path. You probably do not need that RE.
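Putting the lexical filehandle and three-argument `open` together, the digest check might look something like the sketch below. The names `file_digest` and `find_dupes` are hypothetical, and the logic is a simplified stand-in for `check_file`, not a drop-in replacement for the original script.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return the MD5 hex digest of a file, dying with the OS error on failure.
sub file_digest {
    my ($file) = @_;
    # Low-precedence "or" applies to the result of open, so a failed
    # open really does reach the die.
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Return every file whose content matches an earlier file in the list.
sub find_dupes {
    my @files = @_;
    my (%first_seen, @dupes);
    for my $file (@files) {
        my $digest = file_digest($file);
        if (exists $first_seen{$digest}) {
            push @dupes, $file;
        } else {
            $first_seen{$digest} = $file;
        }
    }
    return @dupes;
}

# Example: report duplicates among the files named on the command line.
print "$_ is a dupe\n" for find_dupes(@ARGV);
```

Returning the list of duplicates instead of pushing onto a global array makes it easier to trace whether a given file was actually recorded.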