in reply to Scanning for duplicate files
I think size is too loose a selector for duplicate files. Coincidence is more likely than you may think for a collection of files with a common format, stereotyped content, or small size. Since you want to unlink the dupes, it would be advisable to play it safe.
An md5 digest is a better indicator. Here is one way to use it:
    my %cksums;
    # index each file's name under its md5 digest (first field of md5sum's output)
    push @{ $cksums{ (split ' ', `md5sum "$_"`)[0] } }, $_ for glob "$dir/*";
    # for every digest seen more than once, keep the first file and unlink the rest
    unlink( splice @{ $cksums{$_} }, 1 ) or die $! for grep { @{ $cksums{$_} } > 1 } keys %cksums;

This is fairly idiomatic. The first two statements construct a hash of arrays: the arrays contain the names of duplicate files, indexed by checksum (only the digest field of md5sum's output is used as the key, so files with identical content land in the same bucket). For each digest that indexes more than one file, we unlink the extra files pruned off by splice, dying if none of them can be removed.
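If you would rather not shell out to md5sum once per file, the core Digest::MD5 module computes the same digests in pure Perl. A minimal sketch along the same lines, assuming a $dir variable as above (the command-line default is hypothetical):

    use strict;
    use warnings;
    use Digest::MD5;

    my $dir = shift // '.';   # directory to scan (hypothetical default)
    my %cksums;
    for my $file (grep { -f } glob "$dir/*") {
        open my $fh, '<', $file or die "$file: $!";
        binmode $fh;          # digest the raw bytes, not a decoded text stream
        push @{ $cksums{ Digest::MD5->new->addfile($fh)->hexdigest } }, $file;
    }
    for my $digest (keys %cksums) {
        my @extra = splice @{ $cksums{$digest} }, 1;   # keep the first copy
        unlink @extra or die $! if @extra;
    }

This avoids spawning a process per file and sidesteps shell-quoting problems with odd filenames.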
After Compline,
Zaxo