in reply to Re^3: Comparing duplicate pictures in different directories
in thread Comparing duplicate pictures in different directories
If possible checksum collisions (i.e. false-alarm matches) are a concern, I think the potential for this is much reduced by supplementing the checksum with the file size: the likelihood of two files having the same size and the same checksum, despite having different content, is comfortably small. With that, the old MD5 approach should suffice. So (untested):

    use Digest::MD5 qw(md5_base64);

    # assume @files contains paths to all files (masters and possible dups)
    my %cksum;
    for my $file ( @files ) {
        my $size = -s $file;
        local $/;    # slurp mode: read the whole file in one go
        open( my $fh, '<', $file ) or do { warn "can't open $file: $!\n"; next };
        binmode $fh;    # pictures are binary data
        my $md5 = md5_base64( <$fh> );
        close $fh;
        push @{$cksum{"$md5 $size"}}, $file;
    }
    for ( keys %cksum ) {
        print "dups: @{$cksum{$_}}\n" if ( @{$cksum{$_}} > 1 );
    }

Granted, if there are relatively few duplications among M masters and N files to test, then applying diff or File::Compare M*N times could be pretty quick. But if there are lots of masters that each have multiple duplications, then diff or File::Compare would have to do a lot of repeated full reads of files to find them all.
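If you still want the certainty of a byte-for-byte check, one option (a sketch, also untested against a real photo collection) is to run File::Compare only within each group of files that already share a checksum and size, rather than across all M*N pairs. The helper name confirm_dups is hypothetical, made up for this example; compare() is the real File::Compare function, which returns 0 for identical files.

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# Hypothetical helper: split one candidate group (files that share a
# checksum and size) into sets whose contents are byte-for-byte identical.
sub confirm_dups {
    my @candidates = @_;
    my @sets;    # each element is an array ref of identical files
    FILE: for my $file ( @candidates ) {
        for my $set ( @sets ) {
            # compare() returns 0 when equal, 1 when different, -1 on
            # error; anything but 0 is treated as "not a duplicate" here
            if ( compare( $file, $set->[0] ) == 0 ) {
                push @$set, $file;
                next FILE;
            }
        }
        push @sets, [ $file ];    # no match so far: start a new set
    }
    return grep { @$_ > 1 } @sets;    # keep only sets with real duplicates
}
```

Calling confirm_dups( @{ $cksum{$_} } ) for each key of %cksum that holds more than one file would then report only confirmed duplicates. Since each file is compared against one representative per set, a group of k truly identical files costs k-1 comparisons instead of every pairing.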
Re^5: Comparing duplicate pictures in different directories
by merlyn (Sage) on Jun 20, 2005 at 00:34 UTC |