Karger78 has asked for the wisdom of the Perl Monks concerning the following question:

A while back I was trying to do a file compare and it just wasn't working the way I wanted it to, so I decided to try a different approach. What I am doing is building an array for each of two different UNC paths and populating it with the file list. Then I would like to compare one array against the other and see whether any of the files are the same at a binary level. Say something like: read the first 1000 bytes of each file and store them in some sort of 2D array or something of that nature. Once the lists are built, compare all the files/binary data against each other and check whether the same file is in both locations, and whether there are any files in one location but not the other. I started to play with the code and this is as far as I got:
use File::Find::Rule;

my @files = File::Find::Rule->file()->in($logShare);
foreach my $test (@files) {
    open FILE, $test or die $!;
    binmode FILE;
    my ($buf, $data, $n);
    while (($n = read FILE, $data, 1000) != 0) {
        print "$n bytes read\n";
        print "$test \n";
        $buf .= $data;
    }
    close(FILE);
}
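A minimal sketch of how that approach could be completed (the second share name $logShare2 is an assumption, and matching the first 1000 bytes only narrows things down; files that match would still need a full byte-for-byte or digest compare to be certain):

use File::Find::Rule;

# Build a hash of path => first-1000-bytes fingerprint for one share.
sub fingerprints {
    my ($share) = @_;
    my %fp;
    for my $file ( File::Find::Rule->file()->in($share) ) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        binmode $fh;
        my $data = '';
        read $fh, $data, 1000;
        close $fh;
        $fp{$file} = $data;
    }
    return %fp;
}

my %fp1 = fingerprints($logShare);
my %fp2 = fingerprints($logShare2);    # hypothetical second UNC path

# Any file whose leading bytes match nothing on the other side
# cannot be present there; report it.
my %seen2 = reverse %fp2;              # fingerprint => one path from share 2
for my $file ( sort keys %fp1 ) {
    print "$file has no match in the second location\n"
        unless exists $seen2{ $fp1{$file} };
}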

Re: file comparison using file open in binary mode.
by gmargo (Hermit) on Nov 27, 2009 at 18:07 UTC

    Perhaps avoid reinventing the wheel and just use MD5 or SHA1 sums? See Digest or relatives.
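    For instance, a minimal sketch using the core Digest::MD5 module; only the bytes of the file go into the digest, never its name:

    use Digest::MD5;

    # Return the hex MD5 of a file's contents.
    sub md5_of {
        my ($file) = @_;
        open my $fh, '<', $file or die "Cannot open $file: $!";
        binmode $fh;
        my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        return $hex;
    }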

      I have tried MD5 etc.; the problem is that it appears to include the name in the MD5, so if both files are equal but the names are different, the MD5 hashes appear to be different.

        That is not the case. Only the content matters.

        gmargo@tesla 1368$ ls -l perl10.pl
        -rw-r--r-- 1 gmargo gmargo 1150 Oct 16 21:43 perl10.pl
        gmargo@tesla 1369$ cp perl10.pl perl10a.pl
        gmargo@tesla 1370$ md5sum -b perl10.pl perl10a.pl
        976ed3393d1e967b2d8b4432c92b1397 *perl10.pl
        976ed3393d1e967b2d8b4432c92b1397 *perl10a.pl
        gmargo@tesla 1371$ sha1sum -b perl10.pl perl10a.pl
        1f62d2bcc8dcdf3f9ef0f3728f9f99c85eb21d81 *perl10.pl
        1f62d2bcc8dcdf3f9ef0f3728f9f99c85eb21d81 *perl10a.pl
Re: file comparison using file open in binary mode.
by johngg (Canon) on Nov 27, 2009 at 22:59 UTC

    Rather than making a simple hash of all files and their MD5s for each directory it would be more efficient if you made a HoA for each directory with the keys being file size and the values being anonymous arrays of files of that size. Then, rather than comparing MD5s of every file in one array against those in the other, you only need to compare sets of files of the same size; if the sizes differ the files can't be identical so there's no need to compare them! This has the effect of sharply reducing the number of comparisons you have to make. You could pare down each hash, removing keys that weren't common to both in order to remove files from consideration that could not possibly be duplicated.

    This gives another efficiency gain because an inode lookup to get the size of a file is much cheaper than calculating an MD5 sum, so you only have to do the expensive MD5s when both hashes contain file sets of the same size that must be compared.
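    As a rough sketch of that first stage (the directory names are placeholders only):

    use File::Find::Rule;

    # Build a hash of size => [ files of that size ] for one directory.
    sub sizes_in {
        my ($dir) = @_;
        my %bySize;
        for my $file ( File::Find::Rule->file()->in($dir) ) {
            push @{ $bySize{ -s $file } }, $file;
        }
        return \%bySize;
    }

    my $sizesA = sizes_in('dirA');    # placeholder directory names
    my $sizesB = sizes_in('dirB');

    # Pare each hash down to sizes seen in both directories;
    # anything else cannot have a duplicate on the other side.
    for my $size ( keys %$sizesA ) {
        delete $sizesA->{$size} unless exists $sizesB->{$size};
    }
    for my $size ( keys %$sizesB ) {
        delete $sizesB->{$size} unless exists $sizesA->{$size};
    }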

    Once you reach the comparison stage you could process the hashes a size at a time, following zwon's idea of creating hashes keyed by MD5 with anonymous arrays of filenames as the value. Or perhaps a HoHoA structure, something like

    %fileset = (
        '976ed3393d1e967b2d8b4432c92b1397' => {
            'dirA' => [ 'fileA', 'fileC', ],
            'dirB' => [ 'fileX', ],
        },
        'dc92b13976ed67b1e98b44322d339397' => {
            'dirA' => [ 'fileG', ],
        },
    );
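    Walking such a structure to report the cross-directory duplicates might then look something like this (a sketch only):

    for my $md5 ( keys %fileset ) {
        my $dirs = $fileset{$md5};
        next unless exists $dirs->{dirA} && exists $dirs->{dirB};
        for my $fileA ( @{ $dirs->{dirA} } ) {
            for my $fileB ( @{ $dirs->{dirB} } ) {
                print "$fileA in dirA matches $fileB in dirB\n";
            }
        }
    }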

    I hope these ideas are helpful.

    Cheers,

    JohnGG

    Update: Augmented the language in the first paragraph to make it clearer that it is the file size that determines whether a comparison is necessary or not.

      JohnGG, thanks for the idea. However, this needs to be done on a file-by-file basis, since a file could have been renamed while its size stays the same. This is just a small application, so it won't be handling a massive number of files. This is what I have come up with so far. I build the hashes, which works great, but I am still stumped on the compare. First I tried to figure out which hash is bigger and use that one for the foreach loop so it will go through all the files. However, this is still not working. There must be an easy way that I am missing to compare two hashes (specifically their values) and add the differences to another hash/array that I could use.
      my %hash1;
      my %hash2;
      my $hash1Count = 0;
      my $hash2Count = 0;
      foreach my $FL (@remoteFilelist) {
          push @md51, md5sum($FL);
          $hash1{$FL} = md5sum($FL);
          $hash1Count++;
      }
      foreach my $FL2 (@return) {
          push @md52, md5sum($logSite.$FL2);
          $hash2{$logSite.$FL2} = md5sum($logSite.$FL2);
          $hash2Count++;
      }
      if ($hash1Count >= $hash2Count) {
          foreach my $key ( keys %hash1 ) {
              if (!exists($hash2{$key})) {
                  my $temp = $hash2{$key};
                  push (@finalCompareArray, $temp);
              }
          }
      } else {
          foreach my $key ( keys %hash2 ) {
              if (!exists($hash1{$key})) {
                  my $temp = $hash1{$key};
                  push (@finalCompareArray, $temp);
              }
          }
      }
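      One sketch of that value-based compare, assuming %hash1 and %hash2 are built as above (path => MD5); it inverts each hash into MD5 => path so the lookups are by digest rather than by name:

      my %byMd5_1 = reverse %hash1;    # md5 => one path from location 1
      my %byMd5_2 = reverse %hash2;    # md5 => one path from location 2

      # Files whose content (digest) has no counterpart in the other location.
      my @onlyIn1 = grep { !exists $byMd5_2{ $hash1{$_} } } keys %hash1;
      my @onlyIn2 = grep { !exists $byMd5_1{ $hash2{$_} } } keys %hash2;

      print "Only in first location: $_\n"  for @onlyIn1;
      print "Only in second location: $_\n" for @onlyIn2;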
Re: file comparison using file open in binary mode.
by gmargo (Hermit) on Nov 30, 2009 at 21:00 UTC

    Since you seem to be making progress, and are perhaps just stumbling over some hash comparison issues, I took a whack at writing the code. I incorporated some of the ideas from zwon and johngg but mostly I just made it up as I went along.

    Note how I tried to break down the problem into smaller problems each handled by a subroutine.

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use diagnostics;

    # Given:
    #   Two arrays of filenames with full path.
    #
    # Goal:
    #   Find identical files between the two lists.  Cannot compare by name.
    #
    # Strategy:
    #   1. Gather size information on every file.
    #   2. For any files between lists with identical sizes, gather digest information.
    #   3. For any files between lists with identical digests, print info.

    use Digest;

    # Lists of files to compare.
    my @FileList1 = populate_file_list1();    # Fill out file list somehow.
    my @FileList2 = populate_file_list2();    # Fill out file list somehow.

    # Find duplicate sizes.
    my (%Sizes1, %Sizes2);
    find_sizes(\@FileList1, \%Sizes1);
    find_sizes(\@FileList2, \%Sizes2);
    my @duplicate_sizes = find_duplicate_keys(\%Sizes1, \%Sizes2);

    # Create list of files to calculate the digest on.
    my (@SizedFileList1, @SizedFileList2);
    foreach my $size (@duplicate_sizes) {
        push @SizedFileList1, @{ $Sizes1{$size} };
        push @SizedFileList2, @{ $Sizes2{$size} };
    }

    # Find duplicate digests.
    my (%Digests1, %Digests2);
    find_digests(\@SizedFileList1, \%Digests1);
    find_digests(\@SizedFileList2, \%Digests2);
    my @duplicate_digests = find_duplicate_keys(\%Digests1, \%Digests2);

    # Print cross-directory digest duplicates.
    foreach my $digest (@duplicate_digests) {
        foreach my $file1 (@{ $Digests1{$digest} }) {
            foreach my $file2 (@{ $Digests2{$digest} }) {
                print "Duplicate found: $file1 => $file2\n";
            }
        }
    }

    exit 0;

    #------------------------------------------------
    # find_sizes
    # Given references to filelist and hash,
    # store an array of all regular files with the same size
    # in the hash, keyed on the file size.
    #------------------------------------------------
    sub find_sizes {
        # Pass reference to FileListN array and SizesN hash
        my ($filelist, $sizes) = @_;

        foreach my $file (@$filelist) {
            my @stats = lstat $file;             # lstat the file
            next if ! -f _;                      # Ignore if not a regular file
            push @{ $$sizes{$stats[7]} }, $file; # Save filename with others of same size.
        }
    }

    #------------------------------------------------
    # find_digests
    # Given references to filelist and hash,
    # store an array of all regular files with the same digest
    # in the hash, keyed on the file digest.
    #------------------------------------------------
    sub find_digests {
        # Pass reference to SizedFileListN array and DigestsN hash
        my ($filelist, $digests) = @_;

        foreach my $file (@$filelist) {
            my $dval = calc_digest($file);      # Calculate digest on file.
            next if !defined $dval;             # Skip unreadable files.
            push @{ $$digests{$dval} }, $file;  # Save filename with others of same digest.
        }
    }

    #------------------------------------------------
    # find_duplicate_keys
    # Given references to two hashes,
    # return an array of keys that occur in both hashes.
    #------------------------------------------------
    sub find_duplicate_keys {
        my ($hash1, $hash2) = @_;

        my %seen;
        $seen{$_}++ foreach keys %$hash1;
        $seen{$_}++ foreach keys %$hash2;
        return grep { $seen{$_} >= 2 } keys %seen;
    }

    #------------------------------------------------
    # calc_digest
    # Given a filename,
    # calculate a digest of the content.
    #------------------------------------------------
    sub calc_digest {
        my ($file) = @_;

        my $fh;
        if (!open($fh, "<", $file)) {
            warn("Cannot open $file: $!");
            return undef;
        }
        my $ctx = Digest->new("MD5");    # Choose digest algorithm
        $ctx->addfile($fh);
        close $fh;
        return $ctx->hexdigest;
    }

    #------------------------------------------------
    # populate_file_list1
    # populate_file_list2
    #
    # Fill file list.
    #
    # The code below is for testing only.
    #------------------------------------------------
    sub populate_file_list1 { return glob("find.test.dir/dirA/*"); }
    sub populate_file_list2 { return glob("find.test.dir/dirB/*"); }
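    For a real run, the two populate_ subs could presumably be swapped for the File::Find::Rule calls from the original post, something along these lines ($logShare and $logShare2 are assumed placeholders for the two UNC paths):

    use File::Find::Rule;

    my $logShare  = '//server1/logs';    # assumed UNC path for location 1
    my $logShare2 = '//server2/logs';    # assumed UNC path for location 2

    sub populate_file_list1 { return File::Find::Rule->file()->in($logShare);  }
    sub populate_file_list2 { return File::Find::Rule->file()->in($logShare2); }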