http://qs1969.pair.com?node_id=775145

Karger78 has asked for the wisdom of the Perl Monks concerning the following question:

I need some help figuring out how I am going to do a file comparison. Here is what my script does: it goes to a log share, finds the files associated with a certain case number, and loads them into an array. It then appends a timestamp to each file name and copies the files to a local PC path as well as a network share path. What I would like to do, as an error check, is compare each original file against the copy on the network path to ensure the files were actually copied, and only then remove the files from the submission site. This check is necessary and critical before removing the original files from the submission share. Does anyone have a good suggestion on how I could compare two files even though the file name has changed? The file size is the same. I had an idea of a bit-by-bit comparison, or something of that nature. Any ideas?

Re: file comparison
by Perlbotics (Archbishop) on Jun 26, 2009 at 19:21 UTC

    You could compute and compare hash digests (e.g. Digest::MD5) of files having identical size. The probability that two different files have the same digest is very low. If you decide on a bit-by-bit comparison, you should do that on rather big chunks of data. See sysread to get an idea.
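    For the bit-by-bit route, a minimal sketch of such a chunked comparison (the function name and the 1 MB chunk size are arbitrary choices, not prescribed):

    use strict;
    use warnings;

    # Compare two files chunk by chunk; returns 1 if identical, 0 otherwise.
    sub files_equal_by_content {
        my ($file1, $file2) = @_;
        return 0 if -s $file1 != -s $file2;    # cheap size check first
        open my $fh1, '<', $file1 or die "cannot open $file1 - $!";
        open my $fh2, '<', $file2 or die "cannot open $file2 - $!";
        binmode $_ for $fh1, $fh2;
        my $chunk = 1024 * 1024;               # read in big chunks, as suggested above
        while (1) {
            # NB: assumes sysread returns full chunks for regular files;
            # short reads at differing offsets would need extra handling.
            my $read1 = sysread $fh1, my $buf1, $chunk;
            my $read2 = sysread $fh2, my $buf2, $chunk;
            die "read error - $!" unless defined $read1 and defined $read2;
            return 0 if $read1 != $read2 or $buf1 ne $buf2;
            last if $read1 == 0;               # both files exhausted
        }
        return 1;
    }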

    Update: As a precaution, you should also take care to really read the files from disk or network rather than from the OS's file cache. I don't know how to do that on Windows, though; *nix has sync.

    Update2: In response to this node below: you could try something along these lines...

    use strict;
    use Digest::MD5;

    sub get_md5 {
        my $file = shift;
        open(my $fh, '<', $file) or die "cannot open $file - $!";
        binmode($fh);
        my $md5 = Digest::MD5->new;
        $md5->addfile($fh);
        close($fh) or die "cannot close $file - $!";
        return $md5->hexdigest;   # TODO: think about caching results...
    }

    sub files_equal_by_md5 {
        my ($file1, $file2) = @_;
        # files differ in size?
        return 0 if (-s $file1 != -s $file2);
        my $digest1 = get_md5($file1);
        my $digest2 = get_md5($file2);
        return $digest1 eq $digest2 ? 1 : 0;
    }

    die "usage: $0 file1 file2\n  compares file1 and file2\n" unless @ARGV == 2;
    print files_equal_by_md5($ARGV[0], $ARGV[1]) ? "files are equal" : "different files", "\n";
    HTH

Re: file comparison
by ramlight (Friar) on Jun 26, 2009 at 19:42 UTC
    If you are doing this a lot with large files, computing a cryptographic hash is the way to go. If, on the other hand, you are:

    1) Working on Windows
    2) Working over network shares
    3) Not doing too much

    then this little snippet I threw together the other day to compare the contents of two shares might help:

    my $share = '\\\\someserver\\someshare\\';
    foreach my $file (@file_list) {
        my $file2 = $share . $file;
        if (-e $file2) {
            my @cmp = `fc $file $file2 | findstr /c:"FC:"`;
            if ($cmp[0]) {
                unless ($cmp[0] =~ /no differences/) {
                    print "$file: $cmp[0]";
                }
            }
            else {
                print "$file:\n";
            }
        }
        else {
            print "$file is missing from $share\n";
        }
    }
      Thanks for your method. I am doing a file compare on Windows. Will your snippet work even if the file names are different but the size is the same?
        Yes. You just have to feed the appropriate file names to the fc command and look at what it gives you back. It so happened that I was comparing two shares to see where they differed, not quite the same as your problem ... but close enough.
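        For instance, with two deliberately different (hypothetical) names, the same pattern still works:

        my @cmp = `fc C:\\logs\\case123.log \\\\server\\share\\case123_20090626.log | findstr /c:"FC:"`;
        print "copies match\n" if $cmp[0] and $cmp[0] =~ /no differences/;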
Re: file comparison
by Karger78 (Beadle) on Jun 26, 2009 at 20:30 UTC
    OK, now if I go with the MD5 way, how could I accomplish this? Here is the code I have started.
    opendir(DIR, $RemoteSubDirectory);
    my @rFileCheck = readdir(DIR);
    closedir(DIR);
    opendir(DIR, $localCpPath);
    my @lFileCheck = readdir(DIR);
    closedir(DIR);
    my $c1;
    foreach (@lFileCheck) {
        print md5_base64($lFileCheck[$c1]);
        print "\n";
        $c1++;
    }
    my $c2;
    foreach (@rFileCheck) {
        print md5_base64($rFileCheck[$c1]);
        print "\n";
        $c2++;
    }
      You have a ways to go yet. Your foreach loops won't do what you want, for a few different reasons:
      • The parameter you pass to md5_base64 needs to be the data in the file, not the name of the file.
      • When you read a file to get its md5, you need to include the path name with the file name (because readdir only returns the file name, not the path).
      • Rather than just printing the md5s, you should store them, compare them, and print (and/or act on) the results of the comparisons.

      Apart from that, your loop usage could be a little better. Also, I think it ends up being easier to use the "object" style interface to Digest::MD5. You still have to open each file, but then you can just pass the file handle to the module.

      Here's an approach that includes checking file size in combination with the md5 checksum, and reports 3 different problem cases that might come up:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Digest::MD5;

      die "Usage: $0 remoteDir localDir\n"
          unless ( @ARGV == 2 and -d $ARGV[0] and -d $ARGV[1] );

      my ( $remote, $local ) = @ARGV;
      my %md5;
      my $digest = Digest::MD5->new();

      for my $dir ( $local, $remote ) {
          opendir DIR, $dir or die "$dir: $!\n";
          while ( my $f = readdir( DIR )) {
              next unless -f "$dir/$f";
              if ( open( my $fh, "<", "$dir/$f" )) {
                  $digest->new;    # calling new on the instance resets its state
                  $digest->addfile( $fh );
                  $md5{$f}{$dir} = join( " ", -s _, $digest->b64digest );
              }
              else {
                  warn "Open failed for $dir/$f: $!\n";
              }
          }
      }

      for my $file ( sort keys %md5 ) {
          if ( $md5{$file}{$remote} and ! $md5{$file}{$local} ) {
              warn sprintf( "%s: found on remote, not found in local path\n", $file );
          }
          elsif ( $md5{$file}{$remote} ne $md5{$file}{$local} ) {
              warn sprintf( "%s: remote/local difference: %s vs. %s\n",
                            $file, $md5{$file}{$remote}, $md5{$file}{$local} );
          }
          else {
              unlink "$remote/$file"
                  or warn sprintf( "%s: unable to delete remote copy: %s\n", $file, $! );
          }
      }
      (updated to remove an unnecessary "next" from the latter for loop. Also added error checking when reading the files for their md5s.)
      One thing to consider in the code below is that a "directory" is a file; this means "." and ".." are files too! I congratulate you on using readdir rather than "globbing"; this is much more portable and is the right way to go. I would put in a grep to filter down to the "real" files:
      opendir(DIR, $RemoteSubDirectory);
      my @rFileCheck = grep { -f "$RemoteSubDirectory/$_" } readdir(DIR);
      Remember that readdir only gives file names and you have to add the path... Your code:
      opendir(DIR, $RemoteSubDirectory);
      my @rFileCheck = readdir(DIR);
      closedir(DIR);
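      Putting both points together, a sketch (assuming $RemoteSubDirectory is defined as in your code) that filters to plain files and attaches the path up front:

      opendir(DIR, $RemoteSubDirectory) or die "cannot open $RemoteSubDirectory - $!";
      my @rFileCheck = map  { "$RemoteSubDirectory/$_" }       # keep full paths
                       grep { -f "$RemoteSubDirectory/$_" }    # plain files only
                       readdir(DIR);
      closedir(DIR);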
Re: file comparison
by Marshall (Canon) on Jun 27, 2009 at 03:24 UTC
    I would recommend setting up an SSL (Secure Sockets Layer) link. One link with some definitions is: http://en.wikipedia.org/wiki/Secure_Sockets_Layer

    There are many ways of doing this. The main point is that your thinking (algorithm) is not correct. There will essentially be an encrypt function and a decrypt function.

    A layer on top of that is a check, by some protocol, that the number of bytes sent == the number of bytes received. This is like receiving a binary file with a "checksum".

    By definition of an encrypted link, there is NO comparison of sent byte 1 to received byte 1. I.e., I send you a "secret message", you get it and decrypt it, but there is never any discussion about whether my char 3 is the same as the char 3 you sent. In fact, I may even send more encrypted bits (or even fewer) than you gave me to encrypt to begin with!

    Typically the encrypted message is "transported" via standard protocols that report the number of bytes sent, etc. You "got" the encrypted message if: 1) the transport layer says it is OK (the number of bytes sent and received are the same, with a reasonable probability that no transmission error occurred), and 2) the decrypted values "make sense".
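    As an illustration of that kind of check in Perl (the file names and the helper are hypothetical, with MD5 standing in for whatever checksum the transport uses): the sender records the byte count and a digest, and the receiver recomputes both after the transfer.

    use strict;
    use warnings;
    use Digest::MD5;

    # Summarize a file as byte count plus digest for post-transfer verification.
    sub transfer_summary {
        my ($file) = @_;
        open my $fh, '<', $file or die "cannot open $file - $!";
        binmode $fh;
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        return { bytes => -s $file, md5 => $md5 };
    }

    my $sent = transfer_summary('outgoing.bin');    # sender side
    # ... transfer the file by whatever (possibly encrypted) means ...
    my $got  = transfer_summary('incoming.bin');    # receiver side
    die "transfer failed verification\n"
        unless $got->{bytes} == $sent->{bytes} and $got->{md5} eq $sent->{md5};
    print "transfer verified\n";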

Re: file comparison
by Anonymous Monk on Jun 26, 2009 at 19:32 UTC
    Can't you just use cmp on *nix systems?
      What about using md5?
        Yes, that works too, but as Perlbotics said, there is a chance of finding different files with the same digest (hash).
        Maybe you could use a bigger digest, like SHA-1.
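        Along those lines, a minimal sketch using the core Digest::SHA module (SHA-1 here; passing 'sha256' would give a still larger digest):

        use strict;
        use warnings;
        use Digest::SHA;

        # Compute the SHA-1 digest of a file, read in binary mode.
        sub sha1_of_file {
            my ($file) = @_;
            my $sha = Digest::SHA->new('sha1');
            $sha->addfile($file, 'b');    # 'b' = binary-mode read
            return $sha->hexdigest;
        }

        print sha1_of_file($ARGV[0]), "\n";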