idnopheq has asked for the wisdom of the Perl Monks concerning the following question:

How do I verify that a copied file's contents did not corrupt/modify?

I need something more concrete than File::Compare or diff or fc.exe, for reasons too inane to delve into. Just know it is political. Anyway, I was thinking of something like an MD5 hash, and I tried using Digest::MD5 to do it. But the hash returned is different, even though File::Compare and diff report no differences (on flat text, i.e. 'Hello, World'). Must it get a seed from stat or the directory or something? Or am I doing something wrong, like hammering a screw?

The issue stems from a problem with a few boxes I help administer. When copying files over NFS to one host, or via SMB from either Unix (Samba) or NT (Exceed) to the other, the destination file occasionally ends up with the correct file size but with white space replacing large chunks of the file. 95% of the various and mixed boxes exhibit no issues.

The customer (and I, for my own sanity) wants something akin to a CRC or checksum involved in the copy process. See my sample below.

THX
Dex

#!/usr/bin/perl -w
# -*-Perl-*-
use strict;
use FileHandle;
use Digest::MD5;

my $sourceFile = $ARGV[0];
my $destFile = $ARGV[1];
my $inFile = new FileHandle;
my $outFile = new FileHandle;
my $inMD5 = Digest::MD5->new;
my $outMD5 = Digest::MD5->new;
my ( $fileLength, $fileBuffer, $fileOffset );

$inFile->open ( "<$sourceFile" ) or die "Could not open $sourceFile:$!\n";
$inMD5->addfile ( $inFile );
$outFile->open ( ">$destFile" ) or die "Could not open $destFile:$!\n";
print $inMD5->md5_base64 , "\n";

# borrowed from "Programming Perl"
my $blockSize = ( stat $inFile )[11] || 16384;
while ( $fileLength = sysread $inFile, $fileBuffer, $blockSize ) {
    if ( !defined $fileLength ) {
        next if $! =~ /^Interrupted/;
        die "System read error: $!\n";
    }
    my $fileOffset = 0;
    while ( $fileLength ) {
        my $written = syswrite $outFile, $fileBuffer, $fileLength, $fileOffset;
        die "System write error: $!\n" unless defined $written;
        $fileLength -= $written;
        $fileOffset += $written;
    };
}
$outMD5->addfile ( $outFile );
print $outMD5->md5_base64 , "\n";
$inFile->close;
$outFile->close;

2001-03-14 Edit by Corion: Moved the explanation of the problem up from a reply into the root node.

Replies are listed 'Best First'.
Re: How do I verify that a copied file's contents did not corrupt/modify?
by AgentM (Curate) on Mar 14, 2001 at 07:08 UTC
    You can easily confirm that the two files are the same by using File::Compare.
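For readers who have not used it, here is a minimal sketch of the File::Compare approach. The helper `make_temp` and the sample strings are made up for illustration; only `compare` comes from File::Compare. The second copy imitates the symptom described in the question: same length, but part of the content replaced by white space.

```perl
use strict;
use warnings;
use File::Compare qw(compare);
use File::Temp qw(tempfile);

# Hypothetical helper: write $data to a fresh temporary file, return its path.
sub make_temp {
    my ($data) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $data;
    close $fh or die "close failed: $!";
    return $path;
}

my $orig = make_temp("Hello, World\n");
my $good = make_temp("Hello, World\n");
my $bad  = make_temp("Hello," . (" " x 6) . "\n");   # same size, blanked bytes

# compare() returns 0 if the files are equal, 1 if they differ, -1 on error.
for my $copy ($good, $bad) {
    my $rc = compare($orig, $copy);
    die "compare failed: $!" if $rc < 0;
    print $rc == 0 ? "identical\n" : "different\n";
}
```

Note that a size check alone would not catch the second case, which is why a byte-for-byte comparison or a digest is needed.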
Re: How do I verify that a copied file's contents did not corrupt/modify?
by grinder (Bishop) on Mar 14, 2001 at 13:05 UTC
    Off the top of my head, if you're not worried about exactly which bytes differ, I would take the MD5 digests of the files and see if these differ. Something like the following will work:
    #! /usr/bin/perl -w
    use strict;
    use Digest::MD5;

    my $orig = shift or die "no original given\n";
    my $dup = shift or die "no copy given\n";

    # "or" rather than "||": "||" binds to the filename, so a failed open
    # would never reach the die
    open ORIG, $orig or die "Cannot open $orig for input: $!\n";
    binmode ORIG;    # hash the raw bytes (matters on Win32)
    my $origmd5 = Digest::MD5->new;
    my $origdig = $origmd5->addfile( *ORIG )->digest;
    close ORIG;

    open DUP, $dup or die "Cannot open $dup for input: $!\n";
    binmode DUP;
    my $dupmd5 = Digest::MD5->new;
    my $dupdig = $dupmd5->addfile( *DUP )->digest;
    close DUP;

    print( $origdig ne $dupdig ? 'not ' : '', "ok\n" );
      Tried your code and I am seeing the same thing, even on the localhost's own file system. Interestingly, I tried it on two separate copies, /opt/hello and /tmp/hello, and those two DO compare via Digest::MD5! Neither's hash matches that of the original. This is true on W2K, GNU/Linux, and Solaris, perl 5.6.0.
Re: How do I verify that a copied file's contents did not corrupt/modify?
by idnopheq (Chaplain) on Mar 14, 2001 at 08:50 UTC

    2001-03-14 Edit by Corion : The content of this node was moved into the root node.

      You might wanna add some sysseeks to make sure Digest::MD5 reads the whole file. Moreover, you have the outfile opened for writing only (prepend a plus to the mode in the open string for destFile).

      Well, that doesn't explain the outcome of the comparison. (have to run....hint: OOP versus function)
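A sketch of what the "OOP versus function" hint presumably points at: Digest::MD5 offers plain functions (`md5`, `md5_hex`, `md5_base64`) that take data, and object methods (`digest`, `hexdigest`, `b64digest`) that return the digest of whatever was added to the context. Calling a function with method syntax hands it the object as its data. The variable names below are illustrative only.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_base64);

my $data = "Hello, World\n";

# Two independent contexts fed the same bytes, as in the copy script.
my $in  = Digest::MD5->new;
my $out = Digest::MD5->new;
$in->add($data);
$out->add($data);

# Calling the *function* md5_base64 as a method passes the object itself as
# the data argument, so what gets hashed is the object's stringification
# (something like "Digest::MD5=SCALAR(0x...)"), not the bytes added above.
my $wrong_in  = $in->md5_base64;
my $wrong_out = $out->md5_base64;

# The object-oriented accessor returns the digest of the added data.
my $right_in  = $in->b64digest;
my $right_out = $out->b64digest;

print "function-as-method: $wrong_in vs $wrong_out\n";
print "b64digest:          $right_in vs $right_out\n";
```

Because the stringification includes a memory address, the function-as-method results depend on the object rather than on the file, which would account for digests that disagree even when the contents agree.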

      Jeroen
      "We are not alone"(FZ)

      Update: I modified your code, and it works. Tested on linux. BTW, yer original code MD'd the objects!

      #!/usr/bin/perl -w
      # -*-Perl-*-
      use strict;
      use FileHandle;
      use Digest::MD5;

      my $sourceFile = $ARGV[0];
      my $destFile = $ARGV[1];
      my $inFile = new FileHandle;
      my $outFile = new FileHandle;
      my $inMD5 = Digest::MD5->new;
      my $outMD5 = Digest::MD5->new;
      my ( $fileLength, $fileBuffer, $fileOffset );

      $inFile->open ( "<$sourceFile" ) or die "Could not open $sourceFile:$!\n";
      $inMD5->addfile ( $inFile );
      print $inMD5->b64digest , "\n";
      $outFile->open ( "+>$destFile" ) or die "Could not open $destFile:$!\n";

      # borrowed from "Programming Perl"
      die "Could not rewind $sourceFile: $!" unless defined sysseek $inFile, 0, 0;
      my $blockSize = ( stat $inFile )[11] || 16384;
      while ( $fileLength = sysread $inFile, $fileBuffer, $blockSize ) {
          if ( !defined $fileLength ) {
              next if $! =~ /^Interrupted/;
              die "System read error: $!\n";
          }
          my $fileOffset = 0;
          while ( $fileLength ) {
              my $written = syswrite $outFile, $fileBuffer, $fileLength, $fileOffset;
              die "System write error: $!\n" unless defined $written;
              $fileLength -= $written;
              $fileOffset += $written;
          };
      }
      die "Could not rewind $destFile: $!" unless defined sysseek $outFile, 0, 0;
      $outMD5->addfile ( $outFile );
      print $outMD5->b64digest , "\n";
      $inFile->close;
      $outFile->close;
      Moreover, the documented behaviour of an automagical reset right after the digest seems not to work, at least with b64digest on my system. Something to watch for.
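A small check anyone can run on their own installation to see whether the context is reset after a digest call. As I read the Digest::MD5 documentation, `digest`, `hexdigest` and `b64digest` are read-once operations, so a second call on the same object should return the digest of the empty string. This is only a sketch; results on a 2001-era module version may differ, as the post above suggests.

```perl
use strict;
use warnings;
use Digest::MD5;

my $ctx = Digest::MD5->new;
$ctx->add("Hello, World\n");

my $first  = $ctx->b64digest;    # digest of the data added above
my $second = $ctx->b64digest;    # context should have been reset by now

# Digest of no data at all, from a brand-new context, for comparison.
my $empty = Digest::MD5->new->b64digest;

print "first:  $first\n";
print "second: $second\n";
print $second eq $empty
    ? "context was reset after the first digest\n"
    : "context was NOT reset\n";
```

If the reset works, reusing one object for several files is safe as long as a digest is taken between files; otherwise create a fresh object per file, as the scripts in this thread do.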
        Works like a charm! A few questions to help illuminate my mind, if I may.

        I'm looking at the perlfunc entry for open, and I remain a little hazy regarding the subtle difference between ">" truncating and "+>" clobbering. I'm digging through the holy texts and references now, but if you care to pass more wisdom my way ...

        The sysseek is a nice touch. I take it that since $outFile was not at the beginning of the file when I passed the filehandle to Digest::MD5, it somehow changed the hashing algorithm's output? Like only hashing from the current position to the end?

        THX!
        Dex
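A sketch bearing on both questions, under the assumption that `addfile` simply reads from the handle's current position until end of file. Per perlfunc, ">" opens write-only and truncates, while "+>" also truncates but opens the handle for reading as well, which is what lets the same handle be read back after writing. The file name comes from File::Temp and is illustrative only.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Temp qw(tempfile);

my $data = "Hello, World\n";

# Get a scratch path, then reopen it the way the fixed script does.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# "+>" truncates like ">", but the handle can also be read from.
open my $fh, '+>', $path or die "Cannot open $path: $!\n";
binmode $fh;
print {$fh} $data;

# After writing, the position is at end of file. Seeking to EOF explicitly
# flushes the buffer and makes that position unambiguous.
seek $fh, 0, 2 or die "seek failed: $!\n";
my $at_eof = Digest::MD5->new->addfile($fh)->hexdigest;

# Rewind, then hash again: now the whole file is read.
seek $fh, 0, 0 or die "seek failed: $!\n";
my $from_start = Digest::MD5->new->addfile($fh)->hexdigest;
close $fh;

print "at EOF matches empty input: ", ($at_eof eq md5_hex("") ? "yes" : "no"), "\n";
print "rewound matches the data:   ", ($from_start eq md5_hex($data) ? "yes" : "no"), "\n";
```

So the algorithm itself is unchanged by the file position; it is just handed fewer bytes (here, none) when the handle is not rewound first.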

Re: How do I verify that a copied file's contents did not corrupt/modify?
by Anonymous Monk on Mar 14, 2001 at 21:44 UTC
    /usr/bin/md5sum

    Just use that. Let it worry about details. The -c flag is especially useful.

    Don't reinvent the wheel :)
      Little after-the-fact on my part, but that won't do for the NT/2K hosts, unless anyone knows of a port to Win32. Of course, one ~could~ do a pure-perl version and offer it in humble homage to the PPT project. Hmmmm.

      HTH
      --
      idnopheq
      Apply yourself to new problems without preparation, develop confidence in your ability to meet situations as they arise.
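For hosts without md5sum, a rough sketch of a portable stand-in built on Digest::MD5. The sub name `file_md5_hex` is made up for this example, and the output layout (digest, two spaces, file name) is meant to resemble what GNU md5sum prints; it has not been checked against `md5sum -c` on every platform.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: hex MD5 of a file's raw bytes.
sub file_md5_hex {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    binmode $fh;    # no CRLF translation, so Win32 and Unix agree
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}

# One line per file named on the command line.
print file_md5_hex($_), "  $_\n" for @ARGV;
```

Running it on the source host and again on the destination host, then comparing the two listings, gives the checksum step the customer asked for without depending on a native md5sum binary.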