Checksumming dynamically created media (code)

by deprecated (Priest)
on Jul 16, 2001 at 19:48 UTC

deprecated has asked for the wisdom of the Perl Monks concerning the following question:

I have a series of scripts that control our DVD jukebox/burner. When we burn a DVD, we burn two copies of it so we have a "mirror" redundant copy. We burn them in parallel and then checksum them afterwards to make sure that they're both okay. We do this because the burn may fail if the load on the machine (in this case an Ultra 10) exceeds 2.0. Also, bad media is not entirely unheard of. This process was working fine, and took about 2 hours to burn and 2 hours to verify on a 4x burner. However, last night, something came up that showed our process needed tuning.

This is the shell script we're using at the moment:

    DVD=$1
    # Look for the DVD in the list of mounted filesystems.
    B=`/usr/sbin/mount | grep "$DVD on"`
    if [ "$B" = "" ]
    then
        echo "DVD is not mounted. Please mount and then try again"
        exit
    fi
    # Checksum every file in the background; $CHECKDIR is not set in this script.
    nohup find /dvd/$DVD -type f -exec cksum {} \; >$CHECKDIR/cksum.$DVD.dvd &

This, like I said, has been working. The problem arose when one particular DVD contained about 11,000 files. For some reason, cksum(1) is rather slow.

I knew that Perl could do this with Digest::MD5 (it is used in MP3::Napster, which I use a lot). I also figured I could use File::Find to recursively traverse the directories like the find(1) command above is doing. My hope was that the checksum implementation in cksum was slower than the one in Digest::MD5, and also that File::Find was quicker than find(1).

So I haven't benchmarked it yet, but here is the code I intend to use to replace the code we're using:

    #!/usr01/aja96/perl/bin/perl
    use warnings;
    use strict;
    use Carp;
    use File::Find;
    use Digest::MD5 qw{ md5_hex };
    use File::Slurp;

    my $dir   = shift || '.';
    my $debug = '';
    my @cksums;

    sub wanted {
        my $file = $_;
        return if (-d $file);
        carp "cksumming $_ ($file)\n" if $debug;
        my $noodle = read_file( $file );
        # An empty file yields '', which is false, so test definedness rather than truth.
        defined $noodle or croak "$file unreadable: $!\n";
        my %file = (
            name  => $file,
            cksum => md5_hex( $noodle ),
        );
        push @cksums, \%file;
        carp "$file checksummed\n" if $debug;
    }

    find( { wanted => \&wanted, follow_fast => 1 }, $dir );
    print scalar @cksums, " checksums gleaned\n";

So, I have a couple of questions. Ideally, I'd like to checksum the whole volume rather than each and every file. Is this somehow possible? I seem to remember reading somewhere that it was. Also, how can I get more speed out of this? I need to go over 4.4 GB at a time, and it gets rather slow when the file count rises.
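For reference, the script above reads each file fully into memory before hashing it. A minimal sketch of a per-file variant that lets Digest::MD5 read the file in chunks through its addfile method might look like the following. The sub names md5_of_file and checksum_tree are made up for illustration, and this has not been benchmarked against cksum or against the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Return the MD5 hex digest of one file. addfile() reads the handle
# in chunks, so the file is never held in memory as a whole.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}

# Walk $dir and return a hash ref of { path => digest } for plain files.
# (Hypothetical helper; the hash layout is an assumption, not the original's.)
sub checksum_tree {
    my ($dir) = @_;
    my %sums;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                return unless -f $_;    # with no_chdir, $_ is the full path
                $sums{$_} = md5_of_file($_);
            },
        },
        $dir
    );
    return \%sums;
}

# Only do anything when a directory is given on the command line.
if (@ARGV) {
    my $sums = checksum_tree( $ARGV[0] );
    print "$sums->{$_}  $_\n" for sort keys %$sums;
}
```

Run against both mounted copies, the two sorted listings could then be compared once the mount-point prefix is stripped from each path.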

<!- tilly need not reply. -> brother dep

--
Laziness, Impatience, Hubris, and Generosity.

Replies are listed 'Best First'.
Re: Checksumming dynamically created media (code)
by bikeNomad (Priest) on Jul 16, 2001 at 20:07 UTC
    You can checksum the whole volume by not mounting it and just reading the /dev/whatever raw contents as a stream of bytes. This should be much faster than traversing a file structure.

    However, Digest::MD5 is going to be much slower than cksum. It does more work. If you want to generate a CRC in Perl rather than by calling cksum, you might want to look into Compress::Zlib, String::CRC, or String::CRC32, which all provide a CRC function.

Re: Checksumming dynamically created media (code)
by Corion (Patriarch) on Jul 16, 2001 at 20:10 UTC

    As you are burning and checking complete DVDs, you might want to consider simply reading and checking the device instead of the mounted volume. If you have a SCSI DVD drive (likely, considering your Sun hardware), this could be under /dev/sdc or maybe /dev/dvd. Simply opening that device as a file and reading from it could work; at least it works that way under Linux with hard disks.
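A rough sketch of the "open the device as a file and read from it" idea, feeding fixed-size blocks into an incremental Digest::MD5 object so memory use stays at one block. The sub name md5_of_device and the 1 MB default block size are assumptions for illustration, and the device path has to be supplied by the caller since the right name depends on the system. Whether a particular raw device accepts reads of this size is something to verify on the actual hardware.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Stream everything readable from $path through MD5, one block at a time.
# $block_size is optional; 1 MB is an arbitrary default.
sub md5_of_device {
    my ( $path, $block_size ) = @_;
    $block_size ||= 1024 * 1024;

    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;

    my $md5 = Digest::MD5->new;
    my $buffer;
    while (1) {
        my $got = sysread $fh, $buffer, $block_size;
        die "read error on $path: $!" unless defined $got;
        last if $got == 0;          # end of file or device
        $md5->add($buffer);         # update the running digest
    }
    close $fh;
    return $md5->hexdigest;
}

# Print one digest per path given on the command line.
if (@ARGV) {
    print md5_of_device($_), "  $_\n" for @ARGV;
}
```

Calling it once per burned copy and comparing the two hex strings would give a whole-volume check without walking the directory tree.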

      This is a great idea.

      Can I use seek on a raw device? The problem we run into if we're going to read the whole device, of course, is that the system (burly as it may be) does not have 4.5 GB of RAM, and 3.5 GB of swap is unwieldy. So we can't just slurp the whole disk in and checksum it... I was thinking that if we can read in the first gig and the last gig and provide two checksums, we can be reasonably assured that the middle 2.5 gig are okay.

      dep.

      --
      Laziness, Impatience, Hubris, and Generosity.
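The spot-check idea in the post above (checksum one region at the start and one at the end) could be sketched as below, assuming the handle is seekable. Whether a given raw device supports seeking, and whether it needs block-aligned offsets, is not something this sketch can settle; it would have to be tried on the real hardware. The sub name md5_of_range is made up for illustration, and offsets past 2 GB also depend on perl being built with large file support.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use Fcntl qw(SEEK_SET);

# MD5 of up to $length bytes starting at byte $offset of $path.
# Stops early if the end of the file is reached first.
sub md5_of_range {
    my ( $path, $offset, $length ) = @_;

    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;
    defined sysseek( $fh, $offset, SEEK_SET )
        or die "cannot seek to $offset in $path: $!";

    my $md5       = Digest::MD5->new;
    my $remaining = $length;
    my $chunk     = 1024 * 1024;    # arbitrary read size
    my $buffer;
    while ( $remaining > 0 ) {
        my $want = $remaining < $chunk ? $remaining : $chunk;
        my $got  = sysread $fh, $buffer, $want;
        die "read error on $path: $!" unless defined $got;
        last if $got == 0;
        $md5->add($buffer);
        $remaining -= $got;
    }
    close $fh;
    return $md5->hexdigest;
}

# Usage: script PATH OFFSET LENGTH
if ( @ARGV == 3 ) {
    print md5_of_range(@ARGV), "\n";
}
```

Note that a spot check like this says nothing about the bytes in between, so a bad block in the middle of the disc would go unnoticed.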

        Compress::Zlib's crc32 function can update a CRC in pieces; you can call it with an existing CRC to update it. So you read a buffer full at a time and update your CRC:

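        use Compress::Zlib;

        my $crc = 0;
        my $buffer;
        # $fh is assumed to be an already opened IO::Handle on the device.
        # read() needs an explicit length; 64 KB per call is an arbitrary choice.
        while ($fh->read($buffer, 65536)) {
            $crc = Compress::Zlib::crc32($buffer, $crc);
        }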
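Fleshed out into something self-contained, the running-CRC loop above might look like this sketch. The sub name crc32_of_stream and the 64 KB default block size are assumptions for illustration; the two paths on the command line would be whatever device or file names apply on the machine in question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;    # exports crc32()

# Compute a CRC-32 over everything readable from $path, one block at a
# time, passing the running CRC back into crc32() on each call.
sub crc32_of_stream {
    my ( $path, $block_size ) = @_;
    $block_size ||= 64 * 1024;

    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;

    my $crc = 0;
    my $buffer;
    while (1) {
        my $got = read $fh, $buffer, $block_size;
        die "read error on $path: $!" unless defined $got;
        last if $got == 0;
        $crc = crc32( $buffer, $crc );
    }
    close $fh;
    return $crc;
}

# Given two paths, print both CRCs and say whether they agree.
if ( @ARGV == 2 ) {
    my ( $first, $second ) = map { crc32_of_stream($_) } @ARGV;
    printf "%08x  %s\n%08x  %s\n", $first, $ARGV[0], $second, $ARGV[1];
    print $first == $second ? "copies match\n" : "copies DIFFER\n";
}
```

The value is zlib's CRC-32, which is a different algorithm from the one POSIX cksum uses, so it will not match existing cksum output; it is only meaningful when both copies are checked the same way.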
