myforwik has asked for the wisdom of the Perl Monks concerning the following question:

What is the easiest way to hash only the first 1MB of each of a large number of huge files? I have been using Digest::MD5::File. Is there something like it that lets you specify the maximum amount to read from each file?

Re: Hash only portion of a file
by Kenosis (Priest) on Jan 19, 2014 at 03:05 UTC

    Another option is to set local $/ = \1_048_576; so that a single read returns the first 1MB:

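    use strict;
    use warnings;
    use Digest::MD5;

    my $md5_hex_1m = file_md5_hex_1m('File.txt');
    print "$md5_hex_1m\n";

    sub file_md5_hex_1m {
        my ($file) = @_;

        # A reference to an integer makes <$fh> return records of at most
        # that many bytes instead of lines.
        local $/ = \1_048_576;

        open my $fh, '<', $file or die "Cannot open $file: $!";
        binmode $fh;    # hash raw bytes, not decoded or translated text

        # An empty file yields undef; hash the empty string in that case.
        my $contents = <$fh>;
        $contents = '' unless defined $contents;
        return Digest::MD5->new->add($contents)->hexdigest;
    }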
      Can you please explain this? I was under the impression that $/ only matches a character or a string of characters. How does it match a count of 1 million characters?

        That's the way it works. Please see $/ in perlvar (also known as $INPUT_RECORD_SEPARATOR and $RS). See also
            perldoc -v "$/"
        from the (update: Windoze) command line; use single quotes in *nix (update: but see choroba's reply below).

        "Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer number of characters"

        This is really cool. I also didn't know it, because I didn't read the whole manpage ;-(

        Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»
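
For anyone who wants to see the record-reading behaviour from the perlvar passage quoted above, here is a minimal sketch. It reads from an in-memory filehandle so no real file is needed; read_records is a made-up name for this illustration, not something from perlvar or Digest::MD5.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into records of at most $size
# characters by reading it through an in-memory filehandle.
sub read_records {
    my ($data, $size) = @_;

    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

    # A reference to a scalar holding an integer switches <$fh> from
    # line mode to fixed-size record mode.
    local $/ = \$size;

    my @records = <$fh>;    # list context: read every record
    close $fh;
    return @records;
}

my @records = read_records("abcdefghij", 4);
print join('|', @records), "\n";    # prints abcd|efgh|ij
```

The last record is shorter than the rest because the size is only a maximum, which is why a single read of a file smaller than 1MB simply returns the whole file.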

Re: Hash only portion of a file
by myforwik (Acolyte) on Jan 19, 2014 at 02:08 UTC
    I ended up just making my own version of the file_md5_hex that I was using:
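    sub file_md5_hex_1m {
        my ($file, $bn, $ut) = @_;    # $bn and $ut are not used below

        # $getfh is not defined in this snippet; presumably it is the
        # file-opening helper from the code this was adapted from.
        my $fh  = $getfh->($file) or return;
        my $cnt = 0;
        my $md5 = Digest::MD5->new();
        my $buf;

        # Hash at most 1024 chunks of 1024 bytes, i.e. the first 1MB.
        while (my $l = read($fh, $buf, 1024)) {
            $md5->add($buf);
            $cnt++;
            if ($cnt == 1024) { last; }
        }
        return $md5->hexdigest;
    }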
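
A self-contained variant of the same idea, in case it helps anyone: this is only a sketch, with a plain open and binmode standing in for $getfh, and with the byte limit passed as a parameter. The name file_md5_hex_head, the $limit argument, and the 64KB chunk size are my own assumptions, not part of Digest::MD5::File.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: hex MD5 of at most $limit leading bytes of $file.
# $limit defaults to 1MB (an assumption made for this sketch).
sub file_md5_hex_head {
    my ($file, $limit) = @_;
    $limit = 1_048_576 unless defined $limit;

    open my $fh, '<', $file or die "Cannot open $file: $!";
    binmode $fh;    # hash raw bytes

    my $md5       = Digest::MD5->new;
    my $remaining = $limit;
    while ($remaining > 0) {
        # Read in chunks of up to 64KB, never past the limit.
        my $want = $remaining < 65_536 ? $remaining : 65_536;
        my $got  = read($fh, my $buf, $want);
        die "Read error on $file: $!" unless defined $got;
        last if $got == 0;    # end of file reached before the limit
        $md5->add($buf);
        $remaining -= $got;
    }
    close $fh;
    return $md5->hexdigest;
}

# Usage (assumes the file exists):
#   print file_md5_hex_head('File.txt'), "\n";          # first 1MB
#   print file_md5_hex_head('File.txt', 4096), "\n";    # first 4KB
```

Counting bytes actually read, instead of counting loop iterations, keeps the limit exact even if a read returns fewer bytes than requested.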
Re: Hash only portion of a file
by Jim (Curate) on Jan 19, 2014 at 16:33 UTC

    An earnest question, myforwik: For what application is it useful to compute the digests (hashes) of arbitrary small portions of large digital objects? In what context are the hashes of just the first megabyte of multiple files meaningful and helpful?

    If you're doing this for the purpose of file deduplication, you may want to consider other strategies besides hashing just parts of files.
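
    One commonly used strategy, offered here only as a sketch and as an assumption about what might suit your case: group the files by size first, then compute full digests only for files that share a size with at least one other file. The name find_duplicates is made up for this illustration.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: return a list of array refs, each holding the
# paths of files whose size and full MD5 digest are identical.
sub find_duplicates {
    my @files = @_;

    # Stage 1: group by size. A file with a unique size cannot have a
    # duplicate, so it never needs to be read at all.
    my %by_size;
    for my $file (@files) {
        next unless -f $file;
        my $size = (stat $file)[7];
        push @{ $by_size{$size} }, $file;
    }

    # Stage 2: hash whole files, but only within same-size groups.
    my %by_digest;
    for my $size (keys %by_size) {
        my @group = @{ $by_size{$size} };
        next if @group < 2;
        for my $file (@group) {
            open my $fh, '<', $file or die "Cannot open $file: $!";
            binmode $fh;
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            push @{ $by_digest{"$size:$digest"} }, $file;
        }
    }
    return grep { @$_ > 1 } values %by_digest;
}
```

    The size check costs only a stat call per file, so most of the huge files are never read.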

    Jim

      I doubt it has any use - except for testing. I wanted my script to run faster while testing instead of hashing entire files.