2ge has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I ran into a problem while trying to convert some Python code to Perl. It calculates a custom hash of a file; in general it should do the following: file size + 64-bit checksum of the first and last 64 KB (even if they overlap because the file is smaller than 128 KB). Source code for other languages is here. I am not good at bit operations, pack, unpack and uint64, but I did try, and here is what I came up with:

#!/usr/bin/perl -w
use strict;

die "need file as parameter" unless my $file = $ARGV[0];
open(my $fh, $file) or die "Can't open file $file: $!";
die "File $file is too small!" if -s $file < 65536;

my $hashstring = -s $file;

# calculate first 64kb
for (1 .. 65536/8) {
    read($fh, my $byte, 4);
    my $ll = unpack("LL", $byte);
    # here is missing code
}

# calculate last 64kb
seek $fh, -65536, 2;
for (1 .. 65536/8) {
    read($fh, my $byte, 4);
    my $ll = unpack("LL", $byte);
    # here is missing code
}

close($fh);
printf("Hash of $file is: %016x", $hashstring);

Please help me get the same hash using Perl. Thank you, monks!

Replies are listed 'Best First'.
Re: Creating custom hash of file
by grinder (Bishop) on Mar 18, 2007 at 11:22 UTC

    Judging by the linked-to code, the hash is just summing the ASCII values of each byte. Translating this to fairly idiomatic Perl, I come up with the following:

    use strict;
    use warnings;
    use Fcntl ':seek';
    use constant CHUNK => 65536;

    sub calc {
        my $file = shift or die "no filename given\n";
        my $hash = -s $file;    # the hash starts out as the file size
        my $chunk = CHUNK;
        $hash < $chunk and die "$file is too small ($hash bytes < $chunk)\n";
        open my $in, '<', $file or die "Cannot open $file for input: $!\n";
        binmode $in;            # raw bytes; matters on Windows
        local $/ = \$chunk;     # readline returns fixed-size records of $chunk bytes
        $hash += ord($_) for split //, readline($in);
        seek($in, -$chunk, SEEK_END);    # argument order is (handle, offset, whence)
        $hash += ord($_) for split //, readline($in);
        close $in;
        return sprintf '%016x', $hash;
    }

    print calc($_), "\t$_\n" for @ARGV;

    ...but, as GrandFather points out, without a test file and expected results, it's hard to know if there are any bugs in this.

    update: I see you've added a test case. And I see that it's actually summing 32-bit values. The following code reflects that fact.

    use strict;
    use warnings;
    use Fcntl ':seek';
    use constant CHUNK => 65536;

    sub calc {
        my $file = shift or die "no filename given\n";
        my $hash = -s $file;    # the hash starts out as the file size
        my $chunk = CHUNK;
        $hash < $chunk and die "$file is too small ($hash bytes < $chunk)\n";
        open my $in, '<', $file or die "Cannot open $file for input: $!\n";
        binmode $in;            # raw bytes; matters on Windows
        local $/ = \$chunk;     # readline returns fixed-size records of $chunk bytes
        $hash += $_ for unpack 'N*', readline($in);
        seek($in, -$chunk, SEEK_END);    # argument order is (handle, offset, whence)
        $hash += $_ for unpack 'N*', readline($in);
        close $in;
        return sprintf '%016x', $hash;
    }

    # build the test file: 256 KB of spaces
    my $test = 'test.deleteme';
    open my $out, '>', $test or die "Cannot create $test: $!\n";
    print $out ' ' x (65536*4);
    close $out;
    print calc($test), $/;
    unlink $test;

    Hmm, but that doesn't match the test results. Back to the drawing board.

    2nd update: Got it! It's summing quads, not longs, and wrapping around on overflow (that is, the sum is taken modulo 2**64).

    I had to use bigint, which is sub-optimal in that it gets the correct results at the cost of vastly increased execution time. Still, it does the job, and on a 64-bit CPU you should be able to remove the 'use bigint' and it will work just the same, and much, much faster.

    use strict;
    use warnings;
    use Fcntl ':seek';
    use bigint;
    use constant CHUNK => 65536;

    sub calc {
        my $file = shift or die "no filename given\n";
        my $hash = -s $file;    # the hash starts out as the file size
        my $chunk = CHUNK;
        $hash < $chunk and die "$file is too small ($hash bytes < $chunk)\n";
        open my $in, '<', $file or die "Cannot open $file for input: $!\n";
        binmode $in;            # raw bytes; matters on Windows
        local $/ = \$chunk;     # readline returns fixed-size records of $chunk bytes
        for my $quad (unpack 'q*', readline($in)) {
            $hash += $quad;
            $hash &= 2 ** 64 - 1;    # keep only the low 64 bits
        }
        seek($in, -$chunk, SEEK_END);    # argument order is (handle, offset, whence)
        for my $quad (unpack 'q*', readline($in)) {
            $hash += $quad;
            $hash &= 2 ** 64 - 1;
        }
        close $in;
        return sprintf '%016x', $hash;
    }

    # build the test file: 256 KB of spaces
    open my $out, '>', 'test.deleteme' or die "Cannot create test.deleteme: $!\n";
    print $out ' ' x (65536*4);
    close $out;
    print calc('test.deleteme'), $/;

    • another intruder with the mooring in the heart of the Perl

      "I had to use bigint, which is sub-optimal in that it gets the correct results at the cost of vastly increased execution time."
      Only if you don't use an XS library to do the calculations. Math::BigInt supports several, including Bit::Vector (Math::BigInt::BitVect), Math::Pari (Math::BigInt::Pari), and Math::GMP (Math::BigInt::GMP).
      "Judging by the linked-to code, the hash is just summing the ASCII values of each byte."
      Hmm, I thought unpack supported braindead checksumming out of the box... something to do with a "%" sign, apparently. It might suffice to read a 64k block and apply unpack to it, with the proper format.
        The unpack format would have to be "%64q*". That still requires 64-bit support from Perl.

        Anno
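For readers who have not met the "%" prefix: a minimal sketch of unpack's built-in checksum mode, kept to 32-bit sums so that it should also run on a perl without quad support. The helper names `byte_sum32` and `word_sum32` are made up for this illustration, and this is not the 64-bit hash asked about, only the mechanism.

```perl
use strict;
use warnings;

# Sum of all byte values in $data, modulo 2**32.
sub byte_sum32 {
    my ($data) = @_;
    return unpack '%32C*', $data;
}

# Sum of all little-endian 32-bit words in $data, modulo 2**32.
sub word_sum32 {
    my ($data) = @_;
    return unpack '%32V*', $data;
}

print byte_sum32('abc'), "\n";                        # 97 + 98 + 99 = 294
print word_sum32(pack 'V*', 0xffffffff, 2), "\n";     # wraps around to 1
```

The number after "%" is the width of the checksum in bits, so "%64q*" is the same idea applied to quads, with the 64-bit requirement noted above.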

      Grinder,

      thank you very much for your code, but on my system (Win XP, 32-bit), perl doesn't recognize quads :( I tried with LL, but it computes really slowly, and I still get '00000000ffffffff'. Could you please improve your code so that it doesn't use quads and bigint?

        Hmm, well, it's going to be hard to get rid of 'bigint' if you want 64-bit arithmetic to work correctly :(

        If your perl doesn't know about pack 'q', then you can roll it yourself with the following, but it will be even slower:

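        use strict;
        use warnings;
        use Fcntl ':seek';
        use bigint;
        use constant CHUNK => 65536;

        sub calc {
            my $file = shift or die "no filename given\n";
            my $hash = -s $file;    # the hash starts out as the file size
            my $chunk = CHUNK;
            $hash < $chunk and die "$file is too small ($hash bytes < $chunk)\n";
            open my $in, '<', $file or die "Cannot open $file for input: $!\n";
            binmode $in;            # raw bytes; matters on Windows
            local $/ = \$chunk;     # readline returns fixed-size records of $chunk bytes
            # 'V' is an explicitly little-endian 32-bit word; each quad is two of
            # them, low word first, so the second word of a pair is the high half
            my @val = unpack 'V*', readline($in);
            for (my $j = 0; $j < $#val; $j += 2) {
                $hash += ($val[$j+1] << 32) + $val[$j];
                $hash &= 2 ** 64 - 1;    # keep only the low 64 bits
            }
            seek($in, -$chunk, SEEK_END);    # argument order is (handle, offset, whence)
            @val = unpack 'V*', readline($in);
            for (my $j = 0; $j < $#val; $j += 2) {
                $hash += ($val[$j+1] << 32) + $val[$j];
                $hash &= 2 ** 64 - 1;
            }
            close $in;
            return sprintf '%016x', $hash;
        }

        # build the test file: 256 KB of spaces
        open my $out, '>', 'test.deleteme' or die "Cannot create test.deleteme: $!\n";
        print $out ' ' x (65536*4);
        close $out;
        print calc('test.deleteme'), $/;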

        Perl's probably not the best language for this. The code is much shorter than the other languages, which is usually the case for a given algorithm, but the performance is horrible. You really want to do yourself a favour and run this stuff on a 64 bit architecture.

        Maybe there's another monk who's into numerical analysis and can spot an insight, but it's beyond my ken.

        • another intruder with the mooring in the heart of the Perl
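One possible way around both quads and bigint, offered as a sketch rather than a tested replacement: sum the low and high 32-bit halves of each quad separately and fold the carry in once at the end. It assumes the quads are little-endian (as on x86), that the file is at least 64 KB, and that plain Perl numbers stay exact up to 2**53, which holds for the usual double-precision floats; 16384 words of at most 32 bits each sum to well under that. The names `sum_halves` and `hash64_halves` are invented for this example.

```perl
use strict;
use warnings;
use Fcntl ':seek';

use constant CHUNK => 65536;
use constant TWO32 => 4294967296;    # 2**32

# Add up the low and the high 32-bit half of every little-endian quad in $data.
sub sum_halves {
    my ($data) = @_;
    my ($lo, $hi) = (0, 0);
    my @words = unpack 'V*', $data;
    for (my $i = 0; $i < $#words; $i += 2) {
        $lo += $words[$i];        # low word comes first
        $hi += $words[$i + 1];
    }
    return ($lo, $hi);
}

sub hash64_halves {
    my ($file) = @_;
    my $size = -s $file;
    die "$file is missing or too small\n" if !defined $size or $size < CHUNK;
    open my $in, '<', $file or die "Cannot open $file: $!\n";
    binmode $in;                  # raw bytes; matters on Windows
    my ($lo, $hi) = ($size, 0);   # the hash starts out as the file size
    for my $pos (0, $size - CHUNK) {
        seek $in, $pos, SEEK_SET or die "Cannot seek in $file: $!\n";
        my $buf;
        my $got = read $in, $buf, CHUNK;
        die "Short read on $file\n" unless defined $got && $got == CHUNK;
        my ($l, $h) = sum_halves($buf);
        $lo += $l;
        $hi += $h;
    }
    close $in;
    # fold the overflow of the low half into the high half,
    # then drop whatever overflows the high half (wrap modulo 2**64)
    my $carry = int($lo / TWO32);
    $lo -= $carry * TWO32;
    $hi += $carry;
    $hi -= int($hi / TWO32) * TWO32;
    return sprintf '%08x%08x', $hi, $lo;
}

print hash64_halves($_), "\t$_\n" for @ARGV;
```

Since only two running totals are kept per chunk and no big-number objects are created, this should be considerably faster than the bigint versions, though that is an expectation and not something measured here.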

        ... in my system (win xp, 32 bit), perl doesn't recognize quads :( I tried with LL, but it computes really slow, and I still get '00000000ffffffff' ...

        So, what does this say about the C implementation that you pointed to on that wiki page? Does the same machine run the C version reasonably fast? If so, then there must be some way that the C library on that machine is able to do 64-bit arithmetic without being reduced to a crawl.

        I'm also confused because the OP starts by talking about converting python code to perl. How long does it take for the python version to run on this same machine?

        Have you checked the output of "perl -V" on your machine, and does that include something like:

        d_longlong=define, longlongsize=8
        If it doesn't, maybe you just need to build your perl installation differently. (I'm just guessing -- I don't even know how to confirm whether this is relevant at all -- but it's worth looking at.)
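The same settings can be read from inside a script through the core Config module, which may be handier than scanning the "perl -V" output. A small sketch; the helper name `int_support` is made up, and which of these keys actually decides quad support on a given build is an assumption, so the last lines simply try pack 'q' and see whether it dies.

```perl
use strict;
use warnings;
use Config;

# Collect the build settings that relate to integer width.
sub int_support {
    my %info;
    for my $key (qw(ivsize use64bitint d_longlong longlongsize)) {
        $info{$key} = defined $Config{$key} ? $Config{$key} : 'undef';
    }
    return \%info;
}

my $info = int_support();
printf "%-13s %s\n", $_, $info->{$_} for sort keys %$info;

# The direct test: does this perl accept the 'q' format at all?
my $has_quads = eval { my $packed = pack 'q', 1; 1 } ? 'yes' : 'no';
print "pack 'q' supported: $has_quads\n";
```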
Re: Creating custom hash of file
by GrandFather (Saint) on Mar 18, 2007 at 10:52 UTC

    So that people can check any code they post, it would be good to have a standard test file. That is pretty easy to generate as part of your sample code by adding:

    open outFile, ">", "test.txt";
    print outFile ' ' x 65536;
    close outFile;

    then change your open to open the test file.

    You should also provide the correct calculated value so we know when the calculation is correct.


    DWIM is Perl's answer to Gödel
      A test AVI and its expected hash are on the page I linked to. Anyway, I did this:

      open outFile, ">", "test.txt";
      print outFile ' ' x (65536 * 4);
      close outFile;
      # hash is: 08080808080c0000

      I know this hash technique is not strong, but I can't do anything about it. It is the same one Media Player Classic uses.
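The expected value for that test file can be cross-checked with plain arithmetic, assuming the hash is the file size plus the sum of all 8-byte little-endian quads in the first and last 64 KB, taken modulo 2**64 (as described in the question). Every quad in a file of spaces is 0x2020202020202020, and there are 2 * 8192 of them. The sketch below uses the core Math::BigInt module; the function name is invented for the illustration.

```perl
use strict;
use warnings;
use Math::BigInt;

# Hash of a file consisting of $size space characters ($size >= 65536).
sub expected_hash_for_spaces {
    my ($size) = @_;
    my $two64 = Math::BigInt->new(2)->bpow(64);
    my $quad  = Math::BigInt->new('0x2020202020202020');   # eight spaces
    my $sum   = $quad->copy->bmul(2 * 8192)->badd($size)->bmod($two64);
    my $hex   = lc $sum->as_hex;       # comes back with a leading "0x"
    $hex =~ s/^0x//;
    return ('0' x (16 - length $hex)) . $hex;
}

print expected_hash_for_spaces(65536 * 4), "\n";   # 08080808080c0000
```

The result agrees with the hash quoted above, which supports the reading "sum of quads, wrapping at 64 bits".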
Re: Creating custom hash of file
by 2ge (Scribe) on Jan 01, 2008 at 19:43 UTC
    Maybe this will help someone: we arrived at this solution, which simulates a UInt64 with four 16-bit values:

    #!/usr/bin/perl
    use strict;
    use warnings;

    print OpenSubtitlesHash('breakdance.avi');

    sub OpenSubtitlesHash {
        my $filename = shift or die("Need video filename");
        open my $handle, "<", $filename or die $!;
        binmode $handle;    # raw bytes; matters on Windows
        my $fsize = -s $filename;
        # start from the file size, split into 16-bit pieces, lowest first
        # (only the low 32 bits of the size are used here)
        my $hash = [$fsize & 0xFFFF, ($fsize >> 16) & 0xFFFF, 0, 0];
        # first 64 KB: 8192 quads
        $hash = AddUINT64($hash, ReadUINT64($handle)) for (1..8192);
        # last 64 KB
        my $offset = $fsize - 65536;
        seek($handle, $offset > 0 ? $offset : 0, 0) or die $!;
        $hash = AddUINT64($hash, ReadUINT64($handle)) for (1..8192);
        close $handle or die $!;
        return UINT64FormatHex($hash);
    }

    # read 8 bytes and return them as four little-endian 16-bit values
    sub ReadUINT64 {
        read($_[0], my $u, 8);
        return [unpack("vvvv", $u)];
    }

    # add two simulated UInt64 values, propagating the carry between pieces
    sub AddUINT64 {
        my $o = [0,0,0,0];
        my $carry = 0;
        for my $i (0..3) {
            if (($_[0]->[$i] + $_[1]->[$i] + $carry) > 0xffff ) {
                $o->[$i] += ($_[0]->[$i] + $_[1]->[$i] + $carry) & 0xffff;
                $carry = 1;
            } else {
                $o->[$i] += ($_[0]->[$i] + $_[1]->[$i] + $carry);
                $carry = 0;
            }
        }
        return $o;
    }

    # format highest piece first
    sub UINT64FormatHex {
        return sprintf("%04x%04x%04x%04x", $_[0]->[3], $_[0]->[2], $_[0]->[1], $_[0]->[0]);
    }
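To see the carry handling in isolation, here is a stripped-down sketch of the same 16-bit-piece addition. `add_limbs` and `limbs_to_hex` are hypothetical names for this illustration, not part of the solution above; pieces are stored lowest first, and a carry out of the top piece is simply dropped, which gives the wrap-around at 64 bits.

```perl
use strict;
use warnings;

# Add two numbers held as four 16-bit pieces each, lowest piece first.
sub add_limbs {
    my ($x, $y) = @_;
    my @out;
    my $carry = 0;
    for my $i (0 .. 3) {
        my $sum = $x->[$i] + $y->[$i] + $carry;
        $out[$i] = $sum & 0xffff;    # keep 16 bits in this piece
        $carry   = $sum >> 16;       # 0 or 1, passed on to the next piece
    }
    return \@out;                    # carry out of the top piece is discarded
}

# Format with the highest piece first.
sub limbs_to_hex {
    my ($n) = @_;
    return sprintf '%04x%04x%04x%04x', reverse @$n;
}

# 0x00000000ffffffff + 1: the carry ripples through two pieces
print limbs_to_hex(add_limbs([0xffff, 0xffff, 0, 0], [1, 0, 0, 0])), "\n";   # 0000000100000000
```

Each intermediate sum is at most 0xffff + 0xffff + 1, so nothing here needs more than 17 bits, which is why the approach is safe on a 32-bit perl.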