in reply to Creating custom hash of file

Judging by the linked-to code, the hash is just summing the ASCII values of each byte. Translating this to fairly idiomatic Perl, I come up with the following:

use strict;
use warnings;
use Fcntl ':seek';

use constant CHUNK => 65536;

sub calc {
    my $file = shift or die "no filename given\n";
    my $hash = -s $file;                 # the hash starts from the file size
    my $chunk = CHUNK;
    $hash < $chunk and die "$file is too small ($hash bytes < $chunk)\n";
    open my $in, '<', $file or die "Cannot open $file for input: $!\n";
    binmode $in;                         # raw bytes, matters on Windows
    local $/ = \$chunk;                  # readline returns $chunk-byte records
    $hash += ord($_) for split //, readline($in);
    seek($in, -$chunk, SEEK_END);        # jump to the last chunk
    $hash += ord($_) for split //, readline($in);
    close $in;
    return sprintf '%016x', $hash;
}

print calc($_), "\t$_\n" for @ARGV;

...but, as GrandFather points out, without a test file and expected results, it's hard to know if there are any bugs in this.

update: I see you've added a test case. And I see that it's actually summing 32-bit values. The following code reflects that fact.

use strict;
use warnings;
use Fcntl ':seek';

use constant CHUNK => 65536;

sub calc {
    my $file = shift or die "no filename given\n";
    my $hash = -s $file;                 # the hash starts from the file size
    my $chunk = CHUNK;
    $hash < $chunk and die "$file is too small ($hash bytes < $chunk)\n";
    open my $in, '<', $file or die "Cannot open $file for input: $!\n";
    binmode $in;                         # raw bytes, matters on Windows
    local $/ = \$chunk;                  # readline returns $chunk-byte records
    $hash += $_ for unpack 'N*', readline($in);   # 32-bit big-endian values
    seek($in, -$chunk, SEEK_END);        # jump to the last chunk
    $hash += $_ for unpack 'N*', readline($in);
    close $in;
    return sprintf '%016x', $hash;
}

# Test case: a file of 256K spaces.
my $test = 'test.deleteme';
open my $out, '>', $test or die "Cannot open $test for output: $!\n";
print $out ' ' x (65536*4);
close $out;
print calc($test), $/;
unlink $test;

Hmm, but that doesn't match the test results. Back to the drawing board.

2nd update: Got it! It's summing quads, not longs, and wrapping on overflow (only the low 64 bits are kept).

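I had to use bigint, which is sub-optimal in that it gets the correct results at the cost of vastly increased execution time. Still, it does the job, and on a perl built with 64-bit integers you should be able to remove the 'use bigint' and get the same results much, much faster. One caveat: plain Perl addition promotes to floating point on overflow instead of wrapping, so check the output against a known hash before relying on it.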

use strict;
use warnings;
use Fcntl ':seek';
use bigint;

use constant CHUNK => 65536;

sub calc {
    my $file = shift or die "no filename given\n";
    my $hash = -s $file;                 # the hash starts from the file size
    my $chunk = CHUNK;
    $hash < $chunk and die "$file is too small ($hash bytes < $chunk)\n";
    open my $in, '<', $file or die "Cannot open $file for input: $!\n";
    binmode $in;                         # raw bytes, matters on Windows
    local $/ = \$chunk;                  # readline returns $chunk-byte records
    for my $quad (unpack 'q*', readline($in)) {
        $hash += $quad;
        $hash &= 2 ** 64 - 1;            # keep the low 64 bits
    }
    seek($in, -$chunk, SEEK_END);        # jump to the last chunk
    for my $quad (unpack 'q*', readline($in)) {
        $hash += $quad;
        $hash &= 2 ** 64 - 1;
    }
    close $in;
    return sprintf '%016x', $hash;
}

open my $out, '>', 'test.deleteme' or die "Cannot open test.deleteme: $!\n";
print $out ' ' x (65536*4);
close $out;
print calc('test.deleteme'), $/;
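
For what it's worth, the wrap-around can be looked at in isolation. The following is only a sketch, assuming a perl built with 64-bit integers; add_wrap64 is a made-up helper name and not part of the code above. It contrasts ordinary addition, which promotes to a float on overflow, with addition under 'use integer', which in practice wraps.

```perl
use strict;
use warnings;
use Config;

# Assumption: this perl was built with 64-bit integers.
die "needs 64-bit integers\n" unless $Config{ivsize} >= 8;

# Hypothetical helper: add two integers modulo 2**64.
sub add_wrap64 {
    my ($x, $y) = @_;
    use integer;    # integer-only arithmetic, which wraps in practice
    return $x + $y;
}

my $max     = ~0;                     # all 64 bits set
my $plain   = $max + 1;               # ordinary addition: promoted to a float
my $wrapped = add_wrap64($max, 1);    # wraps around to zero

printf "plain:   %s\n",    $plain;
printf "wrapped: %016x\n", $wrapped;  # 0000000000000000
```

Under 'use integer' the values are treated as signed, but printing with %016x still shows the same 64-bit pattern.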

• another intruder with the mooring in the heart of the Perl

Re^2: Creating custom hash of file
by bart (Canon) on Mar 18, 2007 at 15:06 UTC
    "I had to use bigint, which is sub-optimal in that it gets the correct results at the cost of vastly increased execution time."
    Only if you don't use an XS library to do the calculations. Math::BigInt supports several, including Bit::Vector (Math::BigInt::BitVect), Math::Pari (Math::BigInt::Pari), and Math::GMP (Math::BigInt::GMP).
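
    As a rough sketch of what that could look like: the 'try' import asks Math::BigInt for an XS backend and quietly falls back to the pure-Perl one when none is installed. The helper name add_mod64 is made up for illustration, and whether GMP or Pari is actually available on a given machine is an assumption.

```perl
use strict;
use warnings;

# 'try' asks for an XS backend and quietly falls back to the pure-Perl
# Math::BigInt::Calc if neither Math::BigInt::GMP nor ::Pari is installed.
use Math::BigInt try => 'GMP,Pari';

my $MASK = Math::BigInt->new(2)->bpow(64)->bsub(1);    # 2**64 - 1

# Hypothetical helper: add one quad to the running sum, modulo 2**64.
sub add_mod64 {
    my ($sum, $quad) = @_;
    return $sum->copy->badd($quad)->band($MASK);
}

print "backend: ", Math::BigInt->config()->{lib}, "\n";
print add_mod64(Math::BigInt->new('18446744073709551615'), 5), "\n";   # 4
```

    The printed backend name shows which library ended up doing the work.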
    "Judging by the linked-to code, the hash is just summing the ASCII values of each byte."
    Hmm, I thought unpack supported braindead checksumming out of the box... something to do with a "%" sign, apparently. It might suffice to read a 64k block and apply unpack to it, with the proper format.
      The unpack format would have to be "%64q*". That still requires 64-bit support from Perl.

      Anno
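
      A minimal sketch of that checksum format, assuming a perl with quad support (it dies early otherwise). The helper name block_sum64 is invented for the example.

```perl
use strict;
use warnings;
use Config;

# Assumption: 'q' needs a perl built with 64-bit integers.
die "this perl has no quad support\n" unless $Config{ivsize} >= 8;

# Hypothetical helper: 64-bit checksum of one block of native quads.
sub block_sum64 {
    my ($block) = @_;
    return unpack '%64q*', $block;
}

my $block = ' ' x 65536;    # same filler as the test file in this thread
printf "%016x\n", block_sum64($block);
```

      The file size and the second block would still have to be added in, with the same wrap-around, to get the full hash.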

Re^2: Creating custom hash of file
by 2ge (Scribe) on Mar 18, 2007 at 11:55 UTC
    Grinder,

    thank you very much for your code, but on my system (Win XP, 32-bit), perl doesn't recognize quads :( I tried with LL, but it runs really slowly, and I still get '00000000ffffffff'. Could you please improve your code so that it doesn't use quads and bigint?

      Hmm, well, it's going to be hard to get rid of 'bigint' if you want 64-bit arithmetic to work correctly :(

      If your perl doesn't know about pack 'q', then you can roll it yourself with the following, but it will be even slower:

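      use strict;
      use warnings;
      use Fcntl ':seek';
      use bigint;

      use constant CHUNK => 65536;

      sub calc {
          my $file = shift or die "no filename given\n";
          my $hash = -s $file;                 # the hash starts from the file size
          my $chunk = CHUNK;
          $hash < $chunk and die "$file is too small ($hash bytes < $chunk)\n";
          open my $in, '<', $file or die "Cannot open $file for input: $!\n";
          binmode $in;                         # raw bytes, matters on Windows
          local $/ = \$chunk;                  # readline returns $chunk-byte records
          # Build each quad from two 32-bit words. This assumes little-endian
          # data, as on x86: the low word is stored first.
          my @val = unpack 'V*', readline($in);
          for (my $j = 0; $j < $#val; $j += 2) {
              $hash += ($val[$j+1] << 32) + $val[$j];
              $hash &= 2 ** 64 - 1;            # keep the low 64 bits
          }
          seek($in, -$chunk, SEEK_END);        # jump to the last chunk
          @val = unpack 'V*', readline($in);
          for (my $j = 0; $j < $#val; $j += 2) {
              $hash += ($val[$j+1] << 32) + $val[$j];
              $hash &= 2 ** 64 - 1;
          }
          close $in;
          # Format the two halves separately: a single %016x would be clamped
          # to 32 bits on a perl without 64-bit integers.
          return sprintf '%08x%08x', $hash >> 32, $hash & (2 ** 32 - 1);
      }

      open my $out, '>', 'test.deleteme' or die "Cannot open test.deleteme: $!\n";
      print $out ' ' x (65536*4);
      close $out;
      print calc('test.deleteme'), $/;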

      Perl's probably not the best language for this. The code is much shorter than in the other languages, which is usually the case for a given algorithm, but the performance is horrible. You really want to do yourself a favour and run this stuff on a 64-bit architecture.

      Maybe there's another monk who's into numerical analysis and can spot an insight, but it's beyond my ken.
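
      One possible avenue, offered only as a sketch: keep the low and high 32-bit halves of the sum in separate accumulators and fold the carry in at the end, which needs neither quads nor bigint. It assumes the data is little-endian, and it relies on each partial sum staying far below 2**53, so the arithmetic is exact even when perl falls back to doubles. The names sum_halves and fold64 are invented here, and this has not been checked against the reference implementation.

```perl
use strict;
use warnings;
use Fcntl ':seek';

use constant CHUNK => 65536;
use constant TWO32 => 4294967296;    # 2**32

# Sum one block of little-endian quads as separate low and high halves.
sub sum_halves {
    my ($block) = @_;
    my ($lo, $hi) = (0, 0);
    my @w = unpack 'V*', $block;     # unsigned 32-bit little-endian words
    for (my $i = 0; $i < $#w; $i += 2) {
        $lo += $w[$i];               # low half is stored first
        $hi += $w[$i + 1];
    }
    return ($lo, $hi);
}

# Combine the partial sums modulo 2**64 and format as 16 hex digits.
sub fold64 {
    my ($lo, $hi) = @_;
    my $carry = int($lo / TWO32);
    $lo -= $carry * TWO32;
    $hi += $carry;
    $hi -= int($hi / TWO32) * TWO32; # drop anything beyond 64 bits
    return sprintf '%08x%08x', $hi, $lo;
}

sub calc {
    my $file = shift or die "no filename given\n";
    my $size = -s $file;
    my $chunk = CHUNK;
    $size < $chunk and die "$file is too small ($size bytes < $chunk)\n";
    open my $in, '<', $file or die "Cannot open $file for input: $!\n";
    binmode $in;                     # raw bytes, matters on Windows
    my ($lo, $hi) = ($size, 0);      # the hash starts from the file size
    for my $from_end (0, 1) {
        seek($in, -$chunk, SEEK_END) or die "seek failed: $!\n" if $from_end;
        read($in, my $block, $chunk) == $chunk or die "short read on $file\n";
        my ($l, $h) = sum_halves($block);
        $lo += $l;
        $hi += $h;
    }
    close $in;
    return fold64($lo, $hi);
}

print calc($_), "\t$_\n" for @ARGV;
```

      Only two passes over 16384 words each are needed, so this should avoid most of the bigint overhead.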


        thank you, it is getting closer and closer to the python code (see the wiki page I linked in my first message, maybe it will help you). Also, the funny thing is, when I try your latest code it is terribly slow (but that's OK, as you wrote), and I get a different hash. I have:
        This is perl, v5.8.4 built for MSWin32-x86-multi-thread
        (with 3 registered patches, see perl -V for more detail)
        perlhash.pl 00000000ffffffff
        When you run this, do you get a different hash?
      "... on my system (Win XP, 32-bit), perl doesn't recognize quads :( I tried with LL, but it runs really slowly, and I still get '00000000ffffffff' ..."

      So, what does this say about the C implementation that you pointed to on that wiki page? Does the same machine run the C version reasonably fast? If so, then there must be some way that the C library on that machine is able to do 64-bit arithmetic without being reduced to a crawl.

      I'm also confused because the OP starts by talking about converting python code to perl. How long does it take for the python version to run on this same machine?

      Have you checked the output of "perl -V" on your machine, and does that include something like:

      d_longlong=define, longlongsize=8
      If it doesn't, maybe you just need to build your perl installation differently. (I'm just guessing -- I don't even know how to confirm whether this is relevant at all -- but it's worth looking at.)
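
      A small sketch for checking this from inside Perl rather than by reading "perl -V". As far as I know, the settings that matter for pack 'q' are ivsize and use64bitint rather than d_longlong, but treat that as an assumption. The two helper names are made up for the example. From the command line, "perl -V:ivsize" prints the same ivsize value.

```perl
use strict;
use warnings;
use Config;

# True if integers (IVs) are at least 64 bits wide in this build.
sub has_64bit_ints {
    my $ivsize = $Config{ivsize} || 0;
    return $ivsize >= 8 ? 1 : 0;
}

# True if pack/unpack accept the 'q' template; unpack croaks otherwise.
sub has_quad_pack {
    return eval { my @q = unpack 'q', "\0" x 8; 1 } ? 1 : 0;
}

printf "ivsize=%s use64bitint=%s\n",
    $Config{ivsize}, $Config{use64bitint} || 'undef';
printf "64-bit ints: %s, pack 'q': %s\n",
    has_64bit_ints() ? 'yes' : 'no',
    has_quad_pack()  ? 'yes' : 'no';
```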
        Graff, thanks for the reply. I have "d_longlong=undef, longlongsize=8". Also, I'd like this hashing program to work under any perl. To answer your question: I have a working version in python, and it is pretty fast (1 file takes 0.1s or so). With perl it takes ages.