I wanted a short bit of code to generate a checksum on a file or a bit of text, but without requiring the use of MD5 or similar has programs. Following is the function, named ghash for "generate hash" and a couple of lines calling the function to demonstrate that it works. In tests I found that increasing the size of the multiplier past a certain point does not change in the return value when two characters in a larger test file are transposed. However, I remain uncertain if the multiplier is at an optimal value.
$theText = "abc def ghi"; print "this value for $theText == " , ghash($theText), "\n"; $theText = "acb def ghi"; print "this value for $theText == " , ghash($theText), "\n"; ##################### sub ghash() { my ($str) = @_; my $y = length($str); my $ret = 0; for(my $i = 0; $i < $y; ) { $ret += ord( substr($str, $i,1)) * (++$i); } return $ret; }

Replies are listed 'Best First'.
Re: Quick and Dirty Hash
by roboticus (Chancellor) on Jun 22, 2006 at 01:52 UTC
    clancey:

    The problem you're experiencing is that you aren't mixing up your bits enough. With a simple multiply-accumulate, you'll find that you're not distributing your results evenly through the result space, so similar sequences tend to result in similar values.

    You may want to try one of the CRC (or similar) algorithms which are designed for error-checking and try to spread the values around. (See also Hash functions, Error detection, Data Integrity)

    --roboticus

Re: Quick and Dirty Hash
by GrandFather (Saint) on Jun 22, 2006 at 04:08 UTC

    You may be interested in this implementation of the Adler-32 checksum algorithm (see Adler-32).

    use warnings; use strict; use constant MOD_ADLER => 65521; print Adler32('abc def ghi') . "\n"; print Adler32('acb def ghi') . "\n"; sub Adler32 { my $data = shift; my $scan = 0; my $len = length $data; my $a = 1; my $b = 0; while ($len) { my $tlen = $len > 5550 ? 5550 : $len; $len -= $tlen; do { $a += ord (substr $data, $scan++, 1); $b += $a; } while (--$tlen); $a = ($a & 0xffff) + ($a >> 16) * (65536 - MOD_ADLER); $b = ($b & 0xffff) + ($b >> 16) * (65536 - MOD_ADLER); } $a -= MOD_ADLER if $a >= MOD_ADLER; $b = ($b & 0xffff) + ($b >> 16) * (65536 - MOD_ADLER); $b -= MOD_ADLER if $b >= MOD_ADLER; return ($b << 16) | $a; }

    Prints:

    378209230 378274766

    Note that this is a translitteration from the C code - $a and $b would be better renamed. :)


    DWIM is Perl's answer to Gödel
Re: Quick and Dirty Hash
by japhy (Canon) on Jun 22, 2006 at 01:03 UTC
    You could just use my $cksum = unpack "%32C*", $string; -- see unpack.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      unpack does not work.

      It returns the same value for "abc def ghi" as for "acb def ghi"

      Thank you for the pointer to some more theory. I will check it out.

      Why do something myself? Firstly, curiosity. I have MD5 installed on my machine, but I wanted to try my hanbd at my own solution.

      Secondly, I am looking at something I can include in an open source project I maintain. All the users are on Windows machines and the majority struggle enormously with getting Perl installed, especially when you need extra libs.

      The problem I am trying to solve is rapidly comparing two files to see if they are the same, but where the files have randomly generated file names and creation times and where the files could be the same size as other files of the same type.
        Checksums aren't unique; eventually, you find two strings that differ but have the same checksum. Checksums are merely a precautionary measure.

        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: Quick and Dirty Hash
by TedPride (Priest) on Jun 22, 2006 at 04:36 UTC
    There are a large number of hashing algorithms you could use. Most of them work better in C / C++ than Perl, however, since you really need unsigned numbers, and there doesn't seem to be a simple way to tell Perl you don't want a signed variable. The following should work ok for strings of more than a few characters:
    use strict; use warnings; print hash('dfsdgogmoerigre'); BEGIN { my @c = (0..9, 'A'..'F'); my $c = $#c + 1; sub hash { my ($n, $r) = 0; for (split //, $_[0]) { $n = ord($_) + ($n << 6) + ($n << 16) - $n; } $n = abs($n); while ($n >= $c) { $r .= $c[$n % $c]; $n /= $c; } return $r; } }
Re: Quick and Dirty Hash
by davidrw (Prior) on Jun 22, 2006 at 01:58 UTC
    but without requiring the use of MD5 or similar has programs
    (not that it's necessarily bad, but) why?

    If it's for an excercise or performance requirement, sure .. if it's because you "can't" install Digest::MD5 or other modules, why not a local install or PAR for distribution?
Re: Quick and Dirty Hash
by ikegami (Patriarch) on Jun 22, 2006 at 02:45 UTC

    The use of the word hash is confusing, due to the presence of a native structure with the same name.

    Speaking of... I wonder if it's possible to use Perl's hashing algorithm (through XS).

Re: Quick and Dirty Hash
by hrr (Monk) on Jun 29, 2006 at 13:58 UTC
    If you use ActivePerl on Windows, Digest::MD5 is automatically installed. Then, it seems to me that using this hash algorithm makes much sense, as it is simple and gives reliable results:
    use Digest::MD5 qw(md5 md5_hex); $digest = md5($data);
    Implementing your own hash means that it is probably more prone to collisions (i.e., that two different files lead to the same hash value). While MD5 has known cryptographic weaknesses, it certainly remains well suited for your task.