Quick and Dirty Hash

Replies are listed 'Best First'.
Re: Quick and Dirty Hash by roboticus (Chancellor) on Jun 22, 2006 at 01:52 UTC
clancey: The problem you're experiencing is that you aren't mixing up your bits enough. With a simple multiply-accumulate, you'll find that you're not distributing your results evenly through the result space, so similar sequences tend to result in similar values. You may want to try one of the CRC (or similar) algorithms which are designed for error-checking and try to spread the values around. (See also Hash functions, Error detection, Data Integrity) --roboticus	[reply]
Re: Quick and Dirty Hash by GrandFather (Saint) on Jun 22, 2006 at 04:08 UTC
You may be interested in this implementation of the Adler-32 checksum algorithm (see Adler-32). use warnings; use strict; use constant MOD_ADLER => 65521; print Adler32('abc def ghi') . "\n"; print Adler32('acb def ghi') . "\n"; sub Adler32 { my $data = shift; my $scan = 0; my $len = length $data; my $a = 1; my $b = 0; while ($len) { my $tlen = $len > 5550 ? 5550 : $len; $len -= $tlen; do { $a += ord (substr $data, $scan++, 1); $b += $a; } while (--$tlen); $a = ($a & 0xffff) + ($a >> 16) * (65536 - MOD_ADLER); $b = ($b & 0xffff) + ($b >> 16) * (65536 - MOD_ADLER); } $a -= MOD_ADLER if $a >= MOD_ADLER; $b = ($b & 0xffff) + ($b >> 16) * (65536 - MOD_ADLER); $b -= MOD_ADLER if $b >= MOD_ADLER; return ($b << 16) \| $a; } [download] Prints: `378209230 378274766` [download] Note that this is a translitteration from the C code - $a and $b would be better renamed. :) DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re: Quick and Dirty Hash by japhy (Canon) on Jun 22, 2006 at 01:03 UTC
You could just use `my $cksum = unpack "%32C", $string;` -- see unpack. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart*	[reply] [d/l]
Re^2: Quick and Dirty Hash by clancey (Initiate) on Jun 22, 2006 at 02:54 UTC
unpack does not work. It returns the same value for "abc def ghi" as for "acb def ghi" Thank you for the pointer to some more theory. I will check it out. Why do something myself? Firstly, curiosity. I have MD5 installed on my machine, but I wanted to try my hanbd at my own solution. Secondly, I am looking at something I can include in an open source project I maintain. All the users are on Windows machines and the majority struggle enormously with getting Perl installed, especially when you need extra libs. The problem I am trying to solve is rapidly comparing two files to see if they are the same, but where the files have randomly generated file names and creation times and where the files could be the same size as other files of the same type.	[reply]
Re^3: Quick and Dirty Hash by japhy (Canon) on Jun 22, 2006 at 03:10 UTC
Checksums aren't unique; eventually, you find two strings that differ but have the same checksum. Checksums are merely a precautionary measure. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply]
Re: Quick and Dirty Hash by TedPride (Priest) on Jun 22, 2006 at 04:36 UTC
There are a large number of hashing algorithms you could use. Most of them work better in C / C++ than Perl, however, since you really need unsigned numbers, and there doesn't seem to be a simple way to tell Perl you don't want a signed variable. The following should work ok for strings of more than a few characters: `use strict; use warnings; print hash('dfsdgogmoerigre'); BEGIN { my @c = (0..9, 'A'..'F'); my $c = $#c + 1; sub hash { my ($n, $r) = 0; for (split //, $_[0]) { $n = ord($_) + ($n << 6) + ($n << 16) - $n; } $n = abs($n); while ($n >= $c) { $r .= $c[$n % $c]; $n /= $c; } return $r; } }` [download]	[reply] [d/l]
Re: Quick and Dirty Hash by davidrw (Prior) on Jun 22, 2006 at 01:58 UTC
but without requiring the use of MD5 or similar has programs (not that it's necessarily bad, but) why? If it's for an excercise or performance requirement, sure .. if it's because you "can't" install Digest::MD5 or other modules, why not a local install or PAR for distribution?	[reply]
Re: Quick and Dirty Hash by ikegami (Patriarch) on Jun 22, 2006 at 02:45 UTC
The use of the word hash is confusing, due to the presence of a native structure with the same name. Speaking of... I wonder if it's possible to use Perl's hashing algorithm (through XS).	[reply]
Re: Quick and Dirty Hash by hrr (Monk) on Jun 29, 2006 at 13:58 UTC
If you use ActivePerl on Windows, Digest::MD5 is automatically installed. Then, it seems to me that using this hash algorithm makes much sense, as it is simple and gives reliable results: `use Digest::MD5 qw(md5 md5_hex); $digest = md5($data);` [download] Implementing your own hash means that it is probably more prone to collisions (i.e., that two different files lead to the same hash value). While MD5 has known cryptographic weaknesses, it certainly remains well suited for your task.	[reply] [d/l]