huck has asked for the wisdom of the Perl Monks concerning the following question:
Below is a pure perl version of the lookup2 hash by Bob Jenkins as talked about at A Hash Function for Hash Table Lookup.
Edit: This is an updated version, in particular the mix4 mentioned at Re: Pure perl Jenkins 32 bit Hash is now called mix4x and may be discarded. Also there is code to select the proper version based on $Config{ivsize} and a test call.
{ # has use integer/bytes use integer; use bytes; # http://www.perlmonks.org/?node_id=315881 # http://burtleburtle.net/bob/c/lookup2.c # http://burtleburtle.net/bob/hash/doobs.html # http://search.cpan.org/~shlomif/Digest-JHash/lib/Digest/JHash.pm # http://cpansearch.perl.org/src/SHLOMIF/Digest-JHash-0.10/JHash.xs */ + use constant GOLDEN_RATIO => 0x9e3779b9; use constant A => 0; use constant B => 1; use constant C => 2; use constant FFFFFFFF => 0xffffffff; use constant KEY => 0; use constant INITHASH => 1; sub mix4 ($$$) { # 32bit version # per http://www.perlmonks.org/?node_id=1203705 this is a revised 32bi +t under 'use integer'; $_[A] -= $_[B]; $_[A] -= $_[C]; { no integer; $_[A] ^= ($_[C]>>13) +; } $_[B] -= $_[C]; $_[B] -= $_[A]; { no integer; $_[B] ^= ($_[A]<< 8) +; } $_[C] -= $_[A]; $_[C] -= $_[B]; { no integer; $_[C] ^= ($_[B]>>13) +; } $_[A] -= $_[B]; $_[A] -= $_[C]; { no integer; $_[A] ^= ($_[C]>>12) +; } $_[B] -= $_[C]; $_[B] -= $_[A]; { no integer; $_[B] ^= ($_[A]<<16) +; } $_[C] -= $_[A]; $_[C] -= $_[B]; { no integer; $_[C] ^= ($_[B]>> 5) +; } $_[A] -= $_[B]; $_[A] -= $_[C]; { no integer; $_[A] ^= ($_[C]>> 3) +; } $_[B] -= $_[C]; $_[B] -= $_[A]; { no integer; $_[B] ^= ($_[A]<<10) +; } $_[C] -= $_[A]; $_[C] -= $_[B]; { no integer; $_[C] ^= ($_[B]>>15) +; } } sub mix4x ($$$) { # per http://www.perlmonks.org/?node_id=1203705 this is wrong $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>>13); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<< 8); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>>13); $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>>12); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<<16); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>> 5); $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] ^= ($_[C]>> 3); $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] ^= ($_[A]<<10); $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] ^= ($_[B]>>15); } sub mix8 ($$$) { # 64bit version $_[A] &= FFFFFFFF; $_[B] &= FFFFFFFF; $_[C] &= FFFFFFFF; $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] = ( $_[A] ^ ($_[C]>>13) ) & + FFFFFFFF; $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] = ( $_[B] ^ ($_[A]<< 8) ) & + FFFFFFFF; $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] = ( $_[C] ^ ($_[B]>>13) ) & + FFFFFFFF; $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] = ( $_[A] ^ ($_[C]>>12) ) & + FFFFFFFF; $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] = ( $_[B] ^ ($_[A]<<16) ) & + FFFFFFFF; $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] = ( $_[C] ^ ($_[B]>> 5) ) & + FFFFFFFF; $_[A] -= $_[B]; $_[A] -= $_[C]; $_[A] = ( $_[A] ^ ($_[C]>> 3) ) & + FFFFFFFF; $_[B] -= $_[C]; $_[B] -= $_[A]; $_[B] = ( $_[B] ^ ($_[A]<<10) ) & + FFFFFFFF; $_[C] -= $_[A]; $_[C] -= $_[B]; $_[C] = ( $_[C] ^ ($_[B]>>15) ) & + FFFFFFFF; } sub jhash_pp_hex { my ($a, $b, $c) = ( GOLDEN_RATIO, GOLDEN_RATIO, $_[INITHASH] ); my ($p, $length) = (0, length $_[KEY]); my $len=$length; my($x,$y,$z); while ($len>=12) { ($x,$y,$z) = unpack 'LLL', substr($_[KEY], $p, 12); $a+=$x;$b+=$y;$c+=$z; mix($a, $b, $c); $p += 12; $len-=12; } # even if len==0 we need another round to mix in the length ($x,$y,$z) = unpack 'LLL', substr($_[KEY] . (chr(0)x12), $p, 1 +2); $z<<=8; # /* the first byte of c is reserved for the length * +/ $z+=$length; $a+=$x;$b+=$y;$c+=$z; mix($a, $b, $c); my $hex = unpack("H*", pack("N", $c)); return $hex; } # jhash_pp_hex use Config; if ( $Config{ivsize} == 4 ) { *main::mix=*main::mix4; } else { *main::mix=*main::mix8; } } # has use integer/bytes print jhash_pp_hex('abcdef',0)."\n";
I had a situation where i could not use Digest::JHash because i did not have access to a compiler.
I needed to hash filenames for a filetracking program using a mysql database. Rather than carry the filename in 4+ tables i use a unique fnid assigned in a fnid table that holds the filename in a text bucket.
As you cannot really index a text field i was originally using md5 on the filename as a key. I knew there can be collisions but that to select a single filename SELECT fnid FROM fnid WHERE fnmd5=? and fn=? would do ok if fnmd5 was a index.
But md5 in hex is 32 chars, and that was real big, bigger than i needed. So i searched for 32 bit hashs, found the jenkins variants and found Re: Fast string hash in portable perl? [DO NOT USE!] by BrowserUk so i decided to try that. My disappointment will be described in a separate reply thread about the history below [history] Pure perl Jenkins 32 bit Hash.
This also calls into question what to do about Digest::JHash, again i will use a separate reply thread [Digest::JHash problem] Pure perl Jenkins 32 bit Hash to talk about that.
Overall i am happy with this. I realize there are newer jenkins and other 32bit hashs, but this will do. It reduced the phpmyadmin reported size of the fnid table by 50%. It is fast enough for my needs, it seems to be the hash perl uses for its hashs.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
[history] Pure perl Jenkins 32 bit Hash
by huck (Prior) on Nov 16, 2017 at 13:01 UTC | |
|
[Digest::JHash problem] Pure perl Jenkins 32 bit Hash
by huck (Prior) on Nov 16, 2017 at 13:02 UTC | |
|
Re: Pure perl Jenkins 32 bit Hash
by Anonymous Monk on Nov 17, 2017 at 19:52 UTC | |
by huck (Prior) on Nov 18, 2017 at 02:56 UTC | |
by huck (Prior) on Nov 18, 2017 at 19:06 UTC |