Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re^5: can I change hash keys or values directly

by Marshall (Canon)
on Feb 05, 2021 at 03:43 UTC ( [id://11127911]=note: print w/replies, xml ) Need Help??


in reply to Re^4: can I change hash keys or values directly
in thread can I change hash keys or values directly

Yes and No.

And this means a stored key must be immutable , because that calculated array-index would change too.

The stored key is immutable in this sense, but the calculated array-index for the linked list containing that particular key can change based upon the size of the hash.

This is true:
Consequently there can't be a faster way than replacing the whole key-value pair with "delete OLD" + "store NEW".

One hashing algorithm that has been used before in Perl is:

unsigned int hash = 0; while (*s) hash = hash * 33 + *s++;
s is a pointer to a null terminated ASCII string.
This runs "rocket fast".
On a modern processor, multiply by 33 is faster than left shift by 4 (times 32) and then an add.

--UPDATE ---

Just for fun, I attach actual C code from memtracker.h from 2010 and the 2013 update.
The program uses an integer as the "hash key". Some obvious performance enhancements mentioned in the comments to version 2 were not actually coded because it just didn't matter! The longer version is much, much faster. Shorter code does not equal faster code. The HTTP addresses in the code probably don't work anymore since they are more than a decade old! Also noteworthy: this code was only tested on a 32 bit machine.

/* MT_gen_hash_bits() * Dan Bernstein's hash function - works great for short ASCII strin +gs! * This is an empirical thing that "just works" without mathematical * proof. Noteworthy: this is algorithm used by Perl. * http://burtleburtle.net/bob/hash/doobs.html */ unsigned int MT_gen_hash_bits (unsigned long int address) { char ascii[21]; // max 64 bit unsigned int is 20 decimal digits sprintf (ascii, "%lu", address); char *s = ascii; unsigned int hash = 0; while (*s) hash = hash * 33 + *s++; return hash; } /* MT_gen_hash_bits_version2() * Thomas Wang's 64-bit hash function - works well for integers, and + is significantly * faster than the DJB function since the key is not ASCII. It is a +lso slightly * better in distributing keys. * http://www.concentric.net/~Ttwang/tech/inthash.htm */ unsigned int MT_gen_hash_bits (unsigned long int address) { address = (~address) + (address << 21); // address = (address << 21) + - address - 1; address = address ^ (address >> 24); address = (address + (address << 3)) + (address << 8); // address * +265 address = address ^ (address >> 14); address = (address + (address << 2)) + (address << 4); // address * +21 address = address ^ (address >> 28); address = address + (address << 31); return (unsigned int)address; }

Replies are listed 'Best First'.
Re^6: can I change hash keys or values directly
by LanX (Saint) on Feb 05, 2021 at 08:38 UTC
    > but the calculated array-index for the linked list containing that particular key can change based upon the size of the hash

    Well yes, but in the case when the hash-table is doubled, the whole internal data is overhauled anyway and effectively a new hash is created.

    The old linked lists can't be reused. (At least I don't see how this could be done, the collisions need to be reduced by using the new slots)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Yes, quite correct.
Re^6: can I change hash keys or values directly
by LanX (Saint) on Feb 05, 2021 at 16:11 UTC
    Regarding your update:

    Thanks, very interesting! :)

    Please pardon my C ignorance, but AFAIS is the second hashing version operating on the pointer to the key's string not the string itself.

    Right?

    This would have 3 consequences

    • the randomization to avoid DoS attacks is based on the assumption that these pointers are not deterministic
    • the keys are indeed mutable, as long as the new key fits into the allocated space of the old one -I. E. is shorter.
    • memory management guarantees that those pointers are constant.° Last time I worked "close to the metal" MMUs were only an exotic feature allowing virtual addresses.

      I suppose all Perl ports control this issue?

      Like integrity of the hash after being swapped?

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    °) or the function is stable modulo memory page boundaries. But this might make then predictable again

      I just threw these hashing algorithms in as examples - seemed at least tangentially relevant to the discussion of "how does a Perl hash work?". The first one is close to what Perl used to do. The second one works differently. They are equivalent at the interface level. Performance note: the left shift then add stuff is not necessary. To multiply by 5 => left shift 2 (multiply by 4) then add original => a multiply by 5. On a modern processor, integer shift, add, and multiply are all about as fast. That's quite astonishing, but true. The math unit has way more transistors in it than the actual CPU. Intel and AMD have spent a lot of time making math work faster. However the shown code is so fast, this doesn't matter. Anyway, multiply is faster than you might think. And also applies to Perl.

      Both algorithms operate on a binary memory pointer. In the first one, I made the binary pointer into a string with sprintf() because the Perl algorithm is optimized for strings. That one is "slow" because sprintf is slow. The second algorithm is specifically designed to generate hashing bits from a binary number.

      There is no DoS security issue with my application. I'm not building hash tables based upon user supplied data. All data comes from C/C++ memory allocation operations (binary memory pointers). If I am accepting data from a user application and that application knows the hash algorithm, it is possible for me them? to generate ascii data strings which will cause lots of hash "collisions" which cause the hash to double in size way, way more often than one would normally expect.

      I don't know what you mean by keys are indeed mutable? In my C application, all the hash keys are the same size, true. The binary hash key results in a binary hash value. The size of the bucket array depends upon how many bits of the binary hash value are used as an index into that bucket array. If a key "changes" usually it will cause the entry to move to a different "hash bucket". There is no special case code for: "hash key changed, but by chance the new key hashes to the same hash bucket". That case, by hash algorithm design, is rare. "Change hash key" means delete old key, create new key. That may or may not cause the hash size (the size of the bucket array) to double.

      I don't know what you mean by this: Like integrity of the hash after being swapped?. If mean does Perl ensure the integrity of the hash when it doubles? Yes! If you mean, "if some physical memory is swapped to disk, does the O/S MMU figure that out to make it transparent to the application?", Yes. All of that has gotten much easier over the years. One project I had required 16 MB of physical memory on a 80186 processor which only handled 1 MB. I had to work with the O/S and Application folks to design some effective, fast, easy to use hardware to make this possible. Now the problem is in the inverse, the processor can produce more address bits than exist in the physical sense.

        OK, Sorry I was under the impression you were showing parts of Perl's implementation ...

        ... and this raised some security doubts.

        It also surprised me to see that the address to the key is hashed and not the string of the key.

        Cause this has some implications.

        • For example these pointers must stay "stable" after memory re-allocations and I have seen Perl ports to machines without MMU listed (like running on Motorola 68000 boxes).
        • It's also that you can normally replace the string behind the pointer with a shorter one, which would make keys mutable (in a limited way).
        Anyway ...

        Thanks for the enlightening conversation! :)

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11127911]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (3)
As of 2024-04-26 05:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found