Re^3: 32bit/64bit hash function: Use perls internal hash function?

by eyepopslikeamosquito (Archbishop)
on Apr 09, 2022 at 02:36 UTC


in reply to Re^2: 32bit/64bit hash function: Use perls internal hash function?
in thread 32bit/64bit hash function: Use perls internal hash function?

If you could illustrate your problem with a smallish example benchmark program, that would be good and should lead to better answers.

Without seeing any code, we are forced to guess. I am guessing your problem is similar to one discussed in a thread I mentioned earlier, where some folks suggested using an external database (such as SQLite) instead of a gigantic Perl hash. From that thread, this quote from the infamous BrowserUk describes a situation similar to the one you're facing:

I've run Perl's hashes up to 30 billion keys/2 terabytes (ram) and they are 1 to 2 orders of magnitude faster, and ~1/3rd the size of storing the same data (64-bit integers) in an sqlite memory-based DB. And the performance difference increases as the size grows. Part of the difference is that however fast the C/C++ DB code is, calling into it from Perl, adds a layer of unavoidable overhead that Perl's built-in hashes do not have.

Have you tried using a database (such as SQLite) instead of a gigantic Perl hash? If so, what was the performance difference?
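
For example, a rough benchmark sketch along these lines would make the comparison concrete (the key format and record count are invented, and it assumes DBD::SQLite is installed):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use DBI;

    # Build a test set of N key/value pairs in a plain Perl hash.
    my $n    = 100_000;
    my %hash = map { ("key$_" => $_) } 1 .. $n;

    # Load the same pairs into an in-memory SQLite table with an indexed key.
    my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
        { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE kv (k TEXT PRIMARY KEY, v INTEGER)');
    my $ins = $dbh->prepare('INSERT INTO kv (k, v) VALUES (?, ?)');
    $ins->execute("key$_", $_) for 1 .. $n;
    $dbh->commit;
    my $sel = $dbh->prepare('SELECT v FROM kv WHERE k = ?');

    # Compare random lookups against both stores.
    cmpthese(-3, {
        perl_hash => sub {
            my $k = 'key' . (1 + int rand $n);
            my $v = $hash{$k};
        },
        sqlite => sub {
            my $k = 'key' . (1 + int rand $n);
            $sel->execute($k);
            my ($v) = $sel->fetchrow_array;
        },
    });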

Though I can't personally endorse Judy Arrays because I've never used them, they worked for BrowserUk, albeit with a warning that the Judy code is horribly complex and hard to understand if something goes wrong.


Re^4: 32bit/64bit hash function: Use perls internal hash function?
by sectokia (Pilgrim) on Apr 10, 2022 at 11:11 UTC

    In terms of database performance: basically I have ~120 million records, with about 10 million added and 10 million deleted per day, and I need to fetch a record by its identifier (via UDP request) and respond within 5 ms, at a peak of around 100 to 150 requests per second.
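
    The serving path is conceptually as simple as this rough sketch (the port, key format, and record contents here are just for illustration):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use IO::Socket::INET;

        # %cache stands in for the big in-memory hash of records.
        my %cache = (abc123 => 'record data for abc123');

        my $sock = IO::Socket::INET->new(
            LocalPort => 5000,      # port is illustrative
            Proto     => 'udp',
        ) or die "socket: $!";

        # Answer each datagram with the record for the requested id.
        # Nothing else ever happens in this loop, which is why the
        # latency stays flat.
        while (1) {
            my $peer = $sock->recv(my $id, 64) or next;
            $id =~ s/\s+\z//;
            my $reply = $cache{$id} // 'NOT FOUND';
            $sock->send($reply, 0, $peer);
        }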

    No database we have tried achieves this consistently. Most of them can do it for a while, then slow down, or only hit the target intermittently. MSSQL can go 10+ seconds without answering a basic query if you are inserting and deleting heavily at the same time; MySQL is the same. Even Redis will suddenly lurch for hundreds of milliseconds, especially when large numbers of records all expire at the same time, even when you run it under Linux as a real-time process with top FIFO scheduling priority. The simple fact seems to be that almost all databases have branches that cause a large amount of I/O and internal work to be done before they get back to the job of answering your query.

    In comparison, this Perl hash cache has a perfect record because it never has to do anything else. So long as it has priority access to the disk and CPU, it works flawlessly: I can ingest over 100,000 records per second while still servicing 5,000 requests per second on a modest PC, and it achieves < 1 ms 100% of the time.

    In terms of why I want a faster hasher: the number of hash computations per insert is 1 when the hash table is empty, averages 2 when the table is half full, and ramps up toward infinity as the table becomes full. By moving to a faster hash algorithm, I can shrink the hash table size down toward the total number of records (as you do that, the number of hashes per insert goes up), which wastes less RAM but uses more CPU.
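
    Those numbers follow the usual open-addressing estimate of roughly 1/(1 - alpha) hashes per insert at load factor alpha, assuming reasonably uniform hashing; a quick sketch:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Expected hash computations per insert under open addressing
        # with uniform hashing is roughly 1 / (1 - alpha), where
        # alpha = records stored / table slots.
        for my $alpha (0.00, 0.25, 0.50, 0.75, 0.90, 0.99) {
            printf "load factor %.2f -> ~%.1f hashes per insert\n",
                $alpha, 1 / (1 - $alpha);
        }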

    In terms of the 8 KB vs 512-byte I/O fetch: reading only 512 bytes per lookup allows the number of records per second I can fetch to approach the SSD's raw IOPS.
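
    Conceptually the fetch looks like this rough sketch, assuming fixed-size 512-byte record slots in a flat file (the file name and slot numbering are just illustrative):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Fcntl qw(SEEK_SET);

        # Fetch one fixed-size record by slot number with a single
        # small read, so each lookup costs one I/O operation.
        my $RECSIZE = 512;    # one sector per record instead of an 8 KB page

        sub fetch_record {
            my ($fh, $slot) = @_;
            sysseek $fh, $slot * $RECSIZE, SEEK_SET or die "seek: $!";
            my $got = sysread $fh, my $buf, $RECSIZE;
            die "read: $!" unless defined $got;
            return $buf;
        }

        open my $fh, '<:raw', 'records.dat' or die "open: $!";
        my $rec = fetch_record($fh, 42);    # slot number is illustrative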
