chuckd has asked for the wisdom of the Perl Monks concerning the following question:

Has anyone ever had a problem with Perl and memory issues when using very long unusual values as keys for a hash. I know that most people will use a short simple key like a number, name, etc. but I accidentally started using a very long string with different types of characters (it was pulled from a text file) as the key. When I get up to a certain size in the hash (around 1 million) perl's memory usage shoots up to about 1 GIG from about 500 MB. Hash anyone had any experience with this? Can anyone point me to any articles on Perl hash structures/allocations? Can anyone tell me if using long keys will contribute to Perls mem usage shooting through the roof?

Replies are listed 'Best First'.
Re: strange keys with a hash
by jethro (Monsignor) on Jan 30, 2009 at 19:56 UTC

    Perl allocates a number of buckets per hash. Whenever a certain percentage of the buckets are filled with hash keys, the number of buckets gets doubled. Because of the hash algorithm a key then either stays in its bucket or has to go to the corresponding bucket of the new space.

    When you print a hash in scalar context, it tells you how many buckets it has and how many are filled. Try this:

    perl -e ' for $i (1..100) {$h{$e++}=1; print scalar(%h),"\n" }'

    Note that not every key entry occupies a new bucket. That is the case when two different keys produce the same hash number and therefore occupy the same bucket.

      Wow! I just recently couldn't figure out why one function returns me '1/8', so now I've got it. Thanks for that!

        Perl starts with 8 "buckets", very nice and efficient for small hash tables. Each time the hash table doubles all the hash keys have to recomputed (one more bit is added to the binary key). If you know that your hash table is gonna be huge, you can save some computation by setting the number of buckets to be large to begin with by assigning a scalar value to %hash=16384 or whatever, Perl will round to the next higher power of 2 (ie, 16,000 is 16384). This doesn't matter for small hashes, but can save some time for huge ones and avoids the re-computations at doubling boundaries:(8,16,32,64,128,256,512,etc..). For like 90K hash keys, I would start with say 32768 (somewhat less than expected numkeys/2). You have to test to see if this matters at all in your application. I just got a few % increase in one of my apps which didn't amount to much, but then again, if I create something with 120K hash entries, I'm gonna whomp on it for awhile! And the mips to create the table just got dwarfed by what I did with it later. Your mileage may vary!
Re: strange keys with a hash
by zwon (Abbot) on Jan 30, 2009 at 19:46 UTC

    Obviously, if you use longer keys, perl requires more memory to store them. Here's a simple program that creates a hash with given key length and shows how much memory it uses, so you can see how it depends:

    use strict; use warnings; use Devel::Size qw(total_size); my $length = $ARGV[0]; my $key = "a" x $length; my %hash; for ( 1 .. 1_000_000 ) { $hash{$key} = 1; $key++; } print "key size: $length\n"; print `ps -p $$ v`; print total_size( \%hash ), "\n"; __END__ key size: 8 PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND 22551 pts/0 S+ 0:01 0 3 146484 128864 3.1 perl hash_k +eys.pl 8 74388680 key size: 16 PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND 22557 pts/0 R+ 0:01 0 3 146484 128860 3.1 perl hash_k +eys.pl 16 82388680 key size: 32 PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND 22563 pts/0 R+ 0:01 0 3 162116 144488 3.5 perl hash_k +eys.pl 32 98388680 key size: 64 PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND 22569 pts/0 S+ 0:01 0 3 193384 175740 4.3 perl hash_k +eys.pl 64 130388680 key size: 128 PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND 22575 pts/0 R+ 0:01 0 3 255936 238240 5.8 perl hash_k +eys.pl 128 194388680 key size: 256 PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND 22581 pts/0 R+ 0:02 0 3 380928 363244 8.9 perl hash_k +eys.pl 256 322388680 key size: 512 PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND 22587 pts/0 S+ 0:04 0 3 630924 613252 15.1 perl hash_k +eys.pl 512 578388680 key size: 1024 PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND 22597 pts/0 R+ 0:07 0 3 1130868 1113256 27.4 perl hash +_keys.pl 1024 1090388680