PerlMonks  

Efficient giant hashes

by crenz (Priest)
on Mar 10, 2005 at 13:14 UTC ( [id://438233] )

crenz has asked for the wisdom of the Perl Monks concerning the following question:

We have a system here that uses very large data structures. For some projects, the size of e.g. a certain hash can reach 100_000 elements. The script becomes notably slower when it reaches this size. Is there any way to speed this up? I was thinking of using tie with a suitable Tie:: module that is faster than perl's default implementation. I guess it should be possible to be much faster if you limit yourself to storing literal values only, not references. Any recommendations?
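For illustration only (not part of the original question): a minimal sketch of the kind of restricted tie being considered, using the standard Tie::StdHash base class and a hypothetical package name. As the replies below point out, every access to a tied hash goes through a Perl method call, so a tie of this sort is unlikely to beat a plain hash on speed.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical scalar-only tied hash: refuses references so that only
    # literal values are stored.  Every access still costs a method call,
    # which is why tie rarely makes hash access faster.
    package Tie::ScalarOnlyHash;
    use Tie::Hash;
    our @ISA = ('Tie::StdHash');

    sub STORE {
        my ($self, $key, $value) = @_;
        die "references not allowed\n" if ref $value;
        $self->SUPER::STORE($key, $value);
    }

    package main;

    tie my %h, 'Tie::ScalarOnlyHash';
    $h{foo} = 42;          # fine
    # $h{bar} = [1, 2];    # would die
    print "$h{foo}\n";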

Replies are listed 'Best First'.
Re: Efficient giant hashes
by dragonchild (Archbishop) on Mar 10, 2005 at 13:28 UTC
    The script becomes notably slower when reaching this size.

    100_000 elements is going to be, roughly, 1-2K per element. That's 100M-200M of RAM. Unless each element is some huge data structure in its own right (like an array of hashes or somesuch), you're probably not doing a lot of swapping to disk or anything like that.

    It sounds like your algorithm(s) aren't scaling with your data structures. For example, doing

    foreach my $key (keys %hash) { ... }
    is going to be considerably slower than the equivalent
    while (my ($key, $value) = each %hash) { ... }
    foreach has to build the list in memory, then iterate over it. each will only bring one value in at a time.

    My bet is on your algorithms, not your data structures. Maybe if you posted a few snippets of how you use this massive hash, we might be able to help you out.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      foreach my $key (keys %hash) { }
      Is that actually slower? Got benchmarks to back that up?

      And if it IS, then why should it be? There are specialized magic iterators when the parser recognizes an impending iteration with constants, like for (1 .. 1_000_000). Why does perl not implement an automatic iterator when the parser notices a simple sort-free for (keys %foo)? That's such a common idiom I would be amazed it wasn't getting special attention.

      --
      [ e d @ h a l l e y . c c ]

        Anonymonk beat me to the punch. But the reason why foreach (keys) and while (each) behave differently has nothing to do with keys being an iterator or not. (Well, it does, but not really.) It has to do with the difference in behavior between foreach and while. foreach is defined to operate on a list. If you give it a list, then you're good. If, however, you give it a function or keyword, then it has to call that function/keyword and construct a temporary list from the return value(s). Incidentally, this is why the following doesn't DWIM:
        my @x = 1 .. 5;
        sub get_x { @x }

        foreach my $v ( get_x() ) {
            $v *= 2;
        }

        print "@x\n";

        while, however, re-evaluates its condition every time. This is why the while-loop goes infinite, but the foreach-loop doesn't.

        while (my ($k, $v) = each %hash) {
            $hash{ $k . 'a' } = 1;    # each keeps seeing the keys added during the loop
        }

        foreach my $k (keys %hash) {
            $hash{ $k . 'a' } = 1;    # keys built the full key list before the loop started
        }

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

        Got benchmarks to back that up?
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw 'cmpthese';

        our %hash = map {$_, $_} 1 .. 100_000;

        cmpthese (-10, {
            keys  => '$a = 0; for (keys %hash) {$a += $_}',
            while => '$b = 0; while ($_ = each %hash) {$b += $_}',
        });
        __END__
                 Rate  keys while
        keys   5.23/s    --  -26%
        while  7.05/s   35%    --
        Now, that's for a simple hash. If the hash is tied to a huge DBM file, the results would be far more dramatic.
        Why does perl not implement an automatic iterator when the parser notices a simple sort-free for (keys %foo)? That's such a common idiom I would be amazed it wasn't getting special attention.
        Because it isn't a common idiom, and changing it to an iterator changes the results. Perl has an iterator, and it's called each. You can't change keys to an iterator:
        for my $k1 (keys %hash) {
            for my $k2 (keys %hash) {
            }
        }

        for my $k (keys %hash) {
            %hash = ();
            ....
        }
        Changing either of the "simple sort-free for (keys %foo)" to an iterator will break the code.
Re: Efficient giant hashes
by Joost (Canon) on Mar 10, 2005 at 13:33 UTC
    What do you do with the data? Do you construct a new structure often? Do you need to iterate through the whole structure or do you typically only need to access a few elements? Is this in a persistent application or in a (CGI) script that needs to start for each request?

    100_000 elements is not that much (it takes about 11 Mb on my machine if filled with simple integers), so what do you put in the keys and values?

Re: Efficient giant hashes
by BrowserUk (Patriarch) on Mar 10, 2005 at 19:00 UTC

    I've evolved various techniques for constructing less memory-hungry data structures that have hash-like properties. For one example, see A (memory) poor man's <strike>hash</strike> lookup table.

    There are caveats with each method, inasmuch as they are not general-purpose hash replacements--Perl's hashes are about as good as you will get if you need their full power. But for any individual application, there are often more compact representations that can be utilised.

    The trick is to work out what subset of the properties of a hash you require, and look for a solution that meets those requirements with a reduced memory footprint. Obviously that will come at the sacrifice of other properties that you don't need for the given application.

    To be able to suggest an appropriate solution requires that you describe the use you are making of your large hash. Any solution that might result would probably not be a "module".
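    Purely as an illustrative sketch (not the technique from that node, and assuming keys are small non-negative integers, values fit in 32 bits, and made-up helper names store_val and fetch_val): one compact alternative is to keep all values in a single packed string and compute each value's offset from its key, trading the generality of a hash for a much smaller footprint.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Assumption: keys are integers 0 .. 99_999 and values are
        # unsigned 32-bit integers, so a key maps directly to an offset.
        my $n     = 100_000;
        my $table = "\0" x ($n * 4);    # pre-allocate 4 bytes per value

        sub store_val {
            my ($key, $value) = @_;
            substr($table, $key * 4, 4) = pack('N', $value);
        }

        sub fetch_val {
            my ($key) = @_;
            return unpack('N', substr($table, $key * 4, 4));
        }

        store_val($_, $_ * 2) for 0 .. $n - 1;
        print fetch_val(12_345), "\n";    # 24690

    The packed string holds 100_000 values in roughly 400KB, against the several megabytes reported for the equivalent hash elsewhere in this thread, but you give up string keys and most other hash conveniences.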


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
Re: Efficient giant hashes
by Anonymous Monk on Mar 10, 2005 at 14:19 UTC
    If you suddenly see a notable drop in performance, it's most likely that you have hit the threshold where it starts swapping. I'm a bit surprised that you hit the limit so soon; what are you storing in your hash?
    use Devel::Size 'total_size';

    my %hash = map {$_, rand() . ""} 1 .. 100_000;
    print total_size(\%hash), "\n";
    __END__
    7713194
    That's less than 8Mb, which shouldn't be much of a problem on a modern machine.

    But fact is, Perl is memory hungry. The more structures you have, the more memory you use. The more complex the structures are, the more memory you use.

    Speeding it up is only possible by using less memory at a time. Using tie as you propose is not going to solve it. If there were a known way of speeding up hash access, it would already be in the core! Not to mention that tying is a slow mechanism: for each hash access, no matter how trivial, a Perl subroutine gets called. It is possible to use the tie mechanism to store the hash on disk instead of in memory, but unless you would otherwise run out of memory, that's not going to change your performance for the better. Regular disk access is not likely to be faster than accessing your swap area.
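    For reference, a minimal sketch of the disk-backed tie mentioned above, using the core SDBM_File module (the file name giant_hash is arbitrary). As noted, this trades speed for memory, so it only helps when the hash would otherwise not fit in RAM.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Fcntl;          # for O_RDWR, O_CREAT
        use SDBM_File;

        # Tie the hash to an on-disk SDBM database (creates giant_hash.pag/.dir).
        tie my %hash, 'SDBM_File', 'giant_hash', O_RDWR|O_CREAT, 0666
            or die "Cannot tie: $!";

        $hash{$_} = $_ * 2 for 1 .. 1000;    # stored on disk, not in memory
        print "$hash{500}\n";                # 1000

        untie %hash;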

      Speeding it up is only possible by using less memory at a time.

      Not if what he is trying to do has a complexity greater than O(n), i.e., if he is entering another loop of some kind for each element (or even for just every ith element).
        Well, yeah, anything else in his program could be made faster too. But since we don't know the rest of his program, and his question isn't about that, it's nothing more than pure speculation, and rather pointless.

        Perhaps he's recompiling perl each time he inserts a hash element.

Re: Efficient giant hashes
by crenz (Priest) on Mar 11, 2005 at 09:58 UTC

    Thanks all for your comments, they have been very helpful. My colleague is sure that the slowness of the script is due to using (big) hashes of hashes of hashes, and that it would be much faster to reimplement the system in C. I'm not so sure that would be worth the effort, and would like to do some profiling first to find out what's going on. Your comments are good ammunition for that :-)
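    One cheap experiment before reaching for C (a sketch with made-up dimensions, not the real data): benchmark the nested hash-of-hash-of-hash layout against a single flat hash with composite keys, since every extra level of nesting costs an extra hash lookup and dereference per access. Profiling with Devel::DProf (perl -d:DProf script.pl, then dprofpp) would show where the time really goes.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # 50 x 50 x 40 = 100_000 leaf elements, once nested and once flat.
        my (%nested, %flat);
        for my $i (1 .. 50) {
            for my $j (1 .. 50) {
                for my $k (1 .. 40) {
                    $nested{$i}{$j}{$k} = 1;
                    $flat{"$i;$j;$k"}   = 1;
                }
            }
        }

        cmpthese(-5, {
            nested => sub { my $v = $nested{25}{25}{20} },
            flat   => sub { my $v = $flat{'25;25;20'} },
        });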

      I think you're the smarter of the two of you. Rewrites, in my experience, should be a last resort. There is so much history attached to legacy products that it is nearly impossible to incorporate the years of bug fixes and gotchas that previous employees have packed in.

      I'm not saying it's never warranted, but it certainly shouldn't be one of the first options, IMO.
Re: Efficient giant hashes
by japh (Friar) on Mar 12, 2005 at 10:37 UTC
    If your problem is lookup time, consider making your hash multidimensional so perl has to look at fewer entries. This is akin to moving /home/joe to /home/j/joe on a system with thousands of users -- daemons only have to search 1/26th of the filespace to find their target file.
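    A hedged sketch of that idea (the data and the %by_initial name are made up): split one big hash into a hash of smaller sub-hashes keyed on the first character of each key, so every lookup goes through a much smaller structure.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical flat data: user name => home directory.
        my %users = (
            joe  => '/home/joe',
            jane => '/home/jane',
            bob  => '/home/bob',
        );

        # Partition by first character, analogous to /home/j/joe above.
        my %by_initial;
        while (my ($name, $home) = each %users) {
            $by_initial{ substr($name, 0, 1) }{$name} = $home;
        }

        # Two-step lookup through the smaller sub-hash.
        my $who = 'joe';
        print $by_initial{ substr($who, 0, 1) }{$who}, "\n";    # /home/joe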
