PerlMonks  

Efficient giant hashes

by crenz (Priest)
on Mar 10, 2005 at 13:14 UTC ( [id://438233] )

crenz has asked for the wisdom of the Perl Monks concerning the following question:

We have a system here that uses very large data structures. For some projects, the size of e.g. a certain hash can reach 100_000 elements. The script becomes notably slower when it reaches this size. Is there any way to speed this up? I was thinking of using tie with a suitable Tie:: module that is faster than perl's default implementation. I guess it should be possible to be much faster if you limit yourself to storing literal values only, not references. Any recommendations?
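For illustration only (not part of the original question): a minimal sketch of the kind of restricted tie being considered, using the standard Tie::StdHash base class and a hypothetical package name. As the replies below point out, every access to a tied hash goes through a Perl method call, so a tie of this sort is unlikely to beat a plain hash on speed.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical scalar-only tied hash: refuses references so that only
    # literal values are stored.  Every access still costs a method call,
    # which is why tie rarely makes hash access faster.
    package Tie::ScalarOnlyHash;
    use Tie::Hash;
    our @ISA = ('Tie::StdHash');

    sub STORE {
        my ($self, $key, $value) = @_;
        die "references not allowed\n" if ref $value;
        $self->SUPER::STORE($key, $value);
    }

    package main;

    tie my %h, 'Tie::ScalarOnlyHash';
    $h{foo} = 42;          # fine
    # $h{bar} = [1, 2];    # would die
    print "$h{foo}\n";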

Replies are listed 'Best First'.
Re: Efficient giant hashes
by dragonchild (Archbishop) on Mar 10, 2005 at 13:28 UTC
    The script becomes notably slower when reaching this size.

    100_000 elements is going to be, roughly, 1-2K per element. That's 100M-200M of RAM. Unless each element is some huge data structure in its own right (like an array of hashes or somesuch), you're probably not doing a lot of swapping to disk or anything like that.

    It sounds like your algorithm(s) aren't scaling with your data structures. For example, doing

    foreach my $key (keys %hash) { ... }
    is going to be considerably slower than the equivalent
    while (my ($key, $value) = each %hash) { ... }
    foreach has to build the list in memory, then iterate over it. each will only bring one value in at a time.

    My bet is on your algorithms, not your data structures. Maybe if you posted a few snippets of how you use this massive hash, we might be able to help you out.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      foreach my $key (keys %hash) { }
      Is that actually slower? Got benchmarks to back that up?

      And if it IS, then why should it be? There are specialized magic iterators when the parser recognizes an impending iteration with constants, like for (1 .. 1_000_000). Why does perl not implement an automatic iterator when the parser notices a simple sort-free for (keys %foo)? That's such a common idiom I would be amazed it wasn't getting special attention.

      --
      [ e d @ h a l l e y . c c ]

        Anonymonk beat me to the punch. But the reason why foreach (keys) and while (each) behave differently has nothing to do with keys being an iterator or not. (Well, it does, but not really.) It has to do with the difference in behavior between foreach and while. foreach is defined to operate on a list. If you give it a list, then you're good. If, however, you give it a function or keyword, then it has to call that function/keyword and construct a temporary list from the return value(s). Incidentally, this is why the following doesn't DWIM:
        my @x = 1 .. 5;
        sub get_x { @x }

        foreach my $v ( get_x() ) {
            $v *= 2;
        }

        print "@x\n";

        while, however, re-evaluates its condition every time. This is why the while-loop goes infinite, but the foreach-loop doesn't.

        while (my ($k, $v) = each %hash) {
            $hash{ $k . 'a' } = 1;    # each keeps seeing the keys added during the loop
        }

        foreach my $k (keys %hash) {
            $hash{ $k . 'a' } = 1;    # keys built the full key list before the loop started
        }

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

        Got benchmarks to back that up?
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw 'cmpthese';

        our %hash = map {$_, $_} 1 .. 100_000;

        cmpthese (-10, {
            keys  => '$a = 0; for (keys %hash) {$a += $_}',
            while => '$b = 0; while ($_ = each %hash) {$b += $_}',
        });
        __END__
                 Rate  keys while
        keys   5.23/s    --  -26%
        while  7.05/s   35%    --
        Now, that's for a simple hash. If the hash is tied to a huge DBM file, the results would be far more dramatic.
        Why does perl not implement an automatic iterator when the parser notices a simple sort-free for (keys %foo)? That's such a common idiom I would be amazed it wasn't getting special attention.
        Because it isn't a common idiom, and changing it to an iterator changes the results. Perl has an iterator, and it's called each. You can't change keys to an iterator:
        for my $k1 (keys %hash) {
            for my $k2 (keys %hash) {
            }
        }

        for my $k (keys %hash) {
            %hash = ();
            ....
        }
        Changing either of the "simple sort-free for (keys %foo)" to an iterator will break the code.
Re: Efficient giant hashes
by Joost (Canon) on Mar 10, 2005 at 13:33 UTC
    What do you do with the data? Do you construct a new structure often? Do you need to iterate through the whole structure or do you typically only need to access a few elements? Is this in a persistent application or in a (CGI) script that needs to start for each request?

    100_000 elements is not that much (it takes about 11 Mb on my machine if filled with simple integers), so what do you put in the keys and values?

Re: Efficient giant hashes
by BrowserUk (Patriarch) on Mar 10, 2005 at 19:00 UTC

    I've evolved various techniques for constructing less memory-hungry data structures that have hash-like properties. For one example, see A (memory) poor man's <strike>hash</strike> lookup table.

    There are caveats with each method, inasmuch as they are not general-purpose hash replacements--Perl's hashes are about as good as you will get if you need their full power. But for any individual application, there are often more compact representations that can be utilised.

    The trick is to work out what subset of the properties of a hash you require, and look for a solution that meets those requirements with a reduced memory footprint. Obviously that will come at the sacrifice of other properties that you don't need for the given application.

    To be able to suggest an appropriate solution requires that you describe the use you are making of your large hash. Any solution that might result would probably not be a "module".
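    Purely as an illustrative sketch (not the technique from that node, and assuming keys are small non-negative integers, values fit in 32 bits, and made-up helper names store_val and fetch_val): one compact alternative is to keep all values in a single packed string and compute each value's offset from its key, trading the generality of a hash for a much smaller footprint.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Assumption: keys are integers 0 .. 99_999 and values are
        # unsigned 32-bit integers, so a key maps directly to an offset.
        my $n     = 100_000;
        my $table = "\0" x ($n * 4);    # pre-allocate 4 bytes per value

        sub store_val {
            my ($key, $value) = @_;
            substr($table, $key * 4, 4) = pack('N', $value);
        }

        sub fetch_val {
            my ($key) = @_;
            return unpack('N', substr($table, $key * 4, 4));
        }

        store_val($_, $_ * 2) for 0 .. $n - 1;
        print fetch_val(12_345), "\n";    # 24690

    The packed string holds 100_000 values in roughly 400KB, against the several megabytes reported for the equivalent hash elsewhere in this thread, but you give up string keys and most other hash conveniences.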


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
Re: Efficient giant hashes
by Anonymous Monk on Mar 10, 2005 at 14:19 UTC
    If you suddenly see a notable drop in performance, it's most likely that you have hit the threshold where it starts swapping. I'm a bit surprised that you hit the limit so soon; what are you storing in your hash?
    use Devel::Size 'total_size';

    my %hash = map {$_, rand() . ""} 1 .. 100_000;
    print total_size(\%hash), "\n";
    __END__
    7713194
    That's less than 8Mb, which shouldn't be much of a problem on a modern machine.

    But fact is, Perl is memory hungry. The more structures you have, the more memory you use. The more complex the structures are, the more memory you use.

    Speeding it up is only possible by using less memory at a time. Using tie as you propose is not going to solve it. If there were a known way of speeding up hash access, it would already be in the core! Not to mention that tying is a slow mechanism: for each hash access, no matter how trivial, a Perl subroutine gets called. It is possible to use the tie mechanism to store the hash on disk instead of in memory, but unless you would otherwise run out of memory, that's not going to change your performance for the better. Regular disk access is not likely to be faster than accessing your swap area.
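    For reference, a minimal sketch of the disk-backed tie mentioned above, using the core SDBM_File module (the file name giant_hash is arbitrary). As noted, this trades speed for memory, so it only helps when the hash would otherwise not fit in RAM.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Fcntl;          # for O_RDWR, O_CREAT
        use SDBM_File;

        # Tie the hash to an on-disk SDBM database (creates giant_hash.pag/.dir).
        tie my %hash, 'SDBM_File', 'giant_hash', O_RDWR|O_CREAT, 0666
            or die "Cannot tie: $!";

        $hash{$_} = $_ * 2 for 1 .. 1000;    # stored on disk, not in memory
        print "$hash{500}\n";                # 1000

        untie %hash;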

      Speeding it up is only possible by using less memory at a time.

      Not if what he is trying to do has a complexity greater than O(n), i.e., if he is entering another loop of some kind for each element (or even for just every ith element).
        Well, yeah, anything else in his program could be made faster too. But since we don't know the rest of his program, and his question isn't about that, it's nothing more than pure speculation, and rather pointless.

        Perhaps he's recompiling perl each time he inserts a hash element.

Re: Efficient giant hashes
by crenz (Priest) on Mar 11, 2005 at 09:58 UTC

    Thanks all for your comments, they have been very helpful. My colleague is sure that the slowness of the script is due to using (big) hashes of hashes of hashes, and that it would be much faster to reimplement the system in C. I'm not so sure that would be worth the effort, and would like to do some profiling first to find out what's going on. Your comments are good ammunition for that :-)
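    One cheap experiment before reaching for C (a sketch with made-up dimensions, not the real data): benchmark the nested hash-of-hash-of-hash layout against a single flat hash with composite keys, since every extra level of nesting costs an extra hash lookup and dereference per access. Profiling with Devel::DProf (perl -d:DProf script.pl, then dprofpp) would show where the time really goes.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # 50 x 50 x 40 = 100_000 leaf elements, once nested and once flat.
        my (%nested, %flat);
        for my $i (1 .. 50) {
            for my $j (1 .. 50) {
                for my $k (1 .. 40) {
                    $nested{$i}{$j}{$k} = 1;
                    $flat{"$i;$j;$k"}   = 1;
                }
            }
        }

        cmpthese(-5, {
            nested => sub { my $v = $nested{25}{25}{20} },
            flat   => sub { my $v = $flat{'25;25;20'} },
        });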

      I think you're the smarter of the two of you. Rewrites, in my experience, should be a last resort. There is so much history attached to legacy products that it is nearly impossible to incorporate the years of bug fixes and gotchas that previous employees have packed in.

      I'm not saying it's never warranted, but it certainly shouldn't be one of the first options, IMO.
Re: Efficient giant hashes
by japh (Friar) on Mar 12, 2005 at 10:37 UTC
    If your problem is lookup time, consider making your hash multidimensional so perl has to look at fewer entries. This is akin to moving /home/joe to /home/j/joe on a system with thousands of users -- daemons only have to search 1/26th of the filespace to find their target file.
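    A hedged sketch of that idea (the data and the %by_initial name are made up): split one big hash into a hash of smaller sub-hashes keyed on the first character of each key, so every lookup goes through a much smaller structure.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical flat data: user name => home directory.
        my %users = (
            joe  => '/home/joe',
            jane => '/home/jane',
            bob  => '/home/bob',
        );

        # Partition by first character, analogous to /home/j/joe above.
        my %by_initial;
        while (my ($name, $home) = each %users) {
            $by_initial{ substr($name, 0, 1) }{$name} = $home;
        }

        # Two-step lookup through the smaller sub-hash.
        my $who = 'joe';
        print $by_initial{ substr($who, 0, 1) }{$who}, "\n";    # /home/joe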
