dexahex has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

What are the pros and cons of using multidimensional hash emulation compared to using hash of hashes?

Consider the following code:

%the_hash = (
    empl_john_id         => 13,
    empl_john_position   => 'slave',
    empl_bob_id          => 0,
    empl_bob_position    => 'manager',
    client_fred_id       => 2,
    client_fred_phone    => 12345,
    client_goldman_id    => 0,
    client_goldman_phone => 666,
);

%the_hash2 = (
    empl => {
        john => { id => 13, position => 'slave' },
        bob  => { id => 0,  position => 'manager' },
    },
    client => {
        fred    => { id => 2, phone => 12345 },
        goldman => { id => 0, phone => 666 },
    },
);

Will %the_hash2 work faster for fetch/store operations?

Is there a generalized rule on this TIMTOWTDI?

Thanks

Re: multidimensional hash emulation vs hash of hashes
by moritz (Cardinal) on Aug 18, 2011 at 15:30 UTC
    Will %the_hash2 work faster for fetch/store operations?

    You can use Benchmark to find out.

    There are two main advantages of nested hashes. The first is that there is no ambiguity if the separator character (here _) appears in the keys. The second is that you can easily extract hash refs of data that belong together. For example, if you have a subroutine that operates on one person's data, you can just pass $the_hash2{client}{fred} to it, and it has all the data it needs, and no more.
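    For instance (a minimal sketch; print_person is a hypothetical helper, not from the thread):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical helper: works on one person's slice of the structure
    sub print_person {
        my ($name, $person) = @_;
        print "$name: id=$person->{id}, phone=$person->{phone}\n";
    }

    my %the_hash2 = (
        client => {
            fred => { id => 2, phone => 12345 },
        },
    );

    # The nested layout lets us hand over exactly the data the sub needs
    print_person('fred', $the_hash2{client}{fred});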

    In terms of performance I can only guess that emulated hashes typically win by a close margin, but again that should be benchmarked.

Re: multidimensional hash emulation vs hash of hashes
by kennethk (Abbot) on Aug 18, 2011 at 15:36 UTC
    First, anytime you are curious about a performance question, you should Benchmark. Nothing gives you truth like experiment.

    #!/usr/bin/perl -w
    use strict;
    use Benchmark qw':all :hireswallclock';

    my %the_hash = (
        empl_john_id         => 13,
        empl_john_position   => 'slave',
        empl_bob_id          => 0,
        empl_bob_position    => 'manager',
        client_fred_id       => 2,
        client_fred_phone    => 12345,
        client_goldman_id    => 0,
        client_goldman_phone => 666,
    );
    my %the_hash2 = (
        empl => {
            john => { id => 13, position => 'slave' },
            bob  => { id => 0,  position => 'manager' },
        },
        client => {
            fred    => { id => 2, phone => 12345 },
            goldman => { id => 0, phone => 666 },
        },
    );

    cmpthese(10**6, {
        'emulation' => sub { $the_hash{client_goldman_phone}++ },
        'HoH'       => sub { $the_hash2{client}{goldman}{phone}++ },
    });

    yields

    (warning: too few iterations for a reliable count)
                    Rate       HoH emulation
    HoH        1883239/s        --      -27%
    emulation  2564103/s       36%        --

    In a literal sense, multidimensional hash emulation will be faster than using a hash of hashes (at least for this test), mostly because there is no need to dereference the sub-elements (I think). However, this is highly unlikely to be the bottleneck in your code, and multidimensional hash emulation is considered poor form because it makes code harder to understand, and hence harder to debug and maintain. The technique mainly predates the introduction of proper references into Perl.

    For a discussion of hash performance, see A short meditation about hash search performance.

      While certainly a good first step, this benchmark is very one-dimensional (benchmarks usually are :-)

      For example, the choice of data representation also dictates how you process the keys for accessing the information, and the data structure also needs to be built, which might take a significant amount of time as well.
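      To illustrate (a sketch only; the structure shapes and iteration counts here are invented for the comparison):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # Compare the cost of *building* each representation, not just access
      cmpthese(10**5, {
          emulation => sub {
              my %h;
              $h{"empl_person${_}_id"} = $_ for 1 .. 10;
          },
          HoH => sub {
              my %h;
              $h{empl}{"person$_"}{id} = $_ for 1 .. 10;
          },
      });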

Re: multidimensional hash emulation vs hash of hashes
by blue_cowdawg (Monsignor) on Aug 18, 2011 at 15:33 UTC
        Is there a generalized rule on this TIMTOWTDI?

    My own personal feeling on this: I tend to look at hashes the same way I look at structs in C programming. It's a handy-dandy way of grouping related information together so you can act on that group as a single unit. This is not to be confused with C++ classes in any way.

    Since hashes are put together on the fly, Perl has no way of enforcing that all of the elements are present when the hash is created. Let's look at an example of what I'm talking about:

    my $company = {
        employees => {
            fred => {
                fullname   => "Freddy Freeloader",
                position   => "panhandler",
                management => "no",
            },
            clem => {
                realname    => "Clem Kaddiddlehopper",
                job         => "questionable",
                has_reports => "no",
            },
        },
    };
    In the first employee record we see the fields fullname, position and management, and in the second record we see realname, job and has_reports. OK, so which is real and which is Memorex?

    In spite of that, if I am going to group data together without resorting to creating a Perl module, I'll use a HoH structure; just be sure you keep track of what you are putting in there.
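    One lightweight way to get some of that enforcement back, short of a full module, might be a small constructor sub that checks for required fields (a sketch; make_employee is invented for the example):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Build an employee record, dying if a required field is missing
    sub make_employee {
        my (%args) = @_;
        for my $field (qw(fullname position management)) {
            die "missing field '$field'" unless exists $args{$field};
        }
        return { %args };
    }

    my $fred = make_employee(
        fullname   => "Freddy Freeloader",
        position   => "panhandler",
        management => "no",
    );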

    You asked which is faster. I think that is a non sequitur in this case. If you are using field names such as client_goldman_phone and friends, you are burdening yourself as a programmer with coming up with ways to search all of the names, append subfields, and all sorts of cruft that gets in the way of good programming. Think maintainability. When I write Perl code, my assumption is that someday someone else may have to maintain it. If you are doing crufty things, then you are leaving behind crufty code for the folks inheriting it to deal with, and your name will be dragged through programming mud.

    Coding keys %{ $company->{employees} } is easier to write and much easier to read than some sort of fugazy  my @keys = grep /^employees_/, keys %company; in my humble opinion.
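    For comparison, here is roughly what each extraction might look like (a sketch; the data is invented, and the emulated version has to parse its own key names back apart):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $company = { employees => { fred => {}, clem => {} } };
    my %flat    = (
        employees_fred_position => "panhandler",
        employees_clem_job      => "questionable",
    );

    # HoH: the employee names are simply the keys of one sub-hash
    my @hoh_names = keys %{ $company->{employees} };

    # Emulation: filter the flat keys, then parse the names back out;
    # note this breaks if a name itself contains '_' (the ambiguity
    # moritz pointed out above)
    my %seen;
    for my $key (keys %flat) {
        $seen{$1} = 1 if $key =~ /^employees_([^_]+)_/;
    }
    my @flat_names = keys %seen;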

    HTH


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: multidimensional hash emulation vs hash of hashes
by NetWallah (Canon) on Aug 18, 2011 at 16:52 UTC
    If this is to be extendable, to maintain (my) sanity, I would immediately go OO:
    Create packages (classes) for Employee, Client, and, if necessary, a container class for "Company".

    IMHO - performance be damned. Computer overhead is negligible compared to maintenance and coding overhead.
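    As a rough illustration of that direction, a minimal plain-Perl sketch (class and field names invented for the example; a real design might use an OO framework instead):

    #!/usr/bin/perl
    use strict;
    use warnings;

    package Employee;

    # Minimal constructor with two fields
    sub new {
        my ($class, %args) = @_;
        return bless { id => $args{id}, position => $args{position} }, $class;
    }

    sub id       { $_[0]->{id} }
    sub position { $_[0]->{position} }

    package main;

    my $john = Employee->new(id => 13, position => 'slave');
    print $john->position, "\n";    # prints "slave"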

                "XML is like violence: if it doesn't solve your problem, use more."

Re: multidimensional hash emulation vs hash of hashes
by sundialsvc4 (Abbot) on Aug 18, 2011 at 17:08 UTC

    Quite frankly, I have begun to appreciate the phrase that "at 100 billion ops a second, no one can hear you scream." I think that you should simply focus on writing clear, easily-maintainable code, done in some well-documented fashion which avoids the proliferation of "thousands of statements scattered all over the code, every one of which must be changed in exactly the same way at exactly the same time." You should give a passing thought to the one and only thing which might actually slow this program down, namely: page faults. But also carefully consider the human effort required to maintain it going forward, weighed properly against the risk associated with inadequate testing or some maintenance oversight.