Wiggins has asked for the wisdom of the Perl Monks concerning the following question:

I am looking at a possible "time/space" trade-off problem. I am evaluating storing a set of data that is keyed by 3 attributes:

attr1 is a number in the range 200-35000
attr2 is a string, 6-9 characters
attr3 is a string of 10 characters

The hash key is "attr1.attr2.attr3", The hash content is a ref to an anon array.

Could there be a significant performance improvement by maintaining a second hash of just 'attr1' to pre-determine if any keys start with attr1?

my $arrayRef;
if ( defined $attr1Hash{$attr1} ) {
    my $key = "$attr1.$attr2.$attr3";
    $arrayRef = $fullHash{$key};
}
if ( defined $arrayRef ) { ... }
vs
my $arrayRef;
my $key = "$attr1.$attr2.$attr3";
$arrayRef = $fullHash{$key};
if ( defined $arrayRef ) { ... }
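A quick Benchmark sketch of the two code paths above. The data, sizes, and the deliberate miss on attr1 are invented for illustration, not the OP's real workload; results will depend heavily on the actual hit/miss ratio:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data (assumption: real keys and sizes will differ).
my (%fullHash, %attr1Hash);
for my $n (200 .. 1200) {
    $fullHash{"$n.abcdef.abcdefghij"} = [];
    $attr1Hash{$n} = 1;
}

# A lookup that will miss: this attr1 is not present at all.
my ($attr1, $attr2, $attr3) = (99_999, 'abcdef', 'abcdefghij');

cmpthese(200_000, {
    # Guard hash first: skips the long-key lookup on a miss,
    # but costs an extra lookup on every hit.
    guarded => sub {
        my $arrayRef;
        if (exists $attr1Hash{$attr1}) {
            $arrayRef = $fullHash{"$attr1.$attr2.$attr3"};
        }
        return defined $arrayRef;
    },
    # Single lookup on the concatenated key.
    direct => sub {
        my $arrayRef = $fullHash{"$attr1.$attr2.$attr3"};
        return defined $arrayRef;
    },
});
```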
UPDATE:

I have a process that is watching (à la 'less') the system syslog file for pre-determined message keywords, which are reported across the net to a higher authority. Some of the tests done on a record may cause sub-shells to be spawned to determine system state.
But basically it is pretty straightforward.

Enhancement creep:

Generate a special event report if any unique process (pid), in a specified time period, creates more than 3 of the syslog messages. I have gone from a stateless scan to a time-limited historic context.

The module I am working on holds the 'state'

$EventHash{$pid} = [ $create_ts, $last_upd, $ev_cnt ];
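A minimal sketch of how that per-pid state might be maintained, assuming the "more than 3 messages in a window" rule from the description. The window length, the sub name record_event, and the reset-on-expiry policy are all assumptions, not the OP's actual code:

```perl
use strict;
use warnings;

my %EventHash;          # $pid => [ $create_ts, $last_upd, $ev_cnt ]
my $WINDOW    = 300;    # assumed: 5-minute window, in seconds
my $THRESHOLD = 3;      # more than 3 messages triggers a report

# Record one matching syslog message for $pid seen at time $now.
# Returns true when the special event report should fire.
sub record_event {
    my ($pid, $now) = @_;
    my $state = $EventHash{$pid};
    if (!$state || $now - $state->[0] > $WINDOW) {
        # No history, or the old window expired: start a fresh one.
        $EventHash{$pid} = [ $now, $now, 1 ];
        return 0;
    }
    $state->[1] = $now;                      # last update timestamp
    return ++$state->[2] > $THRESHOLD;       # 4th hit in window fires
}
```

Stale pids could be swept out periodically by walking %EventHash and deleting entries whose last-update timestamp is older than the window.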

It is always better to have seen your target for yourself, rather than depend upon someone else's description.

Replies are listed 'Best First'.
Re: Hash space/ time tradeoff
by roboticus (Chancellor) on Dec 27, 2013 at 17:50 UTC

    Wiggins:

    It depends on the distribution of your key values. If there are many duplicated values for an attribute, you may save a good deal of memory by making a hash level just for that attribute. However, each hash and hash entry will have some overhead. So if the attributes are largely unique, splitting the hash into multiple levels will incur a bit more overhead than simply concatenating the attributes together as you're doing.

    I'd check the distribution of attr1 first. It's long (averaging 17K per entry[1]), so the potential for savings is pretty good. If there's enough duplication, then go ahead and split out the hash level. As far as access time goes, I wouldn't worry about it until/unless the code is too slow for your purposes. Hashes are pretty quick for access, so you'd usually need to do *quite* a few of them for the access time to start being noticeable. If your code is too slow for your purposes, we'd have to know a bit more about your situation before offering useful advice.

    Note: Yes, I'm assuming that the length of your attr1 is distributed uniformly, and that's not necessarily the case. Even 200 bytes per key can give you some savings, though, so I'm not too worried about the distribution of the key lengths, just whether there's enough duplication to allow a savings of space or not.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Hash space/ time tradeoff
by Random_Walk (Prior) on Dec 27, 2013 at 16:59 UTC

    How about using a multi-dimensional hash: $hash{attr1}{attr2}{attr3} = $datum. You could also use an array of hashes of hashes: $ray[attr1]{attr2}{attr3} = $datum;. The latter may use a bit more space, but a 35K array is not so big, and the initial 'if defined' test will be faster on an array than on a hash.
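    For example (attribute values invented for illustration):

```perl
use strict;
use warnings;

# Multi-level hash: one level per attribute, autovivified on assignment.
my %hash;
$hash{10500}{'foobar'}{'abcdefghij'} = [ 'datum' ];

# Array at the first level: attr1 is numeric with a known max (~35000),
# so a plain array index works and is cheap.
my @ray;
$ray[10500]{'foobar'}{'abcdefghij'} = [ 'datum' ];

# First-level existence tests:
print "seen\n" if exists $hash{10500};   # hash lookup
print "seen\n" if defined $ray[10500];   # array index, slightly faster
```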

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: Hash space/ time tradeoff
by BrowserUk (Patriarch) on Dec 28, 2013 at 02:02 UTC
    Could there be a significant performance improvement by maintaining a second hash of just 'attr1' to pre-determine if any keys start with attr1?

    No!

    The time saved hashing a slightly shorter key for the first lookup will be completely negated by the need to do a second lookup if the first hits.

    And using 3 keys will be significantly slower.

    Moreover, multilevel hashes require considerably more memory (and memory management).


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Hash space/ time tradeoff
by GrandFather (Saint) on Dec 28, 2013 at 00:58 UTC

    Is the code currently running too slowly or hitting a memory limit? If not, structure your hash so it is easiest and clearest for coding. You can waste a lot of time solving issues you don't have.

    True laziness is hard work
      After all, if it ain't broke, don't fix it. On the other hand, a stitch in time saves nine. On the third hand, a penny saved is a penny earned. Or maybe that was the fourth hand.

      The OP has his reasons for asking the question, and as some other commenters have shown, it's a good jumping-off point for discussion of how to reason about different collections of data indexed in different ways. Given that the OP could have written both approaches, benchmarked them, and chosen whichever came out best in about 10 minutes, I don't see "a lot of time" being "wasted."

        Given that the OP could have written both approaches, benchmarked them, and chosen whichever came out best in about 10 minutes ...

        Actually it's likely that the OP couldn't do that. He spent more time than that writing up the question, then waited most of an hour for the first answer. So it looks like a 10-minute fix wasn't available, and there's the first hour wasted.

        But the advice doesn't pertain to just this single question or to just the OP. As a general thing write code for clarity first then tweak for speed if you need to. Most often scripts are plenty fast enough as first written. If they're not fast enough then profiling is a better first step than guessing and benchmarking. Often the answer to speed problems is found in a different high level approach than low level tweaking in any case. Tweaking may get a factor of two speed improvement. Finding a better algorithm can get orders of magnitude speed improvements.

        So, pick your battles and don't sweat the small stuff.

        True laziness is hard work
Re: Hash space/ time tradeoff
by Laurent_R (Canon) on Dec 27, 2013 at 19:03 UTC
    I doubt that having the secondary hash will save you much time in the setting you are describing, but only benchmarking with actual data can give you the answer. Using an auxiliary hash once enabled me to divide by about 10 the run time of a program, but the setting was very different: the hash made it possible to avoid a combinatorial explosion by removing, early on, data that was not going to match anything anyway. Your case appears to be quite different.
Re: Hash space/ time tradeoff
by runrig (Abbot) on Dec 30, 2013 at 18:50 UTC
    if ( defined $attr1Hash{$attr1} ) { my $key = "$attr1.$attr2.$attr3"; $arrayRef = $fullHash{$key}; }
    You don't need to build your own keys with perl's multidimensional array emulation:
    $arrayRef = $fullHash{$attr1,$attr2,$attr3};
    See perlvar
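    A small demonstration that the comma subscript is just a join on the $; subscript separator (sample values invented):

```perl
use strict;
use warnings;

my %fullHash;
my ($attr1, $attr2, $attr3) = (10500, 'foobar', 'abcdefghij');

# Perl joins comma-separated hash subscripts with $; (default "\034"):
$fullHash{$attr1, $attr2, $attr3} = [ 'datum' ];

# ...so this names exactly the same entry:
my $same = $fullHash{ join($;, $attr1, $attr2, $attr3) };
```

    One caveat: if an attribute value can ever contain the $; character itself, the composite keys become ambiguous.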
      See perlvar

      I think you mean perldata.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I think you mean..
        No, I meant what I said...I didn't even know about (or forgot about?) that section in perldata. It's funny that the section you linked to has a link to the section I linked to, but not vice-versa. Wouldn't it be nice if they linked to each other?
Re: Hash space/ time tradeoff
by locked_user sundialsvc4 (Abbot) on Dec 30, 2013 at 01:42 UTC

    Truly, it depends on how you intend to use the data, having (in whatever fashion, by whatever means ...) collected and cataloged it. If you are constantly going to be presented with 3-tuples of specific attributes, and the only question is whether that specific 3-tuple does or does not exist, then your present strategy of using a composite key consisting of a concatenation of the three values is going to be (IMHO) the fastest thing that you can hope for.

    If there are other things that you need to do with these data, especially searching on partial keys, then “a hash-table” starts to lose its lustre pretty darned quickly, and other possibilities (such as my oft-suggested “SQLite database file”) start to look pretty shiny. Of course, you can have both.

    A hash-table is very good at one, but only one, specific thing: determining whether or not an exact match exists. Yes or No. Although you certainly can build multiple hashes, and in Perl you can “reference” the same data-object in multiple hashes without duplicating the actual content, the Swamp of Diminishing Returns is pretty deep. Be careful to step back and take a good hard look at the whole big picture. Hash-tables are a fantastic-performance, but extremely special-purpose, data structure. (And Perl’s implementation of them is second to none.)