By booting a minimal configuration of my OS, I managed to persuade Devel::Size to report the size/total_size of the standard hash on my machine and it confirms your figures of ~= 32 & 44 MB respectively. However, I also used a ram disc to reduce my memory capacity in increments and found that 98 MB is the minimum required to create the 1 Million key hash. Any less and with swapping turned off, an out-of memory error occurs before completion.
My OP was purely an exploration of an idea. The intent was to engender discussion re: it's merits versus other mechanisms available or that might be made available for use from P5 in the near term. It's unfortunate that rather than getting to discuss the idea and how it might used or implemented, I ended up having to defend my non-existant "attack" on perl's hashes and justify my claims regarding memory consumption. C'est la vie.
I have now implemented a low-memory hash based on the idea I described in the original post. It is a real hash, ensuring no duplication and supporting values as well as keys. It has limitations, it doesn't support storage of references and keys cannot contain one character used internally (currently ascii 255). Like Tie::SubstrHash, it uses scalars to store the keys and values. However, it doesn't require fixed sized keys/values or table size. These all grow dynamically as required. It currently accepts a single extra parameter at the tie which is used to determine which part of the key is used for indexing.
Currently it uses 16 MB to store the one million keys with undef values. Insertion takes 500 secs and traversal 380 seconds. Almost all of the degradation relative to my original tests is due to the overheads of the tie mechanism. It's these penalties incurred when extending P5's generic datatypes, whether through tieing or OO that make me such an enthusiast for P6 (or maybe Ponie as an interim).
I'm still testing, optimising and documenting, prior to making it available for people to evaluate. Like all new code, it will probably be fragile until it has been exercised for a while in some 'real' applications, but that doesn't put me off from trying new/different approaches to solving problems.
I didn't say that I consider cache optimisation unimportant. I do doubt that it is possible, in a meaningful way, for cross-platform development, or even practical for most purposes unless it is performed by compilers or interpreters tuned to the target platform. In the documentation for the Judy arrays, they note that the laboriously hand-crafted L1 cache optimisations they developed for it will probably not be (as) beneficial on IA32 or some RISC processors which indicates part of the problem.
Even when correctly targeted, the benefits will often be transitory, as there is a second part to the problem. The Judy arrays are designed to allow large portions of the data structures to be searched/ traversed whilst avoid cache fill delays (on the targeted processor(s)), but the optimisations are only effective whilst the cache remains coherent.
If the application using them traverses the entire structure from a very localised piece of code that doesn't call other code that would cause the caches to be overwritten, then it will fully benefit from the optimisation. But if the code traversing the data structure, calls other code that accesses a different data structure each time through the loop, then the cache will need to be re-filled for each element of the array.
And that's the crux of the problem, cache optimisations only ever work at the local level, but the code utilising the optimised structures is rarely so confined. One piece of code using two highly cache optimised structures alternately will destroy the optimisations as it switches between them. The moment you have multiple applications using the same cache, there is no guarantee that the cache will remain coherent for even a single element access. The task will be interrupted randomly by the scheduler unless extraordinary measure (CritSecs or similar) are deployed to combat this, and whichever other task is scheduled next will overwrite the cache.
I did read somewhere about a concept of runtime optimisation performed by micro-code, but if I recall correctly this was aimed at highly parallelised algorithms like FFT used in weather analysis and similar high-volume, repetative-data applications. I don't recall whether the concept was actually being implemented, or if it was just blue sky, but I have my doubts as to it's applicability to generalise processing and processors -- least-wise given the current state of processor development.
I'm not particularly impressed with DB's in general nor RDBMS's in particular. They force applications to structure their data in formats to fit the database, require the use of a language totally abstracted from the application to access it, and finally force conversions to/ from the applications format to the DB format and back again. As generalised solutions they are fairly effective, but with that generalisation comes penalties and these go way beyond slow data access.
Whilst Moore's Law may be holding true for raw processor power, data volumes are still managing to outstrip them. More problematic is that we human beings are not yet able to effectively utilise the performance of our processors. The languages we use, and their compilers and interpreters, leave too much of the bits and bytes of implementation to the human being, or we code generalised, reusable solutions to problems and sacrifice better, application specific algorithms and performance along the way.
I doubt this dilemma will resolve itself whilst we continue to hand-craft every piece of code, but equally, I don't see any sign that the current levels of reusable code designs are good enough to allow the range of real-world problems to be satisfactorily solved using a bolt-together-components-and glue approach. We also need to move beyond using foreign formats (like tables/tuples files/lines) for intermediate/persistant storage.
Ultimately, we need to be able to describe our applications at a much higher level, in terms of objects with attributes and the desired interactions between them and have compilers that read those descriptions and decide how to implement them.
The implementation produced at this level would be a logical implementation devoid of storage specifications or processor specific code. A virtual machine simulator would be used to 'run' the application for testing purposes. It would ensure that where attributes are interchanged between objects, that the objects have methods available to perform any conversions required. It would be able to generate data and generate tests to ensure code-path coverage.
Once the application has passed this level of testing, only then would it be passed to a second level compiler to perform the conversion to machine code. Selecting an appropriate internal storage format for each attribute, depending in part upon the word size etc. of the target processor, partly on usage information produced by the virtual machine runs. So, for example, if an attribute is numeric but it's most frequent use is for display purposes, then it would be stored in ASCII (or unicode) and only converted to binary when necessary, or manipulated using BCD or whatever. Conversely, if the attribute is predominantly used for math, it would be maintained in binary, the size of the storage used determined on the basis of the virtual machine runs.
Once the processor specific code has been compiled, the data and test scenarios generated by the VM compiler are automatically re-run and machine specific optimisations performed on the basis of dynamic analysis rather than the simplistic static analysis that current compilers perform. Further, automatically generated logging and tracing could be used to monitor the performance of the application over it's lifetime and if the nature of it's data or use changes in way that indicate re-optimisation is required, then that could be achieved by re-generating the processor/ platform specific code from the compiled VM form.
So I dream :)