http://qs1969.pair.com?node_id=309017


in reply to A (memory) poor man's hash lookup table.

First, an important correction. The amount of memory taken to store a hash varies widely depending on how you build it. On 5.8, initializing a large hash with undef $foo{$_} for 1..1_000_000; takes about 69 MB. Initializing with undef @foo{1..1_000_000}; runs faster but takes about 130 MB. (This was run with Perl 5.8.1 on Linux; YMMV.) I'm guessing that the difference comes down to how much intermediate buffering is going on.
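If you want to reproduce the comparison, something like the rough sketch below works on Linux (the /proc parsing is an assumption; substitute your platform's equivalent):

use strict;
use warnings;

# Report this process's resident set size in kB (Linux-specific assumption).
sub vm_kb {
    open my $fh, '<', "/proc/$$/status" or die "cannot read /proc/$$/status: $!";
    while (<$fh>) {
        return $1 if /^VmRSS:\s+(\d+)\s+kB/;
    }
    die "VmRSS not found";
}

my $before = vm_kb();

my %foo;
if ( @ARGV and $ARGV[0] eq 'slice' ) {
    undef @foo{ 1 .. 1_000_000 };       # hash slice: the whole list is built first
}
else {
    undef $foo{$_} for 1 .. 1_000_000;  # loop: the range is iterated, no list built
}

printf "hash build grew the process by about %d MB\n", ( vm_kb() - $before ) / 1024;

Run it once with no arguments and once with 'slice' to see the two figures.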

The second tip. If you are going to use a dbm with a lot of data, always, always, always use a B_TREE. On a machine faster than yours (by about a factor of 10), my insert time was 12 seconds, so call it 2 minutes for you. See the docs for a better overview, and if you need to tune it further, see here in addition.
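For concreteness, the kind of thing I mean looks roughly like this (a sketch; the file name and cache size are made up):

use strict;
use warnings;
use DB_File;
use Fcntl;    # O_RDWR, O_CREAT

# Optional tuning: give the B-tree a bigger cache before tying (bytes).
$DB_BTREE->{cachesize} = 16 * 1024 * 1024;

tie my %hash, 'DB_File', 'lookup.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie lookup.db: $!";

$hash{$_} = undef for 1 .. 1_000_000;   # close-to-ordered inserts: a friendly case for the B-tree

untie %hash;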

Now I'll grant you that walking keys close to in order is pretty much a best-case scenario, but very few real-world workloads lack considerable locality of reference, so B_TREE accesses tend to win big on datasets of up to a few hundred million records.

And this leads to a general note. Hashes are probably the fastest general-purpose access method when you have a flat memory model, and they serve Perl very well. But the real world is moving more and more towards multiple tiers of data storage with very different speeds between them. (In decreasing order of speed: on the CPU we have level 1, 2, and 3 caches, followed by main RAM (NUMA adds a few layers here), then local hard drives, attached storage devices, network drives, then stuff remote to the LAN.) Given the differentials in Moore's law, this trend is likely to keep growing in importance as the ratios between the speeds of these layers get more and more extreme, which will make alternatives to hashing more and more attractive.

Re: Re: A (memory) poor man's hash
by BrowserUk (Patriarch) on Nov 22, 2003 at 08:00 UTC
    First an important correction. The amount of memory taken to store a hash varies widely depending on how you do it....

    The difference in the memory consumed by the two methods you show is that with undef @foo{ 1 .. 1_000_000 };, the extra memory is consumed by generating the list 1 .. 1_000_000 prior to populating the hash slice.

    When you use ...for 1 .. 1_000_000;, no list is generated, as for has an optimisation that allows integer ranges to act as iterators rather than list generators.

    The method I used was

    $hash{ $_ } = undef for 1 .. 1_000_000;
    but on my machine,
    undef $foo{ $_ } for 1 .. 1_000_000;
    results in an identical memory consumption. Prior to the allocation of the hash, the perl process has 3332k allocated to it. After the allocation, it has 98184k allocated. Hence my assertion that the memory required is 95 MB. I can't explain the difference between the 95 MB I am seeing and the 69 MB you reported.

    I tried using Devel::Size to verify that the memory allocation the OS reports is actually being used by the hash, rather than some portion of it being intermediate working storage that would be available to the process for other purposes afterwards. However, calling either size( \%hash ) or total_size( \%hash ) doubled the memory consumed by the process, pushing it beyond both my physical and virtual memory capacity before segfaulting.
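    For reference, the check I was attempting amounts to this (a sketch; it needs roughly twice the hash's footprint free to run to completion):

    use strict;
    use warnings;
    use Devel::Size qw( size total_size );

    my %hash;
    undef $hash{ $_ } for 1 .. 1_000_000;

    # size() measures the hash structure itself; total_size() follows the
    # values as well, so it is the larger of the two figures.
    print 'size:       ', size( \%hash ), " bytes\n";
    print 'total_size: ', total_size( \%hash ), " bytes\n";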


    Using DB_BTREE rather than DB_HASH does speed up the insertion considerably.

    use DB_File;
    tie %hash, 'DB_File', 'tree1M.db', O_RDWR|O_CREAT, 0666, $DB_BTREE or die $!;

    print time;
    $hash{ $_ } = undef for 1 .. 1_000_000;
    print time;

    1069482082
    1069482433    # 351 secs (5.85 mins)

    $n = 0;
    print time;
    exists $hash{ $_ } and $n++ for 1 .. 1_000_000;
    print time, $/, $n;

    1069482523
    1069482799    # 276 secs (4.6 mins)
    1000000

    It reduces the insertion time from an hour to 5 minutes and the iteration time from 15 minutes to 4 at the cost of increasing the size of the database file from roughly twice the flat file size to 4x.

    If you are only using the hash for its fast "does it exist" property, then the custom hash mechanism I described compares favourably, with the insertion taking 16 seconds and the iteration 44.

    print time;
    $hash{ substr $_, 0, 4 } .= $/ . $_ for 1 .. 1_000_000;
    $hash{ $_ } .= $/ for keys %hash;
    print time;

    1069484550
    1069484566    # 16 seconds.

    $n = 0;
    print time;
    for ( 1 .. 1_000_000 ) {
        next unless exists $hash{ substr $_, 0, 4 }
            and 1 + index( $hash{ substr $_, 0, 4 }, $/ . $_ . $/ );
        $n++;
    }
    print time, $/, $n;

    1069484612
    1069484656    # 44 secs.
    1000000

    With respect to caching: I was recently playing with prime number generators/testers and came across some discussion of techniques for optimising code (C and assembler) to make best use of L1 and L2 caching. I'll admit that much of the discussion was over my head, but my conclusion was that compilers will have to become much cleverer than they currently are before general code will fully benefit from these techniques.

    The current trend to making caches larger and larger will mitigate this somewhat, but the depth of understanding required to fully utilise these caches and tailor algorithms to them is so great, and so processor specific, that I doubt that it will ever be more than a niche exercise.

    Whether processors will ever get to the point where the microcode is capable of utilising out-of-order execution, re-ordering the pipelines, or re-writing the instruction caches to optimise cache hit rates at anything beyond the keyhole scale, I'm not sure. Given that the disposable processor you get in musical birthday cards has more processing power in its 4 mm² of silicon than ENIAC did in its 30 tons of hardware consuming 200 kW, who knows what a "computer" will be capable of in another 60 years :)

    It's just a shame I won't be around to see it.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

      I thought of the overhead of building the list, but the difference was more than I would expect that to account for. I did not measure it, though. However, I just used Devel::Size and it reported 32083260 bytes as the size of the hash and 44083260 as the total_size (regardless of how I built the hash). My guess is that the rest is extra buffering on the part of various memory requests, part of which will be the OS buffering Perl's mallocs, which is likely to be highly OS dependent, and possibly order-of-request dependent as well.

      On the implementation: note that the performance can be improved both by moving from tied data structures to OO access and again (probably more) by using the documented tuning parameters. Additionally, with a dataset this small it may be acceptable to switch back to a hash stored in memory, since BerkeleyDB's hashing takes less memory than Perl's. Furthermore, the fact that it is general-purpose makes me more willing to take that approach rather than constructing a special-purpose solution that is possibly fragile.
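      For instance, the OO route looks roughly like this (a sketch only; the file name is made up, and the savings from bypassing the tie magic will vary):

      use strict;
      use warnings;
      use DB_File;
      use Fcntl;    # O_RDWR, O_CREAT

      my %hash;
      my $db = tie %hash, 'DB_File', 'tree1M.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
          or die "Cannot tie: $!";

      # Insert and look up via the object, skipping the per-access tie overhead.
      $db->put( $_, '' ) for 1 .. 1_000_000;

      my ( $n, $dummy ) = ( 0, undef );
      for ( 1 .. 1_000_000 ) {
          $n++ if $db->get( $_, $dummy ) == 0;   # get() returns 0 when the key exists
      }
      print "$n\n";

      undef $db;     # drop the object before untie-ing to avoid the "untie gotcha"
      untie %hash;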

      And on caching: I agree heartily that optimizing for perfection is only ever going to be a niche exercise. I disagree that optimizing for better use of multi-level memory is unimportant, though. I predict that over time you will see programmers pushed towards data structures that are readily tuned to local memory configurations, dynamically if possible. You could view the popularity of databases as a step in this direction. A far bigger step is Judy arrays. Sure, they are very complicated to implement, but you only need to implement them once. (Well, Perl would need to reimplement them for licensing reasons...) And from the end programmer's perspective, it is just a question of replacing the black box of a hash with the black box of another associative-array implementation, and voilà! Things run faster!

      A few more years of increasing discrepancies between access times, and the difference will become bigger still. Eventually it will be big enough that common scripting languages will make the switch. Most programmers will never need to understand what actually happened, any more than we really understand most of the ongoing changes that make Moore's law tick in the first place.

        By booting a minimal configuration of my OS, I managed to persuade Devel::Size to report the size/total_size of the standard hash on my machine, and it confirms your figures of ~32 and 44 MB respectively. However, I also used a RAM disc to reduce my memory capacity in increments and found that 98 MB is the minimum required to create the 1 million key hash. Any less and, with swapping turned off, an out-of-memory error occurs before completion.

        My OP was purely an exploration of an idea. The intent was to engender discussion of its merits versus other mechanisms available, or that might be made available, for use from P5 in the near term. It's unfortunate that rather than getting to discuss the idea and how it might be used or implemented, I ended up having to defend my non-existent "attack" on perl's hashes and justify my claims regarding memory consumption. C'est la vie.

        I have now implemented a low-memory hash based on the idea I described in the original post. It is a real hash, ensuring no duplication and supporting values as well as keys. It has limitations: it doesn't support storage of references, and keys cannot contain one character used internally (currently ASCII 255). Like Tie::SubstrHash, it uses scalars to store the keys and values. However, it doesn't require fixed-size keys/values or a fixed table size; these all grow dynamically as required. It currently accepts a single extra parameter at the tie, which is used to determine which part of the key is used for indexing.
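        To give a flavour of the shape of the thing, here is a stripped-down sketch of the bucketing scheme, not the module itself; the package name, separator handling and default prefix length are all simplifications, values here must also avoid the reserved character, and FIRSTKEY/NEXTKEY/CLEAR are omitted for brevity:

        package Tie::SubstrBucket;      # illustrative name only

        use strict;
        use warnings;

        my $SEP = chr 255;              # reserved internal separator character

        sub TIEHASH {
            my ( $class, $prefix_len ) = @_;
            # $prefix_len: how many leading characters of a key select its bucket
            return bless { len => $prefix_len || 4, buckets => {} }, $class;
        }

        # Unpack one bucket scalar back into a key/value list.
        sub _pairs {
            my ( $self, $key ) = @_;
            my $b = $self->{buckets}{ substr $key, 0, $self->{len} };
            return () unless defined $b && length $b;
            return split /\Q$SEP\E/, substr( $b, 1 ), -1;   # drop the leading separator
        }

        sub STORE {
            my ( $self, $key, $value ) = @_;
            $self->DELETE( $key );                          # no duplicate keys
            $value = '' unless defined $value;
            $self->{buckets}{ substr $key, 0, $self->{len} } .= "$SEP$key$SEP$value";
        }

        sub FETCH  { my ( $self, $key ) = @_; my %p = $self->_pairs( $key ); return $p{$key} }
        sub EXISTS { my ( $self, $key ) = @_; my %p = $self->_pairs( $key ); return exists $p{$key} }

        sub DELETE {
            my ( $self, $key ) = @_;
            my %p = $self->_pairs( $key );
            return unless exists $p{$key};
            my $old    = delete $p{$key};
            my $bucket = substr $key, 0, $self->{len};
            if ( %p ) {
                $self->{buckets}{$bucket} = join '', map { "$SEP$_$SEP$p{$_}" } keys %p;
            }
            else {
                delete $self->{buckets}{$bucket};
            }
            return $old;
        }

        package main;

        tie my %lookup, 'Tie::SubstrBucket', 4;     # 4 leading characters per bucket
        $lookup{ $_ } = undef for 1 .. 1_000_000;
        print exists $lookup{ 123_456 } ? "found\n" : "missing\n";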

        Currently it uses 16 MB to store the one million keys with undef values. Insertion takes 500 secs and traversal 380 seconds. Almost all of the degradation relative to my original tests is due to the overhead of the tie mechanism. It's these penalties, incurred when extending P5's generic datatypes whether through tieing or OO, that make me such an enthusiast for P6 (or maybe Ponie as an interim).

        I'm still testing, optimising and documenting, prior to making it available for people to evaluate. Like all new code, it will probably be fragile until it has been exercised for a while in some 'real' applications, but that doesn't put me off from trying new/different approaches to solving problems.


        I didn't say that I consider cache optimisation unimportant. I do doubt that it is possible, in a meaningful way, for cross-platform development, or even practical for most purposes, unless it is performed by compilers or interpreters tuned to the target platform. In the documentation for Judy arrays, they note that the laboriously hand-crafted L1 cache optimisations they developed will probably not be (as) beneficial on IA32 or some RISC processors, which indicates part of the problem.

        Even when correctly targeted, the benefits will often be transitory, as there is a second part to the problem. Judy arrays are designed to allow large portions of the data structure to be searched/traversed whilst avoiding cache-fill delays (on the targeted processor(s)), but the optimisations are only effective whilst the cache remains coherent.

        If the application using them traverses the entire structure from a very localised piece of code that doesn't call other code that would cause the caches to be overwritten, then it will fully benefit from the optimisation. But if the code traversing the data structure, calls other code that accesses a different data structure each time through the loop, then the cache will need to be re-filled for each element of the array.

        And that's the crux of the problem: cache optimisations only ever work at the local level, but the code utilising the optimised structures is rarely so confined. One piece of code using two highly cache-optimised structures alternately will destroy the optimisations as it switches between them. The moment you have multiple applications sharing the same cache, there is no guarantee that the cache will remain coherent for even a single element access. The task will be interrupted randomly by the scheduler unless extraordinary measures (CritSecs or similar) are deployed to combat this, and whichever other task is scheduled next will overwrite the cache.

        I did read somewhere about a concept of runtime optimisation performed by microcode, but if I recall correctly this was aimed at highly parallelised algorithms like the FFTs used in weather analysis and similar high-volume, repetitive-data applications. I don't recall whether the concept was actually being implemented, or if it was just blue sky, but I have my doubts as to its applicability to generalised processing and processors, least-wise given the current state of processor development.

        I'm not particularly impressed with DBs in general, nor RDBMSs in particular. They force applications to structure their data in formats that fit the database, require the use of a language totally abstracted from the application to access it, and finally force conversions from the application's format to the DB's format and back again. As generalised solutions they are fairly effective, but with that generalisation come penalties, and these go way beyond slow data access.


        Whilst Moore's Law may be holding true for raw processor power, data volumes are still managing to outstrip it. More problematic is that we human beings are not yet able to utilise the performance of our processors effectively. The languages we use, and their compilers and interpreters, leave too much of the bits and bytes of implementation to the human being; or we code generalised, reusable solutions to problems and sacrifice better, application-specific algorithms and performance along the way.

        I doubt this dilemma will resolve itself whilst we continue to hand-craft every piece of code, but equally, I don't see any sign that the current levels of reusable code design are good enough to allow the range of real-world problems to be satisfactorily solved using a bolt-together-components-and-glue approach. We also need to move beyond using foreign formats (like tables/tuples and files/lines) for intermediate/persistent storage.

        Ultimately, we need to be able to describe our applications at a much higher level, in terms of objects with attributes and the desired interactions between them and have compilers that read those descriptions and decide how to implement them.

        The implementation produced at this level would be a logical implementation devoid of storage specifications or processor-specific code. A virtual machine simulator would be used to 'run' the application for testing purposes. It would ensure that, where attributes are interchanged between objects, the objects have methods available to perform any conversions required. It would be able to generate data and generate tests to ensure code-path coverage.

        Once the application has passed this level of testing, only then would it be passed to a second-level compiler to perform the conversion to machine code, selecting an appropriate internal storage format for each attribute, depending partly on the word size etc. of the target processor and partly on usage information produced by the virtual machine runs. So, for example, if an attribute is numeric but its most frequent use is for display purposes, then it would be stored in ASCII (or Unicode) and only converted to binary when necessary, or manipulated using BCD or whatever. Conversely, if the attribute is predominantly used for maths, it would be maintained in binary, with the size of the storage used determined on the basis of the virtual machine runs.

        Once the processor-specific code has been compiled, the data and test scenarios generated by the VM compiler would automatically be re-run, and machine-specific optimisations performed on the basis of dynamic analysis rather than the simplistic static analysis that current compilers perform. Further, automatically generated logging and tracing could be used to monitor the performance of the application over its lifetime, and if the nature of its data or use changes in a way that indicates re-optimisation is required, that could be achieved by re-generating the processor/platform-specific code from the compiled VM form.

        So I dream :)


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail