http://qs1969.pair.com?node_id=309128


in reply to Re: A (memory) poor man's hash
in thread A (memory) poor man's <strike>hash</strike> lookup table.

First an important correction. The amount of memory taken to store a hash varies widely depending on how you do it....

The difference in the memory consumed by the 2 methods you show is that with undef @foo{ 1 .. 1_000_000 };, the extra memory is consumed by generating the list 1 .. 1_000_000 prior to generating the hash slice.

When you use ...for 1 .. 1_000_000;, no list is generated as for has an optimisation to allow integer ranges to act as iterators rather than list generators.
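For clarity, here are the two idioms side by side (paraphrased from the snippets under discussion; both are intended to leave a hash of one million keys, each with an undef value):

my %foo;
undef @foo{ 1 .. 1_000_000 };            # hash slice: the list 1 .. 1_000_000 is
                                         # built in full before the slice is assigned

my %bar;
$bar{ $_ } = undef for 1 .. 1_000_000;   # foreach over a literal integer range is
                                         # optimised into a counting loop; no list is built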

The method I used was

$hash{ $_ } = undef for 1 .. 1_000_000;
but on my machine,
undef $foo{ $_ } for 1 .. 1_000_000;
results in an identical memory consumption. Prior to the allocation of the hash, the perl process has 3332k allocated to it. After the allocation, it has 98184k allocated. Hence my assertion that the memory required is 95 MB. I can't explain the difference between the 95 MB I am seeing and the 69 MB you reported.

I tried using Devel::Size to verify that the memory allocation the OS reports is actually being used by the hash, rather than some portion of it being intermediate working storage that would be available to the process for other purposes afterwards, but calling either size( \%hash ); or total_size( \%hash ); doubled the memory consumed by the process, pushing it beyond both my physical and virtual memory capacity before segfaulting.
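For reference, the measurement being attempted looks like this (Devel::Size's documented interface; the exact numbers will of course vary by platform and perl build):

use Devel::Size qw( size total_size );

my %hash;
$hash{ $_ } = undef for 1 .. 1_000_000;

# size() does not follow the values stored in the hash; total_size() does.
print size( \%hash ),       $/;
print total_size( \%hash ), $/;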


Using DB_BTREE rather than DB_HASH does speed up the insertion considerably.

use DB_File;
tie %hash, 'DB_File', 'tree1M.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die $!;

print time;
$hash{ $_ } = undef for 1 .. 1_000_000;
print time;

1069482082
1069482433      # 351 secs (5.85 mins)

$n = 0;
print time;
exists $hash{ $_ } and $n++ for 1 .. 1_000_000;
print time, $/, $n;

1069482523
1069482799      # 276 secs (4.6 mins)
1000000

It reduces the insertion time from an hour to 5 minutes and the iteration time from 15 minutes to 4 at the cost of increasing the size of the database file from roughly twice the flat file size to 4x.

If you are only using the hash for its fast "does it exist" property, then using the custom hash mechanism I described compares favourably, with the insertion taking 16 seconds and the iteration 44.

print time;
$hash{ substr $_, 0, 4 } .= $/ . $_ for 1 .. 1_000_000;
$hash{ $_ } .= $/ for keys %hash;
print time;

1069484550
1069484566      # 16 seconds.

$n = 0;
print time;
for ( 1 .. 1_000_000 ) {
    next unless exists $hash{ substr $_, 0, 4 }
        and 1 + index( $hash{ substr $_, 0, 4 }, $/ . $_ . $/ );
    $n++;
}
print time, $/, $n;

1069484612
1069484656      # 44 secs.
1000000

With respect to caching: I was recently playing with prime number generators/testers and came across some discussion of techniques for optimising code (C and assembler) to make best use of L1 and L2 caching. I'll admit that much of the discussion was over my head, but my conclusion was that compilers will have to become much cleverer than they currently are before general code will fully benefit from these techniques.

The current trend to making caches larger and larger will mitigate this somewhat, but the depth of understanding required to fully utilise these caches and tailor algorithms to them is so great, and so processor specific, that I doubt that it will ever be more than a niche exercise.

Whether processors will ever get to the point where the microcode is capable of utilising out-of-order execution, re-ordering the pipelines, or re-writing the instruction caches to optimise cache hit rates at anything beyond the keyhole scale, I'm not sure. Given that the disposable processor you get in musical birthday cards has more processing power in its 4 mm² of silicon than ENIAC did in its 30 tons of hardware consuming 200 kW, who knows what a "computer" will be capable of in another 60 years :)

It's just a shame I won't be around to see it.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Re: Re: Re: A (memory) poor man's hash
by tilly (Archbishop) on Nov 22, 2003 at 14:55 UTC
    I thought of the overhead of building the list, but the difference was more than I would expect that to account for. But I did not measure it. However I just used Devel::Size and it reported 32083260 as the size of the hash and 44083260 as the total_size. (Regardless of how I built that hash.) My guess is that the rest is extra buffering on the part of various memory requests, part of which is going to be the OS buffering Perl's mallocs, which is likely to be highly OS-dependent, and possibly order-of-request dependent as well.

    On the implementation, note that the performance can be improved both by moving from tied data structures to OO access and again (probably more) by using documented tuning parameters. Additionally with a dataset this small, it may be acceptable to switch back to a hash stored in memory (since BerkeleyDB's hashing takes less memory than Perl's). Furthermore the fact that it is general-purpose makes me more willing to take that approach rather than constructing a special-purpose solution that is possibly fragile.
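    For instance, DB_File lets you adjust the btree defaults before the tie; a sketch (the 16 MB cache below is purely illustrative, not a recommendation):

    use DB_File;

    # Enlarge Berkeley DB's in-memory cache before tie-ing; 'cachesize' is one
    # of DB_File's documented DB_BTREE tuning parameters.
    $DB_BTREE->{cachesize} = 16 * 1024 * 1024;

    tie my %hash, 'DB_File', 'tree1M.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
        or die "Cannot tie: $!";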

    And on caching. I agree heartily that optimizing for perfection is only going to be a niche exercise. I disagree that optimizing for better use of multi-level memory is unimportant though. I predict that you will see programmers over time being pushed towards data structures that are readily tuned to local memory configurations, dynamically if possible. You could view the popularity of databases as a step in this direction. A far bigger step is Judy arrays. Sure, it is very complicated to implement, but you only need to implement it once. (Well, Perl would need to reimplement it for licensing reasons...) And from the end programmer perspective, it is just a question of replacing the black box of a hash by the black box of another associative array implementation and voila! Things run faster!

    A few more years of increasing discrepancies between access times, and the difference will become bigger still. Eventually it will be big enough that common scripting languages will make the switch. Most programmers will never need to understand what actually happened, any more than we really understand most of the ongoing changes that make Moore's law tick in the first place.

      By booting a minimal configuration of my OS, I managed to persuade Devel::Size to report the size/total_size of the standard hash on my machine and it confirms your figures of ~= 32 & 44 MB respectively. However, I also used a ram disc to reduce my memory capacity in increments and found that 98 MB is the minimum required to create the 1-million-key hash. Any less, and with swapping turned off, an out-of-memory error occurs before completion.

      My OP was purely an exploration of an idea. The intent was to engender discussion re: its merits versus other mechanisms available, or that might be made available, for use from P5 in the near term. It's unfortunate that rather than getting to discuss the idea and how it might be used or implemented, I ended up having to defend my non-existent "attack" on perl's hashes and justify my claims regarding memory consumption. C'est la vie.

      I have now implemented a low-memory hash based on the idea I described in the original post. It is a real hash, ensuring no duplication and supporting values as well as keys. It has limitations: it doesn't support storage of references, and keys cannot contain the one character used internally (currently ASCII 255). Like Tie::SubstrHash, it uses scalars to store the keys and values. However, it doesn't require fixed-size keys/values or a fixed table size; these all grow dynamically as required. It currently accepts a single extra parameter at the tie which is used to determine which part of the key is used for indexing.

      Currently it uses 16 MB to store the one million keys with undef values. Insertion takes 500 secs and traversal 380 seconds. Almost all of the degradation relative to my original tests is due to the overheads of the tie mechanism. It's these penalties, incurred when extending P5's generic datatypes whether through tieing or OO, that make me such an enthusiast for P6 (or maybe Ponie as an interim).
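      To give a flavour of the approach, here is a stripped-down sketch along the same lines. This is not the module itself: the package name, the record format, the per-bucket rebuild on STORE and the omission of DELETE, CLEAR and the iteration methods are all simplifications of mine.

      package Tie::PackedLookup;    # hypothetical name, for illustration only
      use strict;
      use warnings;

      my $SEP = "\xFF";    # the one reserved character; keys and values must not contain it

      sub TIEHASH {
          my( $class, $prefixLen ) = @_;
          # $prefixLen is the extra tie parameter: how much of each key selects a bucket.
          return bless { plen => $prefixLen || 4, buckets => {} }, $class;
      }

      # Unpack one bucket scalar ( SEP.k1.SEP.v1 . SEP.k2.SEP.v2 ... ) into a key/value list.
      sub _pairs {
          my $bucket = shift;
          return () unless defined $bucket and length $bucket;
          my @fields = split /$SEP/, $bucket, -1;
          shift @fields;              # discard the empty field before the leading separator
          return @fields;             # ( k1, v1, k2, v2, ... )
      }

      sub STORE {
          my( $self, $key, $value ) = @_;
          my $bkey  = substr $key, 0, $self->{plen};
          my %pairs = _pairs( $self->{buckets}{ $bkey } );
          $pairs{ $key } = defined $value ? $value : '';    # undef values come back as ''
          $self->{buckets}{ $bkey } =
              join '', map { $SEP . $_ . $SEP . $pairs{ $_ } } keys %pairs;
      }

      sub FETCH {
          my( $self, $key ) = @_;
          my %pairs = _pairs( $self->{buckets}{ substr $key, 0, $self->{plen} } );
          return $pairs{ $key };
      }

      sub EXISTS {
          my( $self, $key ) = @_;
          my %pairs = _pairs( $self->{buckets}{ substr $key, 0, $self->{plen} } );
          return exists $pairs{ $key };
      }

      package main;
      tie my %lut, 'Tie::PackedLookup', 4;    # bucket on the first 4 characters of each key
      $lut{ $_ } = undef for 1 .. 10_000;
      print exists $lut{ 1234 } ? "found\n" : "missing\n";

      The real module will obviously differ in detail, but the memory saving comes from the same place: a few thousand bucket scalars instead of a million individual hash entries.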

      I'm still testing, optimising and documenting, prior to making it available for people to evaluate. Like all new code, it will probably be fragile until it has been exercised for a while in some 'real' applications, but that doesn't put me off from trying new/different approaches to solving problems.


      I didn't say that I consider cache optimisation unimportant. I do doubt that it is possible, in a meaningful way, for cross-platform development, or even practical for most purposes unless it is performed by compilers or interpreters tuned to the target platform. In the documentation for the Judy arrays, they note that the laboriously hand-crafted L1 cache optimisations they developed for it will probably not be (as) beneficial on IA32 or some RISC processors which indicates part of the problem.

      Even when correctly targeted, the benefits will often be transitory, as there is a second part to the problem. The Judy arrays are designed to allow large portions of the data structures to be searched/traversed whilst avoiding cache-fill delays (on the targeted processor(s)), but the optimisations are only effective whilst the cache remains coherent.

      If the application using them traverses the entire structure from a very localised piece of code that doesn't call other code that would cause the caches to be overwritten, then it will fully benefit from the optimisation. But if the code traversing the data structure calls other code that accesses a different data structure each time through the loop, then the cache will need to be re-filled for each element of the array.

      And that's the crux of the problem: cache optimisations only ever work at the local level, but the code utilising the optimised structures is rarely so confined. One piece of code using two highly cache-optimised structures alternately will destroy the optimisations as it switches between them. The moment you have multiple applications using the same cache, there is no guarantee that the cache will remain coherent for even a single element access. The task will be interrupted randomly by the scheduler unless extraordinary measures (critical sections or similar) are deployed to combat this, and whichever other task is scheduled next will overwrite the cache.

      I did read somewhere about a concept of runtime optimisation performed by microcode, but if I recall correctly this was aimed at highly parallelised algorithms like the FFTs used in weather analysis and similar high-volume, repetitive-data applications. I don't recall whether the concept was actually being implemented, or if it was just blue sky, but I have my doubts as to its applicability to generalised processing and processors -- least-wise given the current state of processor development.

      I'm not particularly impressed with DBs in general, nor RDBMSs in particular. They force applications to structure their data in formats that fit the database, require the use of a language totally abstracted from the application to access it, and finally force conversions from the application's format to the DB format and back again. As generalised solutions they are fairly effective, but with that generalisation come penalties, and these go way beyond slow data access.


      Whilst Moore's Law may be holding true for raw processor power, data volumes are still managing to outstrip it. More problematic is that we human beings are not yet able to effectively utilise the performance of our processors. The languages we use, and their compilers and interpreters, leave too much of the bits and bytes of implementation to the human being; or we code generalised, reusable solutions to problems and sacrifice better, application-specific algorithms and performance along the way.

      I doubt this dilemma will resolve itself whilst we continue to hand-craft every piece of code, but equally, I don't see any sign that the current levels of reusable code designs are good enough to allow the range of real-world problems to be satisfactorily solved using a bolt-together-components-and-glue approach. We also need to move beyond using foreign formats (like tables/tuples, files/lines) for intermediate/persistent storage.

      Ultimately, we need to be able to describe our applications at a much higher level, in terms of objects with attributes and the desired interactions between them and have compilers that read those descriptions and decide how to implement them.

      The implementation produced at this level would be a logical implementation devoid of storage specifications or processor-specific code. A virtual machine simulator would be used to 'run' the application for testing purposes. It would ensure that where attributes are interchanged between objects, the objects have methods available to perform any conversions required. It would be able to generate data and generate tests to ensure code-path coverage.

      Once the application has passed this level of testing, only then would it be passed to a second-level compiler to perform the conversion to machine code, selecting an appropriate internal storage format for each attribute depending partly upon the word size etc. of the target processor and partly on usage information produced by the virtual machine runs. So, for example, if an attribute is numeric but its most frequent use is for display purposes, then it would be stored in ASCII (or Unicode) and only converted to binary when necessary, or manipulated using BCD or whatever. Conversely, if the attribute is predominantly used for math, it would be maintained in binary, the size of the storage used being determined on the basis of the virtual machine runs.

      Once the processor-specific code has been compiled, the data and test scenarios generated by the VM compiler are automatically re-run and machine-specific optimisations performed on the basis of dynamic analysis rather than the simplistic static analysis that current compilers perform. Further, automatically generated logging and tracing could be used to monitor the performance of the application over its lifetime, and if the nature of its data or use changes in a way that indicates re-optimisation is required, then that could be achieved by re-generating the processor/platform-specific code from the compiled VM form.

      So I dream :)


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail

        On the low-memory hashing. I'm sure that what you are doing is possible. I'm sure that it saves memory. At a minor hit to performance you can get rid of the one illegal character limitation as follows:
        # Two useful auxiliary hashes
        my %trans_in = (
            "\xFD" => "\xFD",
            "\xFE" => "\xFD\xFE",
            "\xFF" => "\xFD\xFE\xFE",
        );
        my %trans_out = reverse %trans_in;

        # On the way in...
        $string =~ s/(\xFD|\xFE|\xFF)/$trans_in{$1}/g;

        # On the way out
        $string =~ s/(\xFD\xFE*)/$trans_out{$1}/g;
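        Continuing from the snippet above, a quick round-trip check (the test string is arbitrary) shows that the encoded form never contains "\xFF" and that decoding restores the original:

        my $orig   = "abc\xFF\xFE\xFDdef";
        my $string = $orig;

        $string =~ s/(\xFD|\xFE|\xFF)/$trans_in{$1}/g;          # encode
        die "encoded string still contains \\xFF" if $string =~ /\xFF/;

        $string =~ s/(\xFD\xFE*)/$trans_out{$1}/g;              # decode
        print $string eq $orig ? "round trip ok\n" : "round trip FAILED\n";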

        With cache optimization, we need to specify our goals first. If your goal is to achieve a universal large win, or to achieve any kind of ideal optimization, then optimizing cache coherency is an impossible ideal. But that isn't my goal. My goal would be to have data structures which will tend to perform better. And that is quite possible.

        Sure, Judy arrays perform best in code that is limited to just working with the array in question. They also will perform better on chips they have been tuned to, and tuning them to different chips is a never-ending process. However the fact that you have paid attention to how cache-friendly your data structure is can win. Perhaps when you are working simultaneously with 3 Judy arrays you have each one ruining the cache coherency of the others. But what that means is that you are ageing data out of caches earlier than expected. So instead of finding something in level 1 cache it might be in level 2 cache. This is still a win over using a hash with its tendency to blow all caches, all the time.

        Similarly it is clear that code that was tuned to one set of cache sizes won't work as well with a different set of cache sizes. However it is still a win. Furthermore, my claim is that the different exponents in Moore's law for improvements in chip speed versus improvements in the rate of data transfer mean that what was a win will tend to become a bigger win as time goes by. Sure, something else might now be optimal. But pretending that access times are flat will lose by more in the future than it does now.

        On relational databases, I think that you are missing the boat. Sure, a relational database pushes programmers to restructure their data structures and push logic to the database. But the reason why it does that is that when you take that step, the database on the fly figures out far better algorithms than programmers normally would manage to figure out for themselves. Sure, there are a smidgeon of programmers who can beat the relational database. But nobody can beat it consistently without doing a lot of work. (Just consider the amount of reprogramming that you have to do to match what a DBA accomplishes by adding the right index to speed up existing queries.)

        And in addition to the speed win, nobody in their right mind thinks that they can match the transactional mechanics that a database provides by rolling their own at the application level.


        Your dream of effectively utilizing the performance of our processors is the diametric opposite of my dream, and runs counter to every trend in programming.

        In various aspects of programming (and life) we make many trade-offs. As the cost of any particular factor drops relative to the others, the natural tendency is to be willing to waste more of that factor to save on some or all of the others. If you like, you can think of this as a result of some sort of generalized Le Chatelier's Principle. I certainly do.

        This can be carried to amazing lengths. For instance a merchant from 200 years ago would be astounded at the speed with which we routinely ship small parcels from New York City to Boston. Said merchant would also be flabbergasted at the idea of possibly routing said packages through Baltimore. But it makes sense to do so today since the incremental costs of transporting things farther have fallen to a point where we are willing to trade absurd amounts of it for the efficiencies of centralized sorting and routing.

        In programming this means that the natural response to having more processor speed to work with is not to figure out how to squeeze more speed out to achieve some ideal level of performance. Rather it is to view CPU performance as cheap and start trading it away for everything else that we can.

        If we could get performance and everything else that we might want, that would be perfect. But your dream runs aground on the limits of the Halting Problem: you simply cannot compute a perfect static analysis. Oh, you can do heuristics (every optimizer does), and you can do better heuristics with runtime data. (Transmeta attempts to do so with their code-morphing software.) But those attempts add latency issues (if only for the latency to realize when the usage pattern changes), and will work worse and worse as you move to more and more dynamic techniques.

        Well that isn't entirely true. There are ways to program that allow optimization to be done on the fly (if only to pieces of your program) to make things run really well. Of course making that work means that programmers have to jump through a few hoops, and hand off problems to the computer for it to take final control over. Which can work well for both sides: the computer gets well-defined problems that it can work with, and humans get to define the hard bits.

        I'm not sure that you would like that programming direction though. The most successful example of that method of cooperation is relational databases. Which you don't really like. But they do exactly what you want, right down to automatically generating logging and tracing to allow you to monitor and tune the database. In many cases people have the database set up to recalculate statistics periodically and improve its execution paths based on the current datasets and usage. (Freeing humans to worry less about the best algorithms, and allowing us to focus on higher-order pieces.)