Re: Re: Re: Re: A (memory) poor man's hash

By booting a minimal configuration of my OS, I managed to persuade Devel::Size to report the size/total_size of the standard hash on my machine, and it confirms your figures of ~= 32 & 44 MB respectively. However, I also used a RAM disc to reduce my memory capacity in increments and found that 98 MB is the minimum required to create the 1 million key hash. With any less, and with swapping turned off, an out-of-memory error occurs before completion.

My OP was purely an exploration of an idea. The intent was to engender discussion re: its merits versus other mechanisms available, or that might be made available, for use from P5 in the near term. It's unfortunate that rather than getting to discuss the idea and how it might be used or implemented, I ended up having to defend my non-existent "attack" on perl's hashes and justify my claims regarding memory consumption. C'est la vie.

I have now implemented a low-memory hash based on the idea I described in the original post. It is a real hash, ensuring no duplication and supporting values as well as keys. It has limitations: it doesn't support storage of references, and keys cannot contain the one character used internally (currently ASCII 255). Like Tie::SubstrHash, it uses scalars to store the keys and values. However, it doesn't require fixed-size keys/values or a fixed table size. These all grow dynamically as required. It currently accepts a single extra parameter at the tie, which is used to determine which part of the key is used for indexing.
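As a rough illustration of the general approach (not the actual implementation, which isn't posted in this thread), here is a minimal sketch of a tied hash that indexes on a key prefix and packs each bucket's key/value pairs into a single scalar. The package name `PoorMansHash`, the bucket layout, and the default prefix length are all assumptions made for the example. It omits FIRSTKEY/NEXTKEY, so `keys` and `each` won't work on it.

```perl
package PoorMansHash;    # hypothetical name, not a CPAN module
use strict;
use warnings;

my $SEP = "\xFF";        # reserved separator; keys and values must not contain it

# The extra tie argument is the number of leading key characters used for indexing.
sub TIEHASH {
    my ($class, $prefix_len) = @_;
    return bless { len => $prefix_len || 2, buckets => {} }, $class;
}

# Locate the bucket for a key and unpack it into a temporary hash.
sub _unpack {
    my ($self, $key) = @_;
    my $id     = substr $key, 0, $self->{len};
    my $packed = $self->{buckets}{$id};
    my %pairs  = defined $packed ? split(/$SEP/, $packed, -1) : ();
    return ($id, \%pairs);
}

# Pack the pairs back into one scalar, or drop the bucket if it is empty.
sub _pack {
    my ($self, $id, $pairs) = @_;
    if (%$pairs) { $self->{buckets}{$id} = join $SEP, %$pairs }
    else         { delete $self->{buckets}{$id} }
    return;
}

sub STORE {
    my ($self, $key, $value) = @_;
    $value = '' unless defined $value;    # values are stored as plain strings
    die "reserved character in key or value\n" if "$key$value" =~ /$SEP/;
    my ($id, $pairs) = $self->_unpack($key);
    $pairs->{$key} = $value;
    $self->_pack($id, $pairs);
    return;
}

sub FETCH {
    my ($self, $key) = @_;
    my (undef, $pairs) = $self->_unpack($key);
    return $pairs->{$key};
}

sub EXISTS {
    my ($self, $key) = @_;
    my (undef, $pairs) = $self->_unpack($key);
    return exists $pairs->{$key};
}

sub DELETE {
    my ($self, $key) = @_;
    my ($id, $pairs) = $self->_unpack($key);
    my $old = delete $pairs->{$key};
    $self->_pack($id, $pairs);
    return $old;
}

package main;

tie my %h, 'PoorMansHash', 2;    # index on the first two characters of each key
$h{ sprintf "%04d", $_ } = $_ * 2 for 1 .. 1000;
my $bucket_count = keys %{ tied(%h)->{buckets} };
print "1000 keys stored in $bucket_count bucket scalars\n";
```

Unpacking a whole bucket on every access is slow. The sketch deliberately trades speed for fewer, larger scalars, which is the same direction of trade-off described above.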

Currently it uses 16 MB to store the one million keys with undef values. Insertion takes 500 seconds and traversal 380 seconds. Almost all of the degradation relative to my original tests is due to the overheads of the tie mechanism. It's these penalties, incurred when extending P5's generic datatypes whether through tying or OO, that make me such an enthusiast for P6 (or maybe Ponie as an interim).
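The cost of the tie mechanism on its own can be seen with a small comparison, sketched here using only core modules. It times the same insertions into a plain hash and into a hash tied to `Tie::StdHash`, which merely forwards every operation to an ordinary hash. The element count is an arbitrary choice for illustration, and the timings will vary by machine and perl version.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);
use Tie::Hash;    # core module; also defines Tie::StdHash

my $n = 100_000;  # arbitrary size, chosen only for illustration
my %plain;
tie my %tied, 'Tie::StdHash';

# Same work in both cases; the tied version pays a method call per store.
my $t_plain = timeit(1, sub { $plain{$_} = undef for 1 .. $n });
my $t_tied  = timeit(1, sub { $tied{$_}  = undef for 1 .. $n });

print "plain hash: ", timestr($t_plain), "\n";
print "tied hash:  ", timestr($t_tied),  "\n";
```

Since `Tie::StdHash` adds no logic of its own, any difference between the two figures is attributable to the tie dispatch itself.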

I'm still testing, optimising and documenting, prior to making it available for people to evaluate. Like all new code, it will probably be fragile until it has been exercised for a while in some 'real' applications, but that doesn't put me off from trying new/different approaches to solving problems.

I didn't say that I consider cache optimisation unimportant. I do doubt that it is possible, in a meaningful way, for cross-platform development, or even practical for most purposes unless it is performed by compilers or interpreters tuned to the target platform. In the documentation for the Judy arrays, the authors note that the laboriously hand-crafted L1 cache optimisations they developed will probably not be (as) beneficial on IA32 or some RISC processors, which indicates part of the problem.

Even when correctly targeted, the benefits will often be transitory, as there is a second part to the problem. The Judy arrays are designed to allow large portions of the data structures to be searched/traversed whilst avoiding cache fill delays (on the targeted processor(s)), but the optimisations are only effective whilst the cache remains coherent.

If the application using them traverses the entire structure from a very localised piece of code that doesn't call other code that would cause the caches to be overwritten, then it will fully benefit from the optimisation. But if the code traversing the data structure calls other code that accesses a different data structure each time through the loop, then the cache will need to be re-filled for each element of the array.

And that's the crux of the problem: cache optimisations only ever work at the local level, but the code utilising the optimised structures is rarely so confined. One piece of code using two highly cache-optimised structures alternately will destroy the optimisations as it switches between them. The moment you have multiple applications using the same cache, there is no guarantee that the cache will remain coherent for even a single element access. The task will be interrupted randomly by the scheduler unless extraordinary measures (CritSecs or similar) are deployed to combat this, and whichever other task is scheduled next will overwrite the cache.

I did read somewhere about a concept of runtime optimisation performed by micro-code, but if I recall correctly this was aimed at highly parallelised algorithms like the FFTs used in weather analysis and similar high-volume, repetitive-data applications. I don't recall whether the concept was actually being implemented or if it was just blue sky, but I have my doubts as to its applicability to generalised processing and processors -- leastwise given the current state of processor development.

I'm not particularly impressed with DBs in general nor RDBMSs in particular. They force applications to structure their data in formats to fit the database, require the use of a language totally abstracted from the application to access it, and finally force conversions from the application's format to the DB format and back again. As generalised solutions they are fairly effective, but with that generalisation come penalties, and these go way beyond slow data access.

Whilst Moore's Law may be holding true for raw processor power, data volumes are still managing to outstrip it. More problematic is that we human beings are not yet able to effectively utilise the performance of our processors. The languages we use, and their compilers and interpreters, leave too much of the bits and bytes of implementation to the human being, or we code generalised, reusable solutions to problems and sacrifice better, application-specific algorithms and performance along the way.

I doubt this dilemma will resolve itself whilst we continue to hand-craft every piece of code, but equally, I don't see any sign that the current levels of reusable code designs are good enough to allow the range of real-world problems to be satisfactorily solved using a bolt-together-components-and-glue approach. We also need to move beyond using foreign formats (like tables/tuples, files/lines) for intermediate/persistent storage.

Ultimately, we need to be able to describe our applications at a much higher level, in terms of objects with attributes and the desired interactions between them and have compilers that read those descriptions and decide how to implement them.

The implementation produced at this level would be a logical implementation devoid of storage specifications or processor specific code. A virtual machine simulator would be used to 'run' the application for testing purposes. It would ensure that where attributes are interchanged between objects, that the objects have methods available to perform any conversions required. It would be able to generate data and generate tests to ensure code-path coverage.

Once the application has passed this level of testing, only then would it be passed to a second-level compiler to perform the conversion to machine code. This would select an appropriate internal storage format for each attribute, depending partly upon the word size etc. of the target processor and partly on usage information produced by the virtual machine runs. So, for example, if an attribute is numeric but its most frequent use is for display purposes, then it would be stored in ASCII (or Unicode) and only converted to binary when necessary, or manipulated using BCD or whatever. Conversely, if the attribute is predominantly used for math, it would be maintained in binary, with the size of the storage used determined on the basis of the virtual machine runs.

Once the processor-specific code has been compiled, the data and test scenarios generated by the VM compiler are automatically re-run, and machine-specific optimisations are performed on the basis of dynamic analysis rather than the simplistic static analysis that current compilers perform. Further, automatically generated logging and tracing could be used to monitor the performance of the application over its lifetime, and if the nature of its data or use changes in a way that indicates re-optimisation is required, then that could be achieved by re-generating the processor/platform-specific code from the compiled VM form.

So I dream :)

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Re: Re: Re: Re: Re: A (memory) poor man's hash by tilly (Archbishop) on Nov 25, 2003 at 03:51 UTC
On the low-memory hashing: I'm sure that what you are doing is possible. I'm sure that it saves memory. At a minor hit to performance you can get rid of the one illegal character limitation as follows:

    # Two useful auxiliary hashes
    my %trans_in = (
        "\xFD" => "\xFD",
        "\xFE" => "\xFD\xFE",
        "\xFF" => "\xFD\xFE\xFE",
    );
    my %trans_out = reverse %trans_in;

    # On the way in...
    $string =~ s/(\xFD|\xFE|\xFF)/$trans_in{$1}/g;

    # On the way out
    $string =~ s/(\xFD\xFE*)/$trans_out{$1}/g;

With cache optimization, we need to specify our goals first. If your goal is to achieve a universal large win, or to achieve any kind of ideal optimization, then optimizing cache coherency is an impossible ideal. But that isn't my goal. My goal would be to have data structures which will tend to perform better. And that is quite possible.

Sure, Judy arrays perform best in code that is limited to just working with the array in question. They will also perform better on chips that they have been tuned to, and tuning them to different chips is a never-ending process. However, the fact that you have paid attention to how cache-friendly your data structure is can win. Perhaps when you are working simultaneously with 3 Judy arrays you have each one ruining the cache coherency of the others. But what that means is that you are ageing data out of caches earlier than expected. So instead of finding something in level 1 cache it might be in level 2 cache. This is still a win over using a hash with its tendency to blow all caches, all the time.

Similarly it is clear that code that was tuned to one set of cache sizes won't work as well with a different set of cache sizes. However it is still a win. Furthermore my claim is that different exponents in Moore's law for improvements in chip speed versus improvements in the rate of data transfer mean that what was a win will tend to become a bigger win as time goes by. Sure, something else might now be optimal. But pretending that access times are flat will lose by more in the future than it does now.

On relational databases, I think that you are missing the boat. Sure, a relational database pushes programmers to restructure their data structures and push logic to the database. But the reason why it does that is that when you take that step, the database on the fly figures out far better algorithms than programmers normally would manage to figure out for themselves. Sure, there are a smidgeon of programmers who can beat the relational database. But nobody can beat it consistently without doing a lot of work. (Just consider the amount of reprogramming that you have to do to match what a DBA accomplishes by adding the right index to speed up existing queries.) And in addition to the speed win, nobody in their right mind thinks that they can match the transactional mechanics that a database provides by rolling their own at the application level.

Your dream of effectively utilizing the performance of our processors is the diametric opposite of my dream, and runs counter to every trend in programming.

In various aspects of programming (and life) we make many trade-offs. As the cost of any particular factor drops relative to the others, the natural tendency is to be willing to waste more of that factor to save on some or all of the others. If you like, you can think of this as a result of some sort of generalized Le Chatelier's Principle. I certainly do.

This can be carried to amazing lengths. For instance a merchant from 200 years ago would be astounded at the speed with which we routinely ship small parcels from New York City to Boston. Said merchant would also be flabbergasted at the idea of possibly routing said packages through Baltimore. But it makes sense to do so today, since the incremental costs of transporting things farther have fallen to a point where we are willing to trade absurd amounts of it for the efficiencies of centralized sorting and routing.

In programming this means that the natural response to having more processor speed to work with is not to figure out how to squeeze more speed out to achieve some ideal level of performance. Rather it is to view CPU performance as cheap and start trading it away for everything else that we can.

If we could get performance and everything else that we might want, that would be perfect. But your dream runs aground on the limits of the Halting Problem: you simply cannot compute a perfect static analysis. Oh, you can do heuristics (every optimizer does), and you can do better heuristics with runtime data. (Transmeta attempts to do so with their code-morphing software.) But those attempts add latency issues (if only for the latency to realize when the usage pattern changes), and will work worse and worse as you move to more and more dynamic techniques.

Well, that isn't entirely true. There are ways to program that allow optimization to be done on the fly (if only to pieces of your program) to make things run really well. Of course making that work means that programmers have to jump through a few hoops, and hand off problems to the computer for it to take final control over. Which can work well for both sides: the computer gets well-defined problems that it can work with, and humans get to define the hard bits.

I'm not sure that you would like that programming direction though. The most successful example of that method of cooperation is relational databases. Which you don't really like. But they do exactly what you want, right down to automatically generating logging and tracing to allow you to monitor and tune the database. In many cases people have the database set up to recalculate statistics periodically and improve its execution paths based on the current datasets and usage. (Freeing humans to worry less about the best algorithms, and allowing us to focus on higher order pieces.)
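The escaping scheme at the top of this reply can be checked with a quick round trip. The sketch below assumes that `\xFF` is the reserved separator byte, as in the parent post. The `escape`/`unescape` wrappers and the small test alphabet are hypothetical additions for illustration; only the two substitutions and the translation tables come from the reply.

```perl
use strict;
use warnings;

# Translation tables from the reply.
my %trans_in = (
    "\xFD" => "\xFD",
    "\xFE" => "\xFD\xFE",
    "\xFF" => "\xFD\xFE\xFE",
);
my %trans_out = reverse %trans_in;

# Hypothetical wrappers around the two substitutions.
sub escape {
    my ($string) = @_;
    $string =~ s/(\xFD|\xFE|\xFF)/$trans_in{$1}/g;
    return $string;
}

sub unescape {
    my ($string) = @_;
    $string =~ s/(\xFD\xFE*)/$trans_out{$1}/g;
    return $string;
}

# Try every 3-character string over an alphabet that includes all three
# special bytes. Count strings that still contain the reserved byte after
# escaping, or that fail to round-trip.
my @alphabet = ("A", "\xFD", "\xFE", "\xFF");
my $failures = 0;
for my $x (@alphabet) {
    for my $y (@alphabet) {
        for my $z (@alphabet) {
            my $original = "$x$y$z";
            my $escaped  = escape($original);
            $failures++ if $escaped =~ /\xFF/;
            $failures++ if unescape($escaped) ne $original;
        }
    }
}
print "failures: $failures\n";
```

The decoding step relies on a raw `\xFE` never appearing in escaped data without a `\xFD` somewhere before it, which is why each `\xFD` followed by zero, one, or two `\xFE` bytes maps back unambiguously.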
Re: Re: Re: Re: Re: Re: A (memory) poor man's hash by BrowserUk (Patriarch) on Nov 25, 2003 at 06:33 UTC
Nah. Using BOMs won't work.

For the rest -- you appear to take the greatest delight in finding the interpretation that is as close to 180 degrees opposed as is possible without shifting to a completely new subject.

You don't counter the arguments put forward. Instead, you introduce a subject vaguely related to the original subject matter, open with an obvious counter to a non sequitur not in discussion, and then support that obvious argument at length, with the implication that if you said "it is", then your opponent must have already said it isn't.

Shame! Positive arguments are so much more productive than negative ones.
Re: Re: Re: Re: Re: Re: Re: A (memory) poor man's hash by tilly (Archbishop) on Nov 25, 2003 at 15:58 UTC
I'm guessing that by BOM you mean Byte-Order Marking. (This is based on a Google search for an unknown acronym.) Which probably refers to the character escaping suggestion that I gave.

Please define what you mean by "won't work". Won't work as in doesn't do what you want? Possible, I don't know what you want. Won't work as in doesn't do what I said it does? That's another story. It does exactly what I said.

If you want to allow your key/value pairs to be able to hold arbitrary binary data in your data structure, pre- and post-process them as I described and you will succeed. The preprocessing gets rid of the character that you are using as a separator. The postprocessing recovers the original string data. Adding any processing takes time, so there is going to be some performance hit. My guess is not much of one, since Perl's RE engine is pretty fast and the REs in question are pretty simple.

So what you summarize by "Nah. Using BOMs won't work." actually works exactly like I said it did.

As for the rest, please be specific in your complaints. Vaguely waving hands doesn't give me much feedback. Here is a specific example to illustrate what I mean.

[...]I didn't say that I consider cache optimisation unimportant. I do doubt that it is possible, in a meaningful way, for cross-platform development, or even practical for most purposes unless it is performed by compilers or interpreters tuned to the target platform.[...]

[...]With cache optimization, we need to specify our goals first. If your goal is to achieve a universal large win, or to achieve any kind of ideal optimization, then optimizing cache coherency is an impossible ideal. But that isn't my goal. My goal would be to have data structures which will tend to perform better. And that is quite possible.[...]

[...]. Instead, you introduce a subject vaguely related to the original subject matter, open with an obvious counter to a non sequitur not in discussion, and then support that obvious argument at length, with the implication that if you said "it is", then your opponent must have already said it isn't.[...]

I could have taken that particular thread back further, but that is far enough.

From my point of view, your phrase "in a meaningful way" is unclear in the extreme. I don't know what you mean by that. I know what I would mean by that, and it clearly isn't what you mean, because I come to opposite conclusions. So I took pains to explain exactly how I would understand that phrase, and why my understanding leads me to a different conclusion than you came to. My hope was that by making it clear exactly where and why we differ in our views, we could clarify the difference in our perspectives.

But it seems that you have misinterpreted that as being a negative argument against you. :-(

I don't think that basic facts are really in dispute. Let me summarize them. Something like Judy arrays attempts to dynamically optimize itself to account for the cost of cache misses. I think that we agree on the following facts about Judy arrays:

- Making something like Judy arrays work takes a lot of work.
- The specific tradeoffs made by Judy arrays will work far better on some CPUs than others.
- Judy arrays can beat hashing on many different CPUs.
- Judy arrays can be used for pretty much the same things that hashing is used for.

I have more claims that I don't know whether you agree with:

- Even where usage patterns aren't exactly what Judy arrays are optimized best for, they are likely to be a win.
- Current Moore's law trends indicate that the ratio between how well Judy arrays and hashing perform will tend towards being more in favour of Judy arrays in future generations of chips.
- I suspect that the ratio between Judy arrays and an ideally designed data structure for the chip at hand will get worse over time.

Now my perspective of these claims is that replacing the black box of hashing with the black box of something like Judy arrays can be a meaningful and practical cache optimization for most purposes for cross-platform development, even though it is not specifically tuned to the target platform. Which is exactly what you claimed to doubt. OK, more exactly you stated, "I do doubt that it is possible, in a meaningful way, for cross-platform development, or even practical for most purposes unless it is performed by compilers or interpreters tuned to the target platform."

As I see it there are a few possible causes for that disagreement:

- Your perspective of what is "meaningful" is different than mine.
- You hadn't considered one or more of those claims.
- You think that one or more of those claims is wrong.

I still would like to understand that disagreement. I can guess. My guess is that we have very different aims when it comes to performance, so while I'm happy with a trivially achieved modest win, you are unhappy with anything less than the really major wins that you can see are possible, albeit with a lot of work. But I'm not sure of that guess, and I really don't understand the value system which makes performance that big of a goal.
Re: Re: Re: Re: Re: Re: Re: Re: A (memory) poor man's hash by BrowserUk (Patriarch) on Nov 26, 2003 at 09:46 UTC


	PerlMonks