Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Re: Re: Re: Re: Re: Re: Re: A (memory) poor man's hash

by tilly (Archbishop)
on Nov 25, 2003 at 15:58 UTC ( #309940=note: print w/replies, xml ) Need Help??

in reply to Re: Re: Re: Re: Re: Re: A (memory) poor man's hash
in thread A (memory) poor man's <strike>hash</strike> lookup table.

I'm guessing that by BOM you mean Byte-Order Marking. (This based on a google search for an unknown acronym.) Which probably refers to the character escaping suggestion that I gave.

Please define what you mean by, "won't work". Won't work as in doesn't do what you want? Possible, I don't know what you want. Won't work as in doesn't do what I said it does? That's another story. It does exactly what I said.

If you want to allow your key/value pairs to be able to hold arbitrary binary data in your datastructure, pre and post process them as I described and you will succeed. The preprocessing gets rid of the character that you are using as a separator. The postprocessing recovers the original string data. Adding any processing takes time, so there is going to be some performance hit. My guess is not much of one since Perl's RE engine is pretty fast and the REs in question are pretty simple.

So what you summarize by, Nah. Using BOMs won't work. actually works exactly like I said it did.

As for the rest, please be specific in your complaints. Vaguely waving hands doesn't give me much feedback.

Here is a specific example to illustrate what I mean.

[...]I didn't say that I consider cache optimisation unimportant. I do doubt that it is possible, in a meaningful way, for cross-platform development, or even practical for most purposes unless it is performed by compilers or interpreters tuned to the target platform.[...]
[...]With cache optimization, we need to specify our goals first. If your goal is to achieve a universal large win, or to achieve any kind of ideal optimization, then optimizing cache coherency is an impossible ideal. But that isn't my goal. My goal would be to have data structures which will tend to perform better. And that is quite possible.[...]
[...]. Instead, you introduce a subject vaguely related to the original subject matter, open with an obvious counter to a non-sequita not in in discussion, and then support that obvious arguement at length, with the implication that if you said "it is", then your opponent must have already said it isn't.[...]
I could have taken that particular thread back further, but that is far enough.

From my point of view, your phrase in a meaningful way is unclear in the extreme. I don't know what you mean by that. I know what I would mean by that, and it clearly isn't what you mean because I come to opposite conclusions. So I took pains to explain exactly how I would understand that phrase, and why my understanding leads me to a different conclusion than you came to.

My hope was that by making it clear exactly where and why we differ in our views that we could clarify the difference in our perspectives. But it seems that you have misinterpreted that as being a negative argument against you. :-(

I don't think that basic facts are really in dispute. Let me summarize them. Something like Judy arrays attempt to dynamically optimize themselves to account for the cost of cache misses. I think that we agree on the following facts about Judy arrays:

  1. Making something like Judy arrays work takes a lot of work.
  2. The specific tradeoffs made by Judy arrays will work far better on some CPUs than others.
  3. Judy arrays can beat hashing on many different CPUs.
  4. Judy arrays can be used for pretty much the same things that hashing is used.
I have more claims that I don't know whether you agree with.
  1. Even where usage patterns aren't exactly what Judy arrays are optimized best for, they are likely to be a win.
  2. Current Moore's law trends indicate that the ratio between how well Judy arrays and hashing perform will temd towards being more in favour of Judy arrays in future generations of chips.
  3. I suspect that the ratio between Judy arrays and an ideally designed data structure for the chip at hand will get worse over time.
Now my perspective of these claims is that replacing the black box of hashing with the black box of something like Judy arrays can be a meaningful and practical cache optimization for most purposes for crossplatform development even though it is not specifically tuned to the target platform. Which is exactly what you claimed to doubt. OK, more exactly you stated, I do doubt that it is possible, in a meaningful way, for cross-platform development, or even practical for most purposes unless it is performed by compilers or interpreters tuned to the target platform.

As I see it there are a few possible causes for that disagreement:

  • Your perspective of what is "meaningful" is different than mine.
  • You hadn't considered one or more of those claims.
  • You think that one or more of those claims is wrong.
I still would like to understand that disagreement. I can guess. My guess is that we have very different aims when it comes to performance, so while I'm happy with a trivially achieved modest win, you are unhappy with anything less than the really major wins that you can see are possible, albeit with a lot of work.

But I'm not sure of that guess, and I really don't understand the value system which makes performance that big of a goal.

  • Comment on Re: Re: Re: Re: Re: Re: Re: A (memory) poor man's hash

Replies are listed 'Best First'.
Re: Re: Re: Re: Re: Re: Re: Re: A (memory) poor man's hash
by BrowserUk (Patriarch) on Nov 26, 2003 at 09:46 UTC

    Okay Tilly, if you want to play the game of revising history through selective quotes, let's do that. Your first reply contained 3 points.

    Reply 1

    • First an important correction.

      This is not a correction. As explained and demonstrated in subsequent posts, creating a 1 million key hash, using perl 5.8, under NT4 on my machine requires at least 98 MB. Regardless of whether any of the difference between the 32/44 MB returned by Devel::Size is available for use by the rest of the perl program or even if some of it might get returned to the OS under low memory situations. The memory has to be there in the first place in order to create the hash.

      But the damage is done. Tilly says that it only takes 60 MB, which makes the rest of the statistics in the OP are therefore suspect.

    • ...always, always, always use a B_TREE.

      A good tip if I was interested in using DB_File.

      Even better if the DB_File documentation (which IMO is pants!) suggested this.

      Better yet if this was recommended each time DB_File is offered as a solution to the "out of memory" problem when using large hashes here at PM.

    • And this leads to a general note.... But the real world is moving more and more towards multiple tiers of data storage....

      Does this imply that Perl doesn't live in the real world? Or only me?

      Perl uses hashes extensively internally, as well as recommending them for a large percentage of it's cookbook solutions. On the one hand, my opening paragraphs, designed to make it clear that I was not denigrating perl's hash mechanisms and implementation, is taken as denigration. On the other, you take it as a reason to imply that if I knew better, then I would be so enamoured with them and would be using something different.

    To summarise and paraphrase.

    Your statistics are bad, DB_File is perfect, and if you knew better, you wouldn't be so enamoured with hashes in the first place because they are old hat and destined to be replaced.

    Does any of this help in solving or smoothing the performance knee that arises for perl 5 programmer dealing with large keyed data sets today?

    Despite this, I attempted to address each of your points seriously in a constructive and open-minded manner.

    Reply 2

    I disagree that optimising for better use of multi-level memory is unimportant though....

    Can you point to where I said such optimisation was "unimportant"?

    I did say that such optimisations were

    1. Very processor specific.
    2. Required in-depth knowledge of the caching strategies of the specific CPU and possibly memory management strategies of the OS.
    3. Was a task that took even experts considerable time and effort to perform, even for a single platform.
    4. Had dubious benefits, especially from the viewpoint of the VHL programmer utilising a multi-processing environment, as they do not exercise sufficient control over the creation and execution of the machine level code in order to even begin to consider this issue in there code.
    5. Was something best left to compilers and interpreters rather than the average application programmer.

    I'm still somewhat bemused by the relevance this entire subject had relative to the OP. The technique I outlined in the OP was aimed not at performance, but at conserving memory. And at trading memory conservation for performance whilst smoothing the knee of the transition. Perl's native hashes are already as fast as they are likely to be. Maybe Judy arrays would be faster, but having spent 3 hours reading the 10-minutes "explanation" of them, which gave little or no information on how to actually implement them I moved onto the 3-hours diatribe which repetitively extols their virtues without giving any real information upon their implementation. Whilst the code may be available for download,

    • the caching optimisations are aimed specifically at one specific set of HP processors.
    • Is so complex, that it took the experts in both that code base, and the processors in question, over a year of full-time, corporate-sponsored development to complete the final revisions and optimisation for that platform.
    • Would need to be incorporated directly into the perl 5 code base in order that they would be beneficial. Attempting to provide them to perl 5 as an add-on would necessitate utilising the tie interface which would immediately throw away most of their benefits over perl's hash implementation.

      Perl's cross-platform imperative, would make the cache level optimisations almost impossible to realise in a generic form.

    Until someone is willing and capable of providing a native C implementation of Judy arrays for incorporation directly into the perl code base, Judy arrays and the cache-level performance optimisations that can, with considerable effort, be derived from them, will be nothing but a blue-sky desire for the average perl 5 programmer. Whether the performance benefits are required is questionable as perl's hashes are already pretty fast and very flexible. Whether they would in any way address the memory consumption issue is beyond my ability to predict, but it is unlikely that they will become available any time soon. It is this latter issue that I was attempting to solve.

    Your third reply

    On the low-memory hashing. I'm sure that what you are doing is possible. I'm sure that it saves memory.

    Damned, with faint praise. I know that it is possible, I said as much.

    With cache optimization, we need to specify our goals first.

    Cache optimisation is not my goal.

    ...Sure, something else might now be optimal. But pretending that access times are flat will lose by more in the future than it does now.

    I'm not pretending anything. I acknowledged the reality that perl's hashes, useful as they are now, have a limitation for certain applications. I attempted to find a solution for this limitation that could be made available now. I believed I had the germ of such a solution and meditated on it in the hope that the collective resource of PM could see a way of utilising that germ to bring a sufficiently generic solution to fruition. Now, or in the immediate future.

    The current design and implementation of Perl has many such limitations that affect particular, specific applications. Many of these are un-addressable given the current code base. P6/Parrot and Ponie offer the potential for addressing these, but these remain some time away from being available to the average perl programmer for production use.

    On relational databases, I think that you are missing the boat.

    Sorry, but the RDBMS boat sailed in the 1970's and reached it's destination in the 1980's. Little or nothing has changed on the RDBMS front since then except that their proprietary internals have been tuned in ways that are specific to the implementations --each proprietary solution vying for supremacy in each area. Whilst open-source RDBMSs are now available, they are somewhere between 10 and 20 years behind in development and many (most?) do not share the benefits of the transactional mechanics, nor efficient stored procedures nor the triggers to effectively use them.

    The only way to realise high performance from an RDBMS is to move as much of the processing into the domain of the RDBMS as possible which requires the use of triggers and stored procedures. The reason for this is that the communications between the application and the RDBMS and the required, associated locking are relatively inefficient. It therefore necessitates keeping the communication between the application and the RDBMS to a minimum. Where efficiency is a consideration, this requires the application programmer to move as much of the logic of his application into the RDBMS as possible so that as much work can be done as a result if each communication as possible. Preferably, the work should be done automatically (triggered) as a consequence of other communications (inserts) as possible. With the high-end RDBMSs, this is possible and efficient.

    The downside is that utilising these proprietary mechanisms, besides causing lock-in, requires that all applications utilising the RDBMS be written and structured in a way that is friendly to it. This includes limiting the data formats and structures used to those the RDBMS supports, or imposes the expense of converting between the applications formats and those of the RDBMS. It also requires the use of SQL (plus the myriad proprietary extensions to it).

    The abstract nature of SQL is very convenient for the RDBMS producer, but not at all for the application writer. The time spent normalising recursive data-structures, many-to-many relationships, and breaking apart/ reconstructing complex compound data structures in order to serialise them to and from an RDBMS is almost pure wasted effort in most cases. Just as hierarchal DBs were very good at representing hierarchal data, but inefficient and clumsy at handling other types, so imposing a relational view upon non-relational data is equally inefficient and clumsy.

    The number of times I have been party to the whole 2inner-join versus outer-join quandry", or the "my data is variable length and greater than 255 chars but I need to sub-select the records using a full text search but blobs/clobs/* don't support that.", and the many other issues that arise solely as a result of using RDBMSs, are part of the source of my distaste for them. The levels of abstraction that RDBMSs impose upon the application programmer, and the time spent complying with and pandering to those abstractions are nothing more than a drain upon resources.

    The only time data truly benefits from the use of such abstraction is when the data may be compounded with other data held within the same DB (Note: DB,, not RDBMS) and utilised in another application at some point in the future. This is the oft-quoted holy grail of the RDBMS, but it is so rarely actually used in the real world because most applications place their data in separate DBs. It is not possible to effectively compound the data from several applications unless there data reside in the same DB!. In order that the data from several application be effectively combined in a single DB, requires that a schema be designed that permits all required uses of the data be met efficiently. In order that the 'future, compounded data applications' requirements be met by the compound schema, requires the DBA designing the schema have knowledge of those requirements (or guess them) when designing the schema. Whilst a new schema can be devised at a later data and the existing data transformed to that new schema, the chances that the original applications will be unaffected by that transformation and not need to be re-written are very slim.

    The singularly most pervasive and persistent paradigm in computing to date is OOD & OOP, and it would be a brave man to predict either it's demise or supercedence.

    OO is not inherently relational nor is it easily transmuted into the relational abstraction. It is easy and productive to design OO-systems that contain recursive, many-to-many and inherited data-structures. Converting between these and a relational model for the purposes of persistence is slow, messy, laborious and fraught with traps for the application programmer. Whilst the rules are pretty clearly defined and a DBA fully conversant with relational theory is capable of performing the transformations required, in most shops I have been party to, the RDBMS and the departments that manage them are not set up to assist the application programmer. They are often structured such that problems encountered by the AP with his application data are considered "his problem", and the data he deposits in the RDBMS become detached from his control.

    In your example,

    (Just consider the amount of reprogramming that you have to do to match what a DBA accomplishes by adding the right index to speed up existing queries.)
    , in most cases, any speed gained by an application by the addition of the right index, would be entirely unnecessary if the data was maintained locally by the application, already so indexed.

    In fact, it is entirely possible to encode the rules for normalisation and conversion of these data structure types such that the conversion to a relational view can be performed auto magically. It should be possible for the OO-programmer to throw his objects at his central data store and have it perform whatever conversions are required to store them. Having every programmer produce code to perform these types of conversions is unproductive and risky. Structuring every application around the constraints, rules and demands of the relational model and the SQL abstraction similarly so. The purpose of an abstraction should be to hide the implementation detail from the concern of the user of the abstraction. Neither the relational model nor SQL do this. They serve only one purpose -- to simplify the task of the RDBMS writer.

    As an OO programmer, I want to be able to create an instance of an object and be given an handle to it. If that instance need to persist beyond the life of the program, I want to be able to indicate this at creation time, or invoke a method that causes that to happen. And that should be that.

    Retrieving that instance from another application should be transparent to me. I create the instance using whatever initialisation (derived from whatever source) and if that instance already exists in the DB, I should get a handle to it. If not, it should be created and I get a handle to that new instance. The process of serialisation, conversion to the underlying storage medium, normalisation, indexing etc. should be entirely transparent to me. Then, and only then, will I be able to concentrate on writing my application instead of spending my time trying to adapt the requirements of my application to the limitations, rules and specificities of the DMBS.

    That's the source of my distaste for RDMBSs. They force me, as an application programmer, to adapt my application to them, rather than the other way around. Imagine that my garage was 4 feet wide and 30 feet long and required me to split my car in two in order to park it. A crude analogy, but pertinent.

    Your dream of effectively utilizing the performance of our processors is the diametric opposite of my dream, and runs counter to every trend in programming.

    My immediate reaction to this sentence was: What is the point in having hugely fast processors if we are unable to utilise them? The dream I described was the dream that the tools we use when writing programs would perform more of the detail of the programming function freeing the human being to concentrate upon the bigger picture of design. C compilers often laud the fact that they can compile n lines of code in milliseconds. I fail to see how this performance benefits the programmer, if the processor sits nearly idle while he spends hours writing and running tests or trying to find the reasons why the code produced in those milliseconds doesn't work?

    What I was envisioning is the time when the compiler utilises the performance of the processor to automate as many of these tasks as possible. If there was a mechanism by which the programmer could clearly define the requirements, constraints and boundary conditions of his algorithms, then it would be possible for the compiler to generate test data and test code to exercise that algorithm. Perhaps my choice of example was flawed, but you introduced the cache-utilisation-optimisation scenario into the discussion, I simply ran with it. My point was that the rules for such optimisations should be encapsulated into the same, processor specific place as the mechanisms that produce the processor specific code -- the compiler, and not be the concern of the AP.

    I agree that processor performance is cheap, and given so, it should be thrown at whatever tasks can be easily automated leaving the programmer to concentrate on those tasks that aren't. It will be a while before a CPU is capable of gathering and analysing requirements, and designing applications, but the process of dynamically analysing an algorithm and applying a set of processor specific rules to enhance the performance of that algorithm is well within it's grasp. Most compilers, and even interpreters already do this to some extent. Codifying whatever rules and methods that can be used by the programmer to do this (as exemplified by Judy arrays) would be much simpler and ultimately more productive. Doing this dynamically on a virtual representation of the application, performing benchmarking and profiling, optimising data structures used by objects and algorithms is both conceivable and would be considerably more effective than the current peep-hole style optimisations that result from static analysis. If my compiler spent 10 minutes or even 10 hours rather than 10 seconds compiling my application, but performed this type of analysis and optimisation rather than my having to do it myself, I would be extremely happy.

    This is what I meant by more fully utilising the performance of my CPU. Having the tools I use perform more of the drudgery of the job, so that I can do less. This is the essence of VHLs and (IMO) the greatest benefit of Perl over C or assembler. I enjoy assembler as a hobby, just as I enjoy hand-crafting things in wood, but if cars were manufactured using nothing more than hacksaws, files, rules and scribes, then they would be far more expensive and much less developed than they are now. The secret of productivity is to have tools that do more of the work, more quickly, more accurately and with less supervision. Computers, as tools, are being applied in every area of human endeavour to this end with amazing results.

    With one, notable and lamentable exception -- software development.

Your last post

As for the rest, please be specific in your complaints. Vaguely waving hands doesn't give me much feedback.

I consider that my last post was specific, contained no hand-waving and was very targeted in it's feedback.

The whole discussion regarding cache optimisations and Judy arrays is nothing more than an entertaining blind alley with respect to solving the original problem until all those that have complained of the memory hungry nature of perl's hashes have access to an implementation of Judy arrays optimised for their platform from perl 5. I'll admit to have continued the discussion, and having moved beyond it and into even more blue-sky arenas in my attempt to be seen to maintain my (rare) intellectual interaction. Perhaps that was a mistake on my behalf, but I did attempt to segregate that discussion from the purpose of the meditation.

My complaint is that by your choosing to combine

  • unfounded negative comment on the original idea: "60 MB not 95".
  • with arguments against statements I didn't make: "optimizing for better use of multi-level memory is unimportant".
  • with alternative mechanisms which aren't available: Judy arrays.
  • with phraseology that implies your skepticism as to the utility of an implementation of the original idea: "I'm sure that what you are doing is possible. I'm sure that it saves memory."
  • With an apparent solution to a non-problem: "get rid of the one illegal character limitation".
  • with barely disguised personal insults on the basis of ideas that I never expressed: "...nobody in their right mind thinks that they can match transactional mechanics that database provides by rolling their own at the application level".
  • Along with specious analogies: "This can be carried to amazing lengths. For instance a merchant from 200 years ago would be astounded at the speed with which we routinely ship small parcels from New York City to Boston. Said merchant would also be flabbergasted at the idea of possibly routing said packages through Baltimore."

    (I'll remind you that the only mention of performance in my original post was my desire to address the memory issue without sacrificing too much performance. It was you that introduced the micro-optimisation of cache utilisation into the discussion.)

  • with self-contradictory arguments:
    Well that isn't entirely true. There are ways to program that allow optimisation to be done on the fly (if only to pieces of your program) to make things run really well. Of course making that work means that programmers have to jump a few hoops, and hand off problems to the computer for it to take final control over. Which can work well for both sides, the computer gets well-defined problems that it can work with, and humans get to define the hard bits.

    I'm not sure that you would like that programming direction though. The most successful example of that method of cooperation is relational databases. Which you don't really like. But they do exactly what you want, right down to automatically generating logging and tracing to allow you to monitor and tune the database. In many cases people have the database set up to recalculate statistics periodically and improve its execution paths based on what the current data sets and usage. (Freeing humans to worry less about the best algorithms, and allowing us to focus on higher order pieces.)

    On the one hand you say it isn't possible or is undesirable due to the hoops it makes it makes the programmer jump through. On the other hand you say that the relational databases already do this better than anything else.

    My argument is that the hoops that RDBMSs make the programmer jump through create more problems than they solve.And that many of the problems that RDBMSs reportedly solve, arise purely as a side-effect of using RDBMS's. It is the very nature of the beasts that creates most of the problems that require the great expense of high-end RDBMSs to solve. Many of the problems they claim to solve only exist because they force the application programmer to adapt the natural order of his data to their constraints.

    However, that some high-end RDBMSs can perform automate dynamic analysis and optimisation means that the techniques and heuristics required are possible. All that is required is to target those techniques and heuristics at solving the problems encountered by the AP in meeting the application requirements rather than targeting them at solving the problems created adapting those requirements to the rules and constraints of the RDBMS.

This combined with your reputation as (one of) PM's most valuable contributors means that my careful analysis and the time I spent documenting it becomes completely diluted by the FUD, to the point where the effort that went in to it, and the possible outcome are demerited beyond viability.

Through one piece of misinformation and arguments against several non-sequitous statements that I didn't make, you have effectively rendered the efforts I made to finding a solution to the original problem, which is real enough that a super-search of this site brings up many examples of it, completely discredited. As such, the possible solution will never see the light of day, and I will be even more reluctant to ever proffer my ideas in this forum again.

I make no claims to my abilities nor suggest that my opinions have any merit. I'm a coder pure and simple. I have tried very hard to limit the expression of my ideas to those areas where I was able to back up those ideas with measurable and demonstrable facts. This is not the first time that my expression of simple, verifiably, statements of fact, combined with a directly measurable solution, having been attacked and demerited on the basis of unsubstantiated doubts expressed by a reputable monk.

It will however, be the last time -- which was probably the aim.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://309940]
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (4)
As of 2023-12-03 02:00 GMT
Find Nodes?
    Voting Booth?
    What's your preferred 'use VERSION' for new CPAN modules in 2023?

    Results (20 votes). Check out past polls.