in reply to A faster?, safer, user transparent, shared variable "locking" mechanism.
I see quite a few interesting perl-internals and computer science reasons discussed as to why that would be difficult (if not impossible) to do well. However, I hadn't seen any x86 arguments.
I used to be an x86 assembly programmer, so here's my take on it. (Note: my last asm programming was for a Pentium Pro, so the newer chips may have much-improved timings. Though I doubt that it would be enough to help...)
In those days, at any rate, you could execute a simple locking construct in just a few ticks. Just getting to the exception handler took many times more ticks. Then there were the page-table manipulations: those were/are expensive operations as well. So even if you were able to solve the other issues raised ... it would be slower rather than faster. It could make for shorter, simpler code for the user ... but timing-wise, it's a dog.
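For concreteness, the kind of "simple locking construct" being described can be sketched as a user-space spin lock: the uncontended acquire is a single atomic test-and-set (a LOCK-prefixed instruction on x86) plus a branch, with no kernel involvement. This is an illustrative sketch using C11 atomics, not code from the post:

```c
#include <stdatomic.h>

/* A minimal user-space spin lock.  The uncontended acquire is one
 * atomic test-and-set (a LOCK-prefixed instruction on x86) plus a
 * branch -- "just a few ticks", never entering the kernel. */
typedef struct { atomic_flag held; } spinlock_t;

static void spin_init(spinlock_t *l)
{
    atomic_flag_clear(&l->held);
}

static void spin_lock(spinlock_t *l)
{
    /* Spin until the flag was previously clear, i.e. we acquired it. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Compare that with the cost of taking a fault: even a cheap exception path runs to hundreds or thousands of cycles before any handler code executes, which is the comparison the post is making.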
Re^2: A faster?, safer, user transparent, shared variable "locking" mechanism.
by BrowserUk (Patriarch) on Nov 01, 2006 at 09:25 UTC
roboticus: Thanks for your response on the technical limitations of the idea. It's much appreciated.

> ... you can execute a simple locking construct in just a few ticks.

I'm interested to know what kind of "simple locking mechanism" you are alluding to? At the assembler level there are the bittest&reset opcodes (BTR) and other cheap (intra-process-only) mutex mechanisms that can be utilised, but these are not directly available to the C programmer and often cannot be used. There is a rather old, but still interesting, article comparing NT threads with Solaris threads at usenix.org. Here's a quote from that paper:
Of the mechanisms available under NT (pre-Vista), I chose the critical section because it is the cheapest for the no-contention case, as it avoids a call into the kernel. This means negligible impact on normal, non-shared variables. I should also point out that critical sections are already used internally by win32 threaded builds of Perl. The problem is that user locks based around the pthreads cond_* APIs are implemented using NT Semaphores, and the way (and timing) in which these are used is the cause of the lamentable performance of locking on shared data in P5/iThreads. In many cases, both mechanisms need to be deployed to achieve concurrency safety of shared variables, with the following impact:
As you can see, the impact of the internal (critsec) locking on a shared hash, compared to a non-shared hash, is minimal at 1%. However, once you apply user locking to the shared hash, the impact goes up markedly, with performance being more than halved. And remember, the above is a single-threaded test, so there can be no contention or wait states involved. If the need for user-level locks, and/or their impact in the non-contention case, could be alleviated, that would go a long way to removing the impact of threaded builds on non-threaded code.

> ... Just getting to the exception handler took many times more ticks.

Agreed. But again, you only transit into the exception handler in the event of contention on shared data. For the non-contention and non-shared cases, the exception handler is not invoked and so there is no impact at all.

> Then doing the page table manipulations: those were/are expensive operations also.

Again, page-table manipulations are only needed in the shared-data case. In this case, at least one call into the kernel is required anyway, to acquire a mutex, along with the (potential) wait. The impact of a call to reset the PAGE_GUARD attribute is fairly inconsequential on this path. Another page attribute that might lend itself to the purpose is MEM_WRITE_WATCH. This is the basis of the NT implementation of COW memory. A similar mechanism is used by *nix OSs to implement the famous COW sharing of data between forks, which appears to operate pretty efficiently.

I hear what you are saying with regard to performance when the exception path is taken and page-table manipulations are required, but relative to the performance impact of the existing locking mechanisms used by iThreads (which is huge), I think that it might not be so bad. There were/are three motivations for thinking about this:
So, whilst I am sure that you are correct in saying that transfers into exception handlers and page-table manipulations are relatively expensive, when you look at the costs involved in the current situation, they may not be as expensive, relatively speaking, as first appears; i.e. when you are no longer comparing them with assembler-level, user-space-only bittest&reset opcodes. Assuming that is what you were comparing them against.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by roboticus (Chancellor) on Nov 02, 2006 at 12:07 UTC
> ... you can execute a simple locking construct in just a few ticks.

Many C/C++ compilers provide an _asm keyword (or equivalent) to give access to the various atomic instructions you can use for mutexes and such. However, I'll grant you that one and simply postulate that some kind author provided a library routine to do the job (and give away a few ticks for stack frame creation/teardown). Then I'll generously grant you the same facility to provide your C/C++ program the ability to run the instructions to alter the page tables and such. And while you correctly point out that my initial response concentrated too much on the timing consequences of where the exception was taken, there are two other aspects that I should've mentioned.

First: the page-table instructions are slow, too. IIRC, they're slower than the simple locks.

Second: even if they've gotten faster, they're privileged instructions. So in order for Perl to use them, it has to switch to kernel mode to execute the instructions, then switch back to user mode to run. Two context switches are expensive to pay for every variable modification, even if the exception never triggers. Adding in the fact that you'd have to find a way for Perl to get the permissions to do so on all platforms concerned is just another problem to deal with...
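These days the atomic instructions are reachable from C without an _asm block at all: with GCC or Clang, the __atomic builtins compile to LOCK-prefixed bit operations (the BTS/BTR family mentioned earlier) on x86. A hedged sketch of a bit-granular try-lock built that way (the names are illustrative, not from the thread):

```c
#include <stdbool.h>

/* A bit-granular try-lock built on GCC/Clang atomic builtins, which
 * on x86 compile to LOCK-prefixed read-modify-write instructions --
 * no _asm block, no kernel call, one lock per bit of the word. */
static unsigned lock_bits;   /* up to 32 independent bit locks */

static bool bit_try_lock(unsigned bit)
{
    unsigned mask = 1u << bit;
    unsigned old  = __atomic_fetch_or(&lock_bits, mask, __ATOMIC_ACQUIRE);
    return (old & mask) == 0;   /* we flipped it 0 -> 1, so we own it */
}

static void bit_unlock(unsigned bit)
{
    __atomic_fetch_and(&lock_bits, ~(1u << bit), __ATOMIC_RELEASE);
}
```

None of this helps with the page-table side, though: those instructions really are privileged, which is the point being made above.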
> ... Just getting to the exception handler took many times more ticks.

As I read your original post, I thought you were suggesting that you would get the exception each time you write to the variable, rather than each time you get contention for it. For a read-only (or read-nearly-always) data structure, I can see where this could be a win. But for a frequently-written value, I don't see how you can get the speed you need with the requirement to have two context switches for each write.
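The write-watch pattern being debated here can be sketched in POSIX terms (a Linux-flavoured illustration, not Perl or Win32 code, and glossing over async-signal-safety caveats): the page is armed read-only with mprotect(2); only the first write faults into the handler, which records the access and restores write permission, so subsequent writes run at full speed.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *watched_page;               /* one page we are watching */
static size_t page_size;
static volatile sig_atomic_t write_seen;  /* set on first write */

/* Fault handler: taken only on the *first* write to the armed page.
 * It records the write and restores PROT_WRITE, so later writes
 * proceed with no further faults -- the expensive path is paid once,
 * not on every modification. */
static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    char *addr = (char *)si->si_addr;
    (void)sig; (void)ctx;
    if (addr >= watched_page && addr < watched_page + page_size) {
        write_seen = 1;
        mprotect(watched_page, page_size, PROT_READ | PROT_WRITE);
    } else {
        _exit(1);   /* a genuine crash, not our watched page */
    }
}

/* Allocate one page, install the handler, and arm the watch. */
static char *watch_alloc(void)
{
    struct sigaction sa = {0};
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    watched_page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    mprotect(watched_page, page_size, PROT_READ);  /* writes now fault */
    return watched_page;
}
```

This makes the cost model concrete: a read-mostly structure pays one fault per page per "generation", while a frequently-written value would pay it every time the watch is re-armed, which is the objection raised above.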
> I hear what you are saying with regard to performance when the exception path is taken and page table manipulations are required, but relative to the performance impact of the existing locking mechanisms used by iThreads (which is huge), I think that it might not be so bad.

You may well be correct there ... I don't know what iThreads uses for locking.
> So, whilst I am sure that you are correct in saying that transfers into exception handlers and page table manipulations are relatively expensive, when you look at the costs involved in the current situation they may not be as expensive, relatively speaking, as first appears; ie. when you are no longer comparing them with assembler level, user-space only bittest&reset opcodes. Assuming that is what you were comparing them against.

Well, it's clear that you've put far more thought into it than I have, so I'm willing to be proven wrong. (Actually, since I'm a speed junkie, I'd *love* it! A faster Perl would be a boon for all.) And, yes, I was relying on the simple user-space locks, assuming that you're sharing the data with other threads in the same program... Interesting ... I'll have to chew on it for a while and see what I come up with.
by BrowserUk (Patriarch) on Nov 02, 2006 at 12:55 UTC
> yes, I was relying on the simple user-space locks, assuming that you're sharing the data with other threads in the same program...

This is an excellent point, and one I've visited before. I proposed a similar solution to the Parrot team a couple of years ago: using purely user-space locking, rather than OS/kernel-provided locking mechanisms, for thread synchronisation. Back then, the idea was dismissed out of hand. I accepted this at the time on the basis that Parrot has to run everywhere--including lots of exotic architectures (SMP, NUMA et al.), on many different processors, and under many different OSs; stuff which I simply wasn't conversant with or able to get up to speed on--so maybe there were places where user-space locking based around atomic instructions accessing shared, user-space memory simply wasn't possible.

Your bringing this up again has rekindled that spark of a question, and I'm now wondering whether there are really any OS/architecture/CPU combinations that would be possible targets for Parrot, and capable of running threads, that don't provide the necessary primitives to support user-space-only locks? That's a difficult question to answer and even harder to prove. There is also the rather simple expedient of hiding the locking behind macros (or preferably functions, given my experience of trying to unravel Perl5 macros when things don't work). On those platforms where user-space locks are either not available or not easily providable, the macros would use OS-provided mechanisms.

Thinking about this has resurrected another earlier notion that fell by the wayside as a result of having to defend the overall proposals I put forward--simultaneously having to defend against both "Redmond just took a unix idea and renamed it" and "that's a Redmond-only solution" :( The idea is that you use a pool of mutexes keyed by the data (address) they are protecting, reusing them cyclically.
This was something I used years ago to improve the granularity of locking whilst avoiding the memory inflation of one mutex per shared data item. It worked well back then and was very tunable.

Hm. Despite the amount of thought I've already expended on this, I can see it occupying a lot more. Trouble is, most of it will probably be wasted time, as nothing I say is likely to have much influence. I'd be more than interested to hear the results of your further deliberations, even if it's only of academic interest :)
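The mutex-pool idea described above (a fixed pool of mutexes, selected by hashing the address of the data being protected) might look something like this in C with pthreads; the pool size and hash are illustrative tuning choices, not anything from the original scheme:

```c
#include <pthread.h>
#include <stdint.h>

/* A fixed, cyclically reused pool of mutexes keyed by data address:
 * bounded memory instead of one mutex per shared item.  Two distinct
 * addresses may hash to the same slot; that costs occasional false
 * contention but never correctness.  POOL_SIZE is the tuning knob. */
#define POOL_SIZE 64u

static pthread_mutex_t pool[POOL_SIZE];
static pthread_once_t  pool_once = PTHREAD_ONCE_INIT;

static void pool_init(void)
{
    for (unsigned i = 0; i < POOL_SIZE; i++)
        pthread_mutex_init(&pool[i], NULL);
}

static pthread_mutex_t *mutex_for(const void *addr)
{
    pthread_once(&pool_once, pool_init);
    /* Discard the low bits (mostly alignment) before taking the slot. */
    return &pool[((uintptr_t)addr >> 4) % POOL_SIZE];
}

static void data_lock(const void *addr)
{
    pthread_mutex_lock(mutex_for(addr));
}

static void data_unlock(const void *addr)
{
    pthread_mutex_unlock(mutex_for(addr));
}
```

The granularity/memory trade-off is exactly the tunable mentioned: a larger pool means less false contention, a smaller one means less memory.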
by roboticus (Chancellor) on Nov 04, 2006 at 03:33 UTC