in reply to Re: A faster?, safer, user transparent, shared variable "locking" mechanism.
in thread A faster?, safer, user transparent, shared variable "locking" mechanism.

roboticus: Thanks for your response on the technical limitations of the idea. It's much appreciated.

... you can execute a simple locking construct in just a few ticks.

I'm interested to know what kind of "simple locking construct" you are alluding to?

At the assembler level, there are the bit-test-and-reset opcodes (BTR) and other cheap (intra-process-only) mutex mechanisms that can be utilised, but these are not directly available to the C programmer and often cannot be used.
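
For illustration, here is a minimal user-space spinlock sketch built from the compiler intrinsics that wrap those opcodes. It assumes MSVC's _interlockedbittestandset/_interlockedbittestandreset; other compilers spell these differently:

    /* A minimal user-space spinlock sketch, assuming MSVC intrinsics.
       The lock word must be zero-initialised; bit 0 is the lock flag. */
    #include <windows.h>
    #include <intrin.h>

    void spin_acquire( long *lock ) {
        /* _interlockedbittestandset returns the bit's previous value,
           so loop until we are the caller that flipped it from 0 to 1. */
        while( _interlockedbittestandset( lock, 0 ) )
            Sleep( 0 );     /* yield the timeslice rather than burn CPU */
    }

    void spin_release( long *lock ) {
        _interlockedbittestandreset( lock, 0 );
    }

The uncontended path here is a single atomic instruction plus the function-call overhead; no kernel transition is involved.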

There is a rather old, but still interesting, article comparing NT threads with Solaris threads at usenix.org.

Here's a quote from that paper:

The poor performance of NT's mutex was directly attributed to its implementation. NT's mutex is a kernel object that has many security attributes that are used to secure its global status. NT's critical section is a simple user object that only calls the kernel when there is contention and a thread must either wait or awaken. Thus, its stronger performance was due to the elimination of the overhead associated with the global mutex.

Of those mechanisms available under NT (pre-Vista), I chose the critical section because it is the cheapest for the no-contention case, as it avoids a call into the kernel. This means negligible impact on normal, non-shared variables.
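
For reference, the usage pattern from C is trivial (a sketch):

    /* Sketch of the Win32 critical-section pattern: in the uncontended
       case, EnterCriticalSection completes with a user-space interlocked
       test and never calls into the kernel. */
    #include <windows.h>

    static CRITICAL_SECTION cs;

    void increment_shared( long *counter ) {
        EnterCriticalSection( &cs );    /* no kernel call when uncontended */
        ++*counter;                     /* the protected operation         */
        LeaveCriticalSection( &cs );
    }

    /* One-time setup elsewhere: InitializeCriticalSection( &cs ); */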

I should also point out that critical sections are already used internally by win32 threaded builds of Perl. The problem is that user locks based around the pthreads cond_* APIs are implemented using NT semaphores, and the lamentable performance of locking on shared data in P5/iThreads is caused by the way (and the timing with which) these are used. In many cases, both mechanisms need to be deployed to achieve concurrency safety of shared variables, with the following impact:

    cmpthese -1, {
        shared           => q[ my %h : shared = 1..10; ++$h{ 1 } for 1 .. 1e4 ],
        shared_locked_hv => q[ my %h : shared = 1 .. 10; do{ lock( %h ); ++$h{ 1 } } for 1 .. 1e4 ],
        non_shared       => q[ my %h = 1 .. 10; ++$h{ 1 } for 1 .. 1e4 ],
    };;

                       Rate shared_locked_hv shared non_shared
    shared_locked_hv  145/s               --   -59%       -60%
    shared            353/s             144%     --        -1%
    non_shared        358/s             147%     1%         --

As you can see, the impact of the internal (critsec) locking on a shared hash compared to a non-shared hash is minimal at 1%. However, once you apply user locking to the shared hash, the impact goes up markedly with performance being more than halved. And remember, the above is a single-threaded test, so there can be no contention or wait states involved.

If the need for user-level locks, and/or the impact of them in the non-contention case could be alleviated, that would go a long way to removing the impact of threaded builds on non-threaded code.

... Just getting to the exception handler took many times more ticks.

Agreed. But again, you only transit into the exception handler in the event of contention on shared data. For the non-contention and non-shared cases, the exception handler is not invoked and so there is no impact at all.

Then doing the page table manipulations: those were/are expensive operations also.

Again, page table manipulations are only needed in the shared data case. In this case, at least one call into the kernel is required anyway, to acquire a mutex along with the (potential) wait. The impact of a call to reset the PAGE_GUARD attribute is fairly inconsequential to this path.
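
For concreteness, here is a minimal sketch of the guard-page pattern under discussion, assuming Win32 and the vectored exception handlers available from XP onwards (an illustration only, not perl's actual code):

    /* Sketch: trap the first access to a shared page via PAGE_GUARD.
       The guard attribute is cleared automatically when the exception
       fires; the handler synchronises, then the access is retried. */
    #include <windows.h>

    static LONG CALLBACK guard_handler( PEXCEPTION_POINTERS ep ) {
        if( ep->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION ) {
            /* ... acquire the lock protecting this page here ... */
            return EXCEPTION_CONTINUE_EXECUTION;    /* retry the access */
        }
        return EXCEPTION_CONTINUE_SEARCH;
    }

    void setup_and_rearm( void *page ) {
        DWORD old;
        AddVectoredExceptionHandler( 1, guard_handler );    /* one-time install */
        /* One kernel call to (re)arm the guard attribute: */
        VirtualProtect( page, 4096, PAGE_READWRITE | PAGE_GUARD, &old );
    }

Since the guard attribute is consumed by the first touch, the single VirtualProtect call to re-arm it is the only extra kernel transition on this path.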

Another memory attribute that might lend itself to the purpose is MEM_WRITE_WATCH, a VirtualAlloc allocation flag that has the OS record which pages get written to. NT implements COW memory through a related page attribute (PAGE_WRITECOPY), and a similar mechanism is used by *nix OSs to implement the famous COW sharing of data between forks, which appears to operate pretty efficiently.
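
A sketch of the write-watch mechanism, for the curious (Win32; error checks omitted):

    /* Sketch: allocate write-watch memory, then ask the OS which pages
       have been written since the last reset. */
    #include <windows.h>

    void watch_example( void ) {
        SIZE_T    size = 64 * 1024;
        PVOID     pages[ 16 ];
        ULONG_PTR count = 16;
        DWORD     granularity;
        char     *base;

        /* MEM_WRITE_WATCH asks the OS to track writes to these pages. */
        base = (char*)VirtualAlloc( NULL, size,
                                    MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                                    PAGE_READWRITE );
        base[ 0 ] = 1;      /* dirty the first page */

        /* Retrieve, and reset, the list of written-to page addresses. */
        GetWriteWatch( WRITE_WATCH_FLAG_RESET, base, size,
                       pages, &count, &granularity );
        /* 'count' now holds the number of dirtied pages listed in 'pages'. */
    }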

I hear what you are saying with regard to performance when the exception path is taken and page table manipulations are required, but relative to the performance impact of the existing locking mechanisms used by iThreads (which is huge), I think that it might not be so bad.

There were/are three motivations for thinking about this:

  1. The P5/iThreads synchronisation mechanism has a measurable impact upon non-threaded code using non-shared data.

    From memory, Perl 5.8 built with threads is around 15% slower than without (I've found other references that say 30%). This constitutes a very real and concrete hook for the no-threads brigade to hang their hats on. I believe a good part of the performance impact on non-threaded, non-shared code is attributable to the need to test whether locking is required.

    If the need for locking were detected through memory attributes and an exception handler, there would be no impact when the handler is not invoked, other than the one-time installation of the handler, which is minimal.

  2. The removal of (some of) the need for the Perl programmer to handle locking explicitly.

    Because of perl's fat data structures, even a simple scalar is a multi-field struct that can require internal writes for Perl-level read-only accesses, so it is necessary for perl to protect the internals from concurrent access at all times.

    Many Perl-level accesses of shared scalars that intuitively do not require synchronisation actually do require it.

    E.g. pre-incrementing a shared variable (++$shared;) does require user-level locking.

    As that pre-increment could require the conversion of a string to an integer, perl's internals already have to synchronise around that access. For these 'simple', short-lived locking requirements, it seems to me that it ought to be possible to extend the internally (required) locks around the user code and remove the need for additional (and duplicated?) locking; a sketch of the idea follows this list.

    As it is clear internally which opcodes are mutating (in the user sense), applying transparent locking at the opcode (tree) level seems to make sense. As has been shown above, this is not a panacea for the requirement for user locks, but I still believe it could go some way towards reducing the user-level requirement and also the overhead.

    User-level locks, cond_signal, cond_wait etc. are implemented using NT semaphores, which always require a call into the kernel, whereas the internal synchronisation is based around critical sections. If the latter could be utilised to remove some of the need for the former, there is a potential performance win as well as a possible simplification benefit.

  3. If it could be demonstrated that some of this is implementable and has positive benefit in Perl 5, it might encourage the Parrot guys to take it more seriously in the design of threading there.
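
Here is the sketch promised in point 2: plain C with invented names (shared_scalar, preinc), not perl's actual internals. The internal lock that must protect the string-to-number conversion is simply widened to cover the user-visible increment, so no second, user-level lock is required:

    /* Hypothetical illustration: the lock perl's internals must hold
       anyway (here a critical section) is widened to cover the whole
       user-visible operation. Init and error checks omitted. */
    #include <stdlib.h>
    #include <windows.h>

    typedef struct {
        CRITICAL_SECTION cs;    /* the internal lock needed anyway      */
        int   is_number;        /* has the string form been converted?  */
        char *string_form;
        long  number_form;
    } shared_scalar;

    void preinc( shared_scalar *sv ) {
        EnterCriticalSection( &sv->cs );    /* needed for the conversion...    */
        if( !sv->is_number ) {
            sv->number_form = atol( sv->string_form );
            sv->is_number   = 1;
        }
        ++sv->number_form;                  /* ...now also covers the increment */
        LeaveCriticalSection( &sv->cs );
    }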

So, whilst I am sure that you are correct in saying that transfers into exception handlers and page table manipulations are relatively expensive, when you look at the costs involved in the current situation, they may not be as expensive, relatively speaking, as first appears; i.e. when you are no longer comparing them with assembler-level, user-space-only bit-test-and-reset opcodes. Assuming that is what you were comparing them against.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^3: A faster?, safer, user transparent, shared variable "locking" mechanism.
by roboticus (Chancellor) on Nov 02, 2006 at 12:07 UTC
    BrowserUk

    ... you can execute a simple locking construct in just a few ticks.

    I'm interested to know what kind of "simple locking construct" you are alluding to?

    At the assembler level, there are the bit-test-and-reset opcodes (BTR) and other cheap (intra-process-only) mutex mechanisms that can be utilised, but these are not directly available to the C programmer and often cannot be used.

    Many C/C++ compilers provide an _asm keyword (or equivalent) to give access to the various atomic instructions you can use for mutexes and such. However, I'll grant you that one and simply postulate that some kind author provided a library routine to do the job (and give away a few ticks for stack frame creation/teardown). Then I'll generously grant you the same facility to provide your C/C++ program the ability to run the instructions to alter the page tables and such.

    And while you correctly point out that my initial response concentrated too much on the timing consequences of where the exception was taken, there are two other aspects that I should've mentioned. First: the page table instructions are slow, too. IIRC, they're slower than the simple locks. Second: even if they've gotten faster, they're privileged instructions. So in order for Perl to use them, it has to switch to kernel mode to execute the instructions, then switch back to user mode to run. Two context switches are expensive to pay for every variable modification, even if the exception never triggers.

    Adding in the fact that you'd have to find a way for Perl to get the permissions to do so for all platforms concerned is just another problem to deal with...

    ... Just getting to the exception handler took many times more ticks.

    Agreed. But again, you only transit into the exception handler in the event of contention on shared data. For the non-contention and non-shared cases, the exception handler is not invoked and so there is no impact at all.

    As I read your original post, I thought you were suggesting that you would get the exception each time you write to the variable, rather than each time you get contention for it. For a read-only (or read-nearly-always) data structure, I can see where this could be a win. But for a frequently-written value, I don't see how you can get the speed you need with the requirement to have two context switches for each write.

    I hear what you are saying with regard to performance when the exception path is taken and page table manipulations are required, but relative to the performance impact of the existing locking mechanisms used by iThreads (which is huge), I think that it might not be so bad.
    You may well be correct there ... I don't know what iThreads uses for locking. Does Perl use iThreads?

    So, whilst I am sure that you are correct in saying that transfers into exception handlers and page table manipulations are relatively expensive, when you look at the costs involved in the current situation, they may not be as expensive, relatively speaking, as first appears; i.e. when you are no longer comparing them with assembler-level, user-space-only bit-test-and-reset opcodes. Assuming that is what you were comparing them against.
    Well, it's clear that you've put far more thought into it than I have, so I'm willing to be proven wrong. (Actually, since I'm a speed junkie, I'd *love* it! A faster Perl would be a boon for all.)

    And, yes, I was relying on the simple user-space locks, assuming that you're sharing the data with other threads in the same program...

    Interesting ... I'll have to chew on it for a while and see what I come up with.....

    roboticus

      yes, I was relying on the simple user-space locks, assuming that you're sharing the data with other threads in the same program...

      This is an excellent point. And one I've visited before.

      A couple of years ago, I posed a similar solution to the Parrot team: using purely user-space locking, rather than OS/kernel-provided locking mechanisms, for thread synchronisation. Back then, the idea was dismissed out of hand.

      I accepted this at that time on the basis that Parrot has to run everywhere--including lots of exotic architectures (SMP, NUMA et al.), on many different processors, and under many different OSs; stuff which I simply wasn't conversant with or able to get up to speed on--so maybe there were places where user-space locking based around atomic instructions accessing shared, user-space memory simply wasn't possible.

      Your bringing this up again has rekindled that spark of a question, and I'm now wondering whether there are really any OS/architecture/CPU combinations that would be possible targets for Parrot and capable of running threads, yet don't provide the necessary primitives to support user-space-only locks? That's a difficult question to answer and even harder to prove.

      There is also the rather simple expedient of hiding the locking behind macros (or preferably functions, given my experience of trying to unravel Perl5 macros when things don't work). On those platforms where user-space locks are either not available or not easily provided, the macros fall back to OS-provided mechanisms.
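
      As a sketch of that shim, with invented names: HAS_USERSPACE_LOCKS is a hypothetical configure probe, and spin_acquire/spin_release stand for an atomic-op spinlock like the one sketched earlier in the thread:

          /* Sketch of the macro shim: one locking vocabulary, with the
             implementation chosen per platform at build time. */
          #ifdef HAS_USERSPACE_LOCKS
          #  define SHARED_LOCK( l )    spin_acquire( l )        /* user-space */
          #  define SHARED_UNLOCK( l )  spin_release( l )
          #else
          #  include <pthread.h>
          #  define SHARED_LOCK( l )    pthread_mutex_lock( l )  /* OS fallback */
          #  define SHARED_UNLOCK( l )  pthread_mutex_unlock( l )
          #endif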

      Thinking about this has resurrected another earlier notion that fell by the wayside as a result of having to defend the overall proposals I put forward, simultaneously against "Redmond just took a unix idea and renamed it" and "that's a Redmond-only solution" :(

      The idea is to use a pool of mutexes keyed by the address of the data they are protecting, reusing them cyclically. This was something I used years ago to improve the granularity of locking whilst avoiding the memory inflation of one mutex per shared data item. It worked well back then and was very tunable.
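
      A minimal sketch of that scheme, assuming Win32 critical sections (the pool size, the shift, and the hash are all tuning knobs):

          /* Sketch: a fixed pool of locks, selected by hashing the address
             of the data being protected. Many addresses share each lock,
             trading some false contention for bounded memory use. */
          #include <windows.h>

          #define POOL_SIZE 64    /* tunable */

          static CRITICAL_SECTION pool[ POOL_SIZE ];

          static CRITICAL_SECTION *lock_for( void *addr ) {
              /* Drop the low (alignment) bits, then fold into the pool. */
              return &pool[ ( (UINT_PTR)addr >> 4 ) % POOL_SIZE ];
          }

          void update_shared( long *data ) {
              CRITICAL_SECTION *cs = lock_for( data );
              EnterCriticalSection( cs );
              ++*data;
              LeaveCriticalSection( cs );
          }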

      Hm. Despite the amount of thought I've already expended on this, I can see it occupying a lot more. Trouble is, most of it will probably be wasted time as nothing I say is likely to have much influence.

      I'd be more than interested to hear the results of your further deliberations, even if it's only academic interest :)


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I'd definitely pursue user-space locking, for speed if nothing else.

        /me wonders why the Parrotistas dismissed the idea....

        I'm not terribly familiar with processors other than x86 and a few 8-bitters, but I've lurked on the linux kernel mailing list for a couple of years. I've not heard of any of the locking techniques in the kernel relying on being executed in a protected context--it seems that they normally rely on atomic operations. (Though there were a couple of interesting threads a few years back that talked about "memory barriers" ... perhaps that might be related. The memory barriers were for enforcing that prior R/W operations actually made it into memory before the next series of R/W operations.)
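
        For illustration, the GCC atomic builtins of that era bundle the barrier semantics into the atomic operation itself (a sketch; the __sync_* builtins appeared around GCC 4.1):

            /* Sketch: a user-space spinlock from GCC's atomic builtins.
               __sync_lock_test_and_set is an acquire barrier and
               __sync_lock_release is a release barrier, so the protected
               reads/writes cannot leak out of the critical region. */
            static volatile int lock = 0;

            void acquire( void ) {
                while( __sync_lock_test_and_set( &lock, 1 ) )
                    ;   /* spin until the previous value was 0 */
            }

            void release( void ) {
                __sync_lock_release( &lock );   /* store 0, release semantics */
            }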

        My suspicion is that doing locking in a protected context is normally done merely to ensure that the actual mutexes/condition variables/etc. are in a protected area so an errant memcpy() can't lock everyone out of the system.

        WRT NUMA, etc., I would imagine that the cache coherency in the system would protect you from most multi-cpu interactions--as long as you keep your mutex in the same cache line as the value or handle of the variable you're protecting.

        I don't think it's wasted time to ruminate on these issues. It sharpens the brain for other tasks. And sometimes you come up with the right idea and make something better. If only I could find my ASM backups ... I could dig up a few routines I used in a robotics job and dust 'em off and play with 'em a bit. Ah, well...

        --roboticus