
roboticus, thanks for your response on the technical limitations of the idea. It's much appreciated.

... you can execute a simple locking construct in just a few ticks.

I'm interested to know what kind of "simple locking construct" you are alluding to.

At the assembler level, there are the bit test and reset (BTR) opcodes and other cheap (intra-process-only) mutex mechanisms that can be utilised, but these are not directly available to the C programmer and often cannot be used.

There is a rather old, but still interesting, article comparing NT threads with Solaris threads at usenix.org.

Here's a quote from that paper:

The poor performance of NT's mutex was directly attributed to its implementation. NT's mutex is a kernel object that has many security attributes that are used to secure its global status. NT's critical section is a simple user object that only calls the kernel when there is contention and a thread must either wait or awaken. Thus, its stronger performance was due to the elimination of the overhead associated with the global mutex.

Of the mechanisms available under NT (pre-Vista), I chose the critical section because it is the cheapest for the no-contention case, as it avoids a call into the kernel. This means negligible impact on normal, non-shared variables.

I should also point out that critical sections are already used internally by Win32 threaded builds of Perl. The problem is that user locks based around the pthreads cond_* APIs are implemented using NT semaphores, and the cause of the lamentable performance of locking on shared data in P5/iThreads is the way (and timing) of how these are used. In many cases, both mechanisms need to be deployed to achieve concurrency safety of shared variables, with the following impact:

# Modules the snippet relies on (assumed to have been preloaded in the original session)
use threads;
use threads::shared;
use Benchmark qw[ cmpthese ];

cmpthese -1, {
    shared => q[
        my %h : shared = 1 .. 10;                 ++$h{ 1 }   for 1 .. 1e4
    ],
    shared_locked_hv => q[
        my %h : shared = 1 .. 10; do{ lock( %h ); ++$h{ 1 } } for 1 .. 1e4
    ],
    non_shared => q[
        my %h          = 1 .. 10;                 ++$h{ 1 }   for 1 .. 1e4
    ],
};

                  Rate shared_locked_hv           shared       non_shared
shared_locked_hv 145/s               --             -59%             -60%
shared           353/s             144%               --              -1%
non_shared       358/s             147%               1%               --

As you can see, the impact of the internal (critsec) locking on a shared hash compared to a non-shared hash is minimal at 1%. However, once you apply user locking to the shared hash, the impact goes up markedly with performance being more than halved. And remember, the above is a single-threaded test, so there can be no contention or wait states involved.

If the need for user-level locks, and/or the impact of them in the non-contention case could be alleviated, that would go a long way to removing the impact of threaded builds on non-threaded code.
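For readers less familiar with the user-level primitives being discussed, here is a minimal sketch of lock, cond_wait and cond_signal from threads::shared. The helper name handoff and the variable $ready are made up for illustration; this is not code from the benchmark above, and it needs a perl built with ithreads.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical helper: hand a flag from the main thread to a waiting thread.
sub handoff {
    my $ready :shared = 0;

    my $waiter = threads->create( sub {
        lock( $ready );                      # user-level lock on the shared scalar
        cond_wait( $ready ) until $ready;    # releases the lock while waiting
        return "saw ready = $ready";
    } );

    {
        lock( $ready );                      # released at the end of this block
        $ready = 1;
        cond_signal( $ready );               # wake one waiter, if any
    }

    return $waiter->join;
}

print handoff(), "\n";
```

Every one of these calls is a user-level lock operation of the kind measured by the shared_locked_hv case, even when, as here, there is little or no real contention.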

... Just getting to the exception handler took many times more ticks.

Agreed. But again, you only transit into the exception handler in the event of contention on shared data. For the non-contention and non-shared cases, the exception handler is not invoked and so there is no impact at all.

Then doing the page table manipulations: those were/are expensive operations also.

Again, page table manipulations are only needed in the shared data case. In this case, at least one call into the kernel is required anyway, to acquire a mutex along with the (potential) wait. The impact of a call to reset the PAGE_GUARD attribute is fairly inconsequential to this path.

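Another memory attribute that might lend itself to the purpose is MEM_WRITE_WATCH, a VirtualAlloc allocation flag that makes the system track which pages of a region have been written to. It is related in spirit to the page-protection mechanism (PAGE_WRITECOPY) behind NT's COW memory. A similar mechanism is used by *nix OSs to implement the famous COW sharing of data between forks, which appears to operate pretty efficiently.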

I hear what you are saying with regard to performance when the exception path is taken and page table manipulations are required, but relative to the performance impact of the existing locking mechanisms used by iThreads (which is huge), I think that it might not be so bad.

There were/are three motivations for thinking about this:

1. The P5/iThreads synchronisation mechanism has a measurable impact upon non-threaded code using non-shared data.
   From memory, Perl 5.8 built with threads is around 15% slower (I've found other references that say 30%) than without. This constitutes a very real and concrete hook for the no-threads brigade to hang their hats on. I believe a good part of the performance impact on non-threaded, non-shared code is attributable to the need to test whether locking is required.
   If the need for locking were detected through memory attributes and an exception handler, then it would have no impact when the handler is not invoked, other than the installation of the handler, which is minimal.

2. The removal of (some of) the need for the Perl programmer to have to handle locking explicitly.
   With perl's fat data structures, where even a simple scalar is a multi-field struct that can require internal writes for Perl-level read-only accesses, it is necessary for perl to protect the internals from concurrent access at all times.
   Many Perl-level accesses of shared scalars that intuitively do not require synchronisation actually do require it.
   E.g. pre-incrementing a shared variable, ++$shared; does require user-level locking.
   As that pre-increment could require the conversion of a string to an integer, perl's internals already have to synchronise around that access. For these 'simple', short-lived locking requirements, it seems to me that it ought to be possible to extend the internally (required) locks around the user code and remove the need for additional (and duplicated?) locking.
   As it is clear internally which opcodes are mutating (in the user sense), applying transparent locking at the opcode (tree) level seems to make sense. As has been shown above, this is not a panacea to the requirement for user locks, but I still believe it could go some way to reducing the user-level requirement and also the overhead.
   User-level locks, cond_signal, cond_wait etc. are implemented using NT semaphores, which always require a call into the kernel, whereas the internal synchronisation is based around critical sections. If the latter could be utilised to remove some need for the former, there is a potential performance win as well as a possible simplification benefit.

3. If it could be demonstrated that some of this is implementable and has positive benefit in Perl 5, it might encourage the Parrot guys to take it more seriously in the design of threading there.
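To make the second motivation concrete, here is a minimal sketch of the explicit locking currently needed around ++ on a shared scalar. The helper name run_counters and its parameters are invented for illustration, and the example needs a perl built with ithreads.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical helper: $nthreads threads each increment one shared counter $iters times.
sub run_counters {
    my ( $nthreads, $iters ) = @_;
    my $count :shared = 0;

    my @workers;
    for my $id ( 1 .. $nthreads ) {
        push @workers, threads->create( sub {
            for ( 1 .. $iters ) {
                lock( $count );    # user-level lock, released at the end of each iteration
                ++$count;          # read-modify-write on the shared scalar
            }
            return;
        } );
    }

    $_->join for @workers;
    return $count;
}

print run_counters( 4, 10_000 ), "\n";    # prints 40000
```

The lock( $count ) line is the part that transparent, opcode-level locking would aim to make unnecessary for simple mutating operations like this one; without it, the increments from different threads are not guaranteed to all be counted.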

So, whilst I am sure that you are correct in saying that transfers into exception handlers and page table manipulations are relatively expensive, when you look at the costs involved in the current situation they may not be as expensive, relatively speaking, as first appears; i.e. when you are no longer comparing them with assembler-level, user-space-only bittest&reset opcodes. Assuming that is what you were comparing them against.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^2: A faster?, safer, user transparent, shared variable "locking" mechanism. by BrowserUk
in thread A faster?, safer, user transparent, shared variable "locking" mechanism. by BrowserUk
