Re: A faster?, safer, user transparent, shared variable "locking" mechanism.
by ikegami (Patriarch) on Oct 26, 2006 at 07:48 UTC
Such a mechanism cannot replace user-level locks. (By that I mean something like threads::shared's lock().)
- It doesn't allow the locking to span more than one op, so operations which both read and write to a variable (such as $s = $s + 1) still result in a race condition. (A runnable Perl sketch of this race follows the trace below.)
$s = $s + 1
+=====================================================+
|                         CPU                         |
+--------------------------+--------------------------+
|         thread 1         |         thread 2         |
+==========================+==========================+
| - ...                    |                          |
| - copy $s into temp      |                          |
|   - $s is locked         |                          |
|   - copy $s into temp    |                          |
|   - $s is unlocked       |                          |
| - add one to temp        |                          |
+--------------------------+--------------------------+
|                          | - ...                    |
|                          | - copy $s into temp      |
|                          |   - $s is locked         |
|                          |   - copy $s into temp    |
|                          |   - $s is unlocked       |
|                          | - add one to temp        |
|                          | - assign temp to $s      |
|                          |   - $s is locked         |
|                          |   - assign temp to $s    |
|                          |   - $s is unlocked       |
|                          | - ...                    |
+--------------------------+--------------------------+
| - assign temp to $s      |                          |
|   - $s is locked         |                          |
|   - assign temp to $s    |                          |
|   - $s is unlocked       |                          |
| - ...                    |                          |
+==========================+==========================+
=> $s is wrong <=
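For anyone who wants to see it happen, here is a minimal Perl sketch of the same race, and of the user-level lock() that the proposed mechanism cannot replace (the thread and iteration counts are arbitrary):

    use threads;
    use threads::shared;

    my $s : shared = 0;

    my @workers = map {
        threads->create( sub {
            for ( 1 .. 10_000 ) {
                lock( $s );     # remove this line and updates can be lost
                $s = $s + 1;    # read-modify-write spanning several ops
            }
        } );
    } 1 .. 4;

    $_->join for @workers;
    print "$s\n";    # 40000 with the lock; usually something less without it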
- The lock is only active for the duration of the write, so it can't be used to control access to multiple variables using a single lock variable. (The conventional single-lock idiom it cannot replace is sketched below.)
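A minimal sketch of that conventional idiom, for contrast (the variable names are illustrative):

    use threads;
    use threads::shared;

    my $guard : shared;    # one lock variable...
    my $key   : shared;
    my $value : shared;    # ...protecting a pair of related variables

    {
        lock( $guard );
        $key   = 'colour';
        $value = 'blue';
    }

This only works, of course, because every piece of code touching $key and $value agrees to take $guard first; a lock that exists only for the duration of a single write cannot provide that guarantee.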
- Accessing a shared variable would halt all threads (in our application, or in every application?), even if they are running code unrelated to the variable being accessed.
- Unless assignments are made atomic by a finer, hidden locking mechanism, read accesses would also need to lock the SV. For example, copying a PVIV into a non-shared SV would require reading the IV, PV and PV length atomically. However, this is easily fixable by extending the protection to catch all accesses, not just write accesses.
Nor would such a mechanism lend itself to protecting the internal consistency of SVs, even if it were used internally by Perl in addition to user-level locks.
- It doesn't allow the locking to span more than one SV field access, so operations which both read and write to a field (such as refcount = refcount + 1) still result in a race condition.
\$s
+=====================================================+
|                         CPU                         |
+--------------------------+--------------------------+
|         thread 1         |         thread 2         |
+==========================+==========================+
| - ...                    |                          |
| - read $s's refcount     |                          |
|   - $s is locked         |                          |
|   - refcount is read     |                          |
|   - $s is unlocked       |                          |
| - add one                |                          |
+--------------------------+--------------------------+
|                          | - ...                    |
|                          | - read $s's refcount     |
|                          |   - $s is locked         |
|                          |   - refcount is read     |
|                          |   - $s is unlocked       |
|                          | - add one                |
|                          | - write $s's refcount    |
|                          |   - $s is locked         |
|                          |   - write $s's refcount  |
|                          |   - $s is unlocked       |
|                          | - ...                    |
+--------------------------+--------------------------+
| - write $s's refcount    |                          |
|   - $s is locked         |                          |
|   - write $s's refcount  |                          |
|   - $s is unlocked       |                          |
| - ...                    |                          |
+==========================+==========================+
=> refcount is wrong <=
- The lock is only active for the duration of the write, so it can't be used to control access to multiple fields (such as PV and its length). Most operations (including operations as basic as an assignment) require access to multiple fields atomically.
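The fields in question are easy to see from the Perl level with Devel::Peek; a sketch (output abridged, and the addresses, LEN and exact flags vary by build):

    use Devel::Peek;

    my $s = "hello";
    Dump( $s );
    # SV = PV(0x...) at 0x...
    #   REFCNT = 1
    #   FLAGS = (POK,pPOK)
    #   PV = 0x... "hello"\0
    #   CUR = 5          <- the length lives in a different field from the buffer
    #   LEN = ...

Even a plain assignment has to update the PV pointer, CUR and the flags together, so a lock held only for a single field access cannot keep them consistent.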
A few other issues I can't answer come to mind.
- Would the opcode execute at a higher privilege level? If so, problems in Perl and in XS become more dangerous and harder to debug.
- Would this mechanism work on a multi-core or a multi-processor system?
- When in the protection violation handler, are only perl's threads frozen, or are all threads on the system frozen?
Update: I initially stated "the mechanism would lend itself well to protecting the internal consistency of SVs if it's used internally by Perl in addition to user-level locks", but I reversed my position immediately after posting. I don't know what I was thinking.
Update: Added examples. They are hidden behind spoiler tags to avoid cluttering up the post. The examples assume both read and write accesses (as opposed to just write accesses) lock the SV (which solves problems without adding any).
Critical sections only affect a single process.
Threads not attempting to enter the critical section are not blocked.
Only threads of the same process will ever be blocked, and then only if they try to enter a critical section that has already been acquired by another thread of that process running concurrently on another CPU, since no other thread can run on this CPU until the critical section is exited.
Blocking preemption of a thread sounds like it could have a noticeable effect on other processes within the system, but at worst (i.e. if the thread entered the critical section just before it was about to be preempted) it would only extend that thread's timeslice by a hundred or so clock cycles.
On average, the critical section would be entered at the beginning or in the middle of the timeslice (and possibly several times), rather than at the end, and the minuscule extension to the timeslice when it did occur at the end would be negligible.
Swapping to a PAGE_GUARD/STATUS_GUARD_PAGE exception, rather than PAGE_READONLY/ACCESS_VIOLATION, would allow both reads of and writes to the shared variables to trigger it, and would also remove one step from the eight I listed, as the PAGE_GUARD attribute is cleared automatically.
The structured exception mechanism operates on a per-thread basis; naively, it is similar to throw/catch.
I see no reason why any privilege changes would occur.
It would work on multi-core/processor systems (if it worked at all!).
Now for the $a = $a + 1; scenario (and all the others like it). The opcodes (perl5) representing that operation look like:
BINOP (0x191da14) sassign
    BINOP (0x191da38) add [3]
        UNOP (0x191da7c) null [15]
            PADOP (0x191da9c) gvsv  GV (0x1824348) *a
        SVOP (0x191da5c) const [4]  IV (0x22509c) 1
    UNOP (0x191dabc) null [15]
        PADOP (0x191dadc) gvsv  GV (0x1824348) *a
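(For reference, a dump in that shape can be reproduced with the B::Terse compiler backend, e.g.

    perl -MO=Terse -e '$a = $a + 1'

though the addresses and bracketed numbers will, of course, differ from run to run.)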
If the critical section (which costs almost nothing to enter (no transfer to kernel mode) when it is not held by another thread of the same process) is entered at the appropriate point in the opcode tree (in the above case, the sassign), then all the descendant opcodes are also completed under the auspices of the critical section, and so the two gvsvs on *a are performed atomically.
The exception handling mechanism (~throw) would be wrapped around each (mutating) opcode, and the handler (~catch) would be specific to that opcode. Normal operations on non-shared variables never raise the exception, so they cost nothing extra.
The critical section would be entered from the catch code, and only when the exception occurs.
This implies that effectively all accesses to all shared data are serialised across all threads within the process. However, since each critical section will always be entered and exited within a single (possibly marginally extended) timeslice, the longest any thread would have to wait to enter a critical section is 1 timeslice * the number of threads in the process. This longest delay(*) would only occur if every thread tried to perform a mutating opcode on shared data concurrently, each on a separate processor. The average case would be much, much less.
Does it go anywhere?
(*) The longest-wait calculation is not strictly correct. Threads are non-deterministic, so there is no guarantee that a thread blocked waiting for entry to a critical section would receive that entry before another thread that had previously also been waiting and, having already received a timeslice, managed to re-enter the wait state and again be selected.
But this is true for all wait states and in practice does not cause a thread to be starved of CPU over the long term (1 or 2 seconds). Besides which, many if not all OSs have mechanisms that temporarily, for one timeslice, boost the priority of a thread that isn't getting CPU, causing it to get preferential selection on the next round robin.
Also, other processes' threads are more likely to be selected than another thread from this process; this is the norm. The time spent in the threads of other processes (on other CPUs, as this one is prevented from being preempted while it is in a critical section) reduces the likelihood that another thread from this process will be able to run concurrently on that other CPU, and so it also reduces the likelihood that another thread will be blocked from running by this thread holding the critical section.
This stuff is all pretty well tuned, so ignoring the effects of the non-determinism in such calculations is the norm. Over time, the randomness cancels out and the average situation prevails.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Ah, you mean to hold the lock for all the child opcodes as well. In that case, the Anonymous Monk already answered
- Sometimes the scope of the CS is too big.
$shared .= <PIPE>;
would need to be written as
my $local = <PIPE>;
$shared .= $local;
- Sometimes the scope is too small.
# XXX. The key and value are related.
my $string = $shared_key . ' => ' . $shared_value;
won't work without explicit locking; a sketch of the lock the user would still need follows.
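Something like this, where the lock variable and the convention of taking it around every key/value update are things the programmer has to invent and remember:

    my $pair_lock : shared;

    my $string;
    {
        lock( $pair_lock );    # keep the key and value consistent with each other
        $string = $shared_key . ' => ' . $shared_value;
    }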
As you can see, the user still needs to be fully aware of the locking mechanism and must sometimes compensate for it. The only thing accomplished is the removal of the visual cues that something special is happening.
And that's assuming you can figure out where to end the critical section. Consider the following:
- $shared = $shared + 1;
- $local = $shared + 1;
- my $ref = \$shared;
$$ref = $shared + 1;
Re: A faster?, safer, user transparent, shared variable "locking" mechanism.
by Anonymous Monk on Oct 26, 2006 at 07:31 UTC
These will only occur when writing.
The problem is that many apparent read accesses in Perl actually do modify the SV. Some innocent-looking operations (in all examples, $ro is apparently read-only):
print $ro + 3; # Might upgrade a PV to a PVIV;
$ref = \$ro; # Increments the refcount on $ro.
$ro =~ /./; # Sets the pos associated with $ro.
$var = $ro{foo}{bar}; # May autovivify $ro{foo}
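One of those hidden writes can be watched from Perl with Devel::Peek; a sketch (the exact output differs between perl versions):

    use Devel::Peek;

    my $ro = "42";
    Dump( $ro );            # SV = PV: just a string buffer, POK set
    print $ro + 3, "\n";    # 'read-only' use of $ro in numeric context
    Dump( $ro );            # typically SV = PVIV now, with an IV slot and IOK set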
The fact that there are hidden data modifications is the reason that in Perl variables are by default not shared between threads.
Note also that the SV* contains only some metadata about the value; the real data is some hops away. Even something as simple as a string, with no magic attached, will have its data scattered across three places: the SV* itself (containing the refcount, some flags, and a pointer to a body structure), that body (the svpv), which contains, among other pieces of data, the length of the string, and the actual string buffer it points to.
The problem is that many apparent read accesses in Perl actually do modify the SV
That's not a problem. The proposed locking mechanism doesn't care about appearances. If the operation involves a write, the write will trigger the locking mechanism.
It does fail for other reasons, though.
Exactly. So, the system locks the SV without the programmer being aware of it. Not a problem if one thread does so, but then, no lock would have been necessary. But if two threads try to write at once, an exception will be raised. And this all happens behind the programmer's back: the programmer thinks she's only doing a read access to a variable (and on a Perl level, she does), but the value is locked anyway.
For a programmer to defend against this, she needs to know all the details of the perl internals. Which, IMO, is not a good thing. (I'm not saying it's a bad thing if more people know the internals of perl - I'm saying it's a bad thing that in order to write a good program, one has to know the internals of perl).
If shared SV*s ( and all things pointed at from them ) are allocated in the protected memory, then any write access, including those done by the system as the result of apparently read-only user code, would also trigger the exception and so be protected.
Re: A faster?, safer, user transparent, shared variable "locking" mechanism.
by roboticus (Chancellor) on Nov 01, 2006 at 00:03 UTC
BrowserUk:
I see quite a few interesting perl-internals and computer science reasons discussed as to why that would be difficult (if not impossible) to do well. However, I hadn't seen any x86 arguments.
I used to be an x86 assembly programmer, so here's my take on it. (Note: my last asm programming was for a Pentium Pro, so the newer chips may have much-improved timings. Though I doubt that it would be enough to help...)
In those days, at any rate, you can execute a simple locking construct in just a few ticks. Just getting to the exception handler took many times more ticks. Then doing the page table manipulations: those were/are expensive operations also. So even if you are able to solve the other issues raised ... it would be slower rather than faster. It could make for shorter, simpler code for the user ... but timing wise, it's a dog.
:roboticus
roboticus: Thanks for your response on the technical limitations of the idea. It's much appreciated.
... you can execute a simple locking construct in just a few ticks.
I'm interested to know what kind of "simple locking mechanism" you are alluding to?
At the assembler level, there are the bittest&reset opcodes (BTR), and other cheap (intra-process-only) mutex mechanisms that can be utilised, but these are not directly available to the C programmer and can often not be used.
There is a rather old, but still interesting, article comparing NT threads with Solaris threads at usenix.org.
Here's a quote from that paper:
The poor performance of NT's mutex was directly attributed to its implementation. NT's mutex is a kernel object that has many security attributes that are used to secure its global status. NT's critical section is a simple user object that only calls the kernel when there is contention and a thread must either wait or awaken. Thus, its stronger performance was due to the elimination of the overhead associated with the global mutex.
Of those mechanisms available under NT (pre-Vista), I chose the critical section because it is the cheapest for the no-contention case, as it avoids a call into the kernel. This means negligible impact on normal, non-shared variables.
I should also point out that critical sections are already used internally by win32 threaded builds of Perl. The problem is that user locks based around the pthreads cond_* APIs are implemented using NT semaphores, and the cause of the lamentable performance of locking on shared data in P5/iThreads is the way (and the timing) in which these are used. In many cases, both mechanisms need to be deployed to achieve concurrency safety of shared variables, with the following impact:
use threads;
use threads::shared;
use Benchmark qw[ cmpthese ];

cmpthese -1, {
    shared => q[
        my %h : shared = 1 .. 10;  ++$h{ 1 } for 1 .. 1e4;
    ],
    shared_locked_hv => q[
        my %h : shared = 1 .. 10;  do{ lock( %h ); ++$h{ 1 } } for 1 .. 1e4;
    ],
    non_shared => q[
        my %h = 1 .. 10;  ++$h{ 1 } for 1 .. 1e4;
    ],
};

                      Rate shared_locked_hv shared non_shared
    shared_locked_hv 145/s               --   -59%       -60%
    shared           353/s             144%     --        -1%
    non_shared       358/s             147%     1%         --
As you can see, the impact of the internal (critsec) locking on a shared hash compared to a non-shared hash is minimal at 1%. However, once you apply user locking to the shared hash, the impact goes up markedly with performance being more than halved. And remember, the above is a single-threaded test, so there can be no contention or wait states involved.
If the need for user-level locks, and/or the impact of them in the non-contention case could be alleviated, that would go a long way to removing the impact of threaded builds on non-threaded code.
... Just getting to the exception handler took many times more ticks.
Agreed. But again, you only transit into the exception handler in the event of contention on shared data. For the non-contention and non-shared cases, the exception handler is not invoked and so there is no impact at all.
Then doing the page table manipulations: those were/are expensive operations also.
Again, page table manipulations are only needed in the shared data case. In this case, at least one call into the kernel is required anyway, to acquire a mutex along with the (potential) wait. The impact of a call to reset the PAGE_GUARD attribute is fairly inconsequential to this path.
Another page attribute that might lend itself to the purpose is MEM_WRITE_WATCH. This is the basis of the NT implementation of COW memory. A similar mechanism is used by *nix OSs to implement the famous COW sharing of data between forks, which appears to operate pretty efficiently.
I hear what you are saying with regard to performance when the exception path is taken and page table manipulations are required, but relative to the performance impact of the existing locking mechanisms used by iThreads (which is huge), I think that it might not be so bad.
There were/are three motivations for thinking about this:
- The P5/iThreads synchronisation mechanism has a measurable impact upon non-threaded code using non-shared data.
From memory, performance of Perl 5.8 built with threads is around 15% slower (I've found other references that say 30%) than without. This constitutes a very real and concrete hook for the no threads brigade to hang their hats on. I believe a good part of the performance impact on non-threaded, non-shared code is attributable to the need to test whether locking is required.
If the need for locking were detected through memory attributes and an exception handler, then it would have no impact when the handler is not invoked, other than the installation of the handler, which is minimal.
- The removal of (some of) the need for the Perl programmer to have to handle locking explicitly.
With perl's fat data structures (even a simple scalar is a multi-field struct that can require internal writes for Perl-level read-only accesses), it is necessary for perl to protect the internals from concurrent access at all times.
Many Perl-level accesses of shared scalars that intuitively do not require synchronisation actually do require it.
E.g. pre-incrementing a shared variable (++$shared) does require user-level locking.
As that pre-increment could require the conversion of a string to an integer, perl's internals already have to synchronise around that access. For these 'simple', short-lived locking requirements, it seems to me that it ought to be possible to extend the internally required locks around the user code and remove the need for additional (and duplicated?) locking.
As it is clear internally which opcodes are mutating (in the user sense), applying transparent locking at the opcode (tree) level seems to make sense. As has been shown above, this is not a panacea for the requirement for user locks, but I still believe it could go some way to reducing the user-level requirement and also the overhead.
User-level locks, cond_signal, cond_wait, etc. are implemented using NT semaphores, which always require a call into the kernel, whereas the internal synchronisation is based around critical sections. If the latter could be utilised to remove some of the need for the former, there is a potential performance win as well as a possible simplification benefit.
- If it could be demonstrated that some of this is implementable and has positive benefit in Perl 5, it might encourage the Parrot guys to take it more seriously in the design of threading there.
So, whilst I am sure that you are correct in saying that transfers into exception handlers and page table manipulations are relatively expensive, when you look at the costs involved in the current situation they may not be as expensive, relatively speaking, as they first appear; ie. when you are no longer comparing them with assembler-level, user-space-only bittest&reset opcodes. Assuming that is what you were comparing them against.
BrowserUk:
... you can execute a simple locking construct in just a few ticks.
I'm interested to know what kind of "simple locking mechanism" you are alluding to?
At the assembler level, there are the bittest&reset opcodes (BTR), and other cheap (intra-process-only) mutex mechanisms that can be utilised, but these are not directly available to the C programmer and can often not be used.
Many C/C++ compilers provide an _asm keyword (or equivalent) to give access to the various atomic instructions you can use for mutexes and such. However, I'll grant you that one and simply postulate that some kind author provided a library routine to do the job (and give away a few ticks for stack frame creation/teardown). Then I'll generously grant you the same facility to provide your C/C++ program the ability to run the instructions to alter the page tables and such.
And while you correctly point out that my initial response did concentrate too much on the timing consequences of where the exception was taken, there are two other aspects that I should've mentioned. First: the page table instructions are slow, too. IIRC, they're slower than the simple locks. Secondly: even if they've gotten faster, they're privileged instructions. So in order for Perl to use them, it has to switch to kernel mode to execute the instructions, then switch back to user mode to continue. Two context switches are expensive to pay for every variable modification, even if the exception never triggers.
Adding in the fact that you'd have to find a way for Perl to get the permissions to do so for all platforms concerned is just another problem to deal with...
... Just getting to the exception handler took many times more ticks.
Agreed. But again, you only transit into the exception handler in the event of contention on shared data. For the non-contention and non-shared cases, the exception handler is not invoked and so there is no impact at all.
As I read your original post, I thought you were suggesting that you would get the exception each time you write to the variable, rather than each time you get contention for it. For a read-only (or read-nearly-always) data structure, I can see where this could be a win. But for a frequently-written value, I don't see how you can get the speed you need with the requirement to have two context switches for each write.
I hear what you are saying with regard to performance when the exception path is taken and page table manipulations are required, but relative to the performance impact of the existing locking mechanisms used by iThreads (which is huge), I think that it might not be so bad.
You may well be correct there ... I don't know what iThreads uses for locking. Does Perl use iThreads?
So, whilst I am sure that you are correct in saying that transfers into exception handlers and page table manipulations are relatively expensive, when you look at the costs involved in the current situation they may not be as expensive, relatively speaking, as they first appear; ie. when you are no longer comparing them with assembler-level, user-space-only bittest&reset opcodes. Assuming that is what you were comparing them against.
Well, it's clear that you've put far more thought into it than I have, so I'm willing to be proven wrong. (Actually, since I'm a speed junkie, I'd *love* it! A faster Perl would be a boon for all.)
And, yes, I was relying on the simple user-space locks, assuming that you're sharing the data with other threads in the same program...
Interesting ... I'll have to chew on it for a while and see what I come up with.....
roboticus