The reason I've rejected [RCU] for the iterator problem is that it isn't individual values changing that is the big danger; it is the state of the entire structure changing -- i.e. resizes. In order to use RCU, the iterator would have to take a complete copy of the hash
Well, I'd see the RCU route as: you continue to iterate over the old copy of the hash, ignoring updates that are being done in the new, resized copy of the hash. So the RCU iterator doesn't take a copy of the hash; it just references a copy of the hash. That is, iterators started before the resize are all iterating over the same (old) copy of the hash, while iterators started after the resize are all iterating over the same (new) copy. The iterator could even recheck against the current copy of the hash even though it is still iterating through an old copy (so it wouldn't use stale values or deleted keys, but would not see new keys).
The major benefit to the approach described in the video is that a huge hash can be resized while other updates continue to happen. But both approaches involve full copies of the hash existing at the same time (but not a full copy per iterator; I'd not do that with either approach).
Not that I'm trying to argue that you should use RCU, just pushing back lightly on one reason you stated for avoiding it. There are plenty of advantages.
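To make that a little more concrete, the shape I have in mind is a single "current generation" pointer that iterators snapshot and that a resize replaces with one atomic pointer swap. This is only a sketch under my own made-up names (hash_gen_t, iter_begin, hash_resize), not your structures, and it deliberately leaves open the hard part of deciding when the old generation may be freed:

#include <windows.h>

/* One complete generation of the hash. Iterators hold a pointer to the
   generation that was current when they started. (Illustrative names only.) */
typedef struct hash_gen {
    void *table;                    /* the buckets for this generation     */
} hash_gen_t;

typedef struct hash {
    hash_gen_t * volatile current;  /* readers snapshot this pointer       */
} hash_t;

/* Starting an iterator: snapshot the current generation. Everything the
   iterator touches from now on lives in that generation's table. */
hash_gen_t *iter_begin( hash_t *h ) {
    return h->current;
}

/* Resizing: build/populate the bigger table, then publish it with a single
   atomic pointer swap. Existing iterators keep walking the old generation;
   iterators (and lookups) started afterwards see the new one. */
void hash_resize( hash_t *h, hash_gen_t *bigger ) {
    hash_gen_t *old = (hash_gen_t*)InterlockedExchangePointer(
                          (void * volatile *)&h->current, bigger );
    /* The RCU part proper is deciding when 'old' can be freed: only after
       every iterator that snapshotted it has finished (grace period,
       refcount, or epoch scheme -- deliberately not shown here). */
    (void)old;
}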
__int64 _InterlockedCompareExchange64(__int64 volatile *, __int64, __int64)
Although I think that you are probably aware of this, the risk with this is that updates happen so frequently that you have to retry your CAS over and over, even an unbounded number of times (at least in theory). I've run into a lot of examples of people jumping through hoops to avoid such a problem, but I have yet to stumble upon somebody presenting good evidence that this actually became a practical problem when doing something on the order of a simple increment.
So I still consider several possible explanations for such hoop jumping as plausible:
- I just haven't seen the evidence, and CAS for incrementing a shared count can fairly easily become overly expensive.
- People are jumping through these hoops to avoid scaling problems with "obtain lock then increment" and they didn't try CAS on a single counter.
- People see real problems with too many retries via CAS for operations much more expensive than a simple increment and so avoid CAS on a single shared count due to those other scenarios, even though it hasn't been proved to lead to practical problems for a simple increment.
- People are jumping through these hoops as premature optimization simply out of fear that the number of retries could become problematically large.
- (Added soon after posting) Details of cache coherency between cores (or similar) are the real problem with frequent updates to a single shared count.
So I call out the risk because I see so many examples of people trying to avoid that risk, not because I'm convinced that it can be a real problem or that it is likely to be a real problem. If I had to predict whether it would be a practical problem, I'd predict that it would be very unlikely to be very significant. But I've managed to avoid needing such an approach so I wouldn't bet much money.
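For concreteness, the operation I keep weighing against all that hoop jumping is just the classic CAS retry loop built on the intrinsic quoted above. This is a minimal sketch (the function name and the bare 64-bit counter are mine, purely for illustration):

#include <intrin.h>

/* Increment a shared 64-bit counter via compare-and-swap. The loop retries
   whenever another thread changed the value between our read and our CAS;
   in theory it can spin an unbounded number of times under contention. */
void cas_increment( __int64 volatile *shared ) {
    __int64 seen, want;
    do {
        seen = *shared;        /* read the current value                 */
        want = seen + 1;       /* compute the value we want to install   */
    } while( _InterlockedCompareExchange64( shared, want, seen ) != seen );
}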
short _InterlockedIncrement16(short volatile *Addend)
The risk I've heard of with this is that, at least on some architectures, an interlocked increment is accomplished via a bus lock, and the bus lock can have (I've heard it claimed) a bigger negative impact on concurrency than a write lock would (at least if you are using this on a lot of shared counts, where contention for any single lock would be low while the frequency of bus locks would be high). The times I've used such, I haven't observed a significant drop in concurrency, but that doesn't mean it can't happen, of course.
Thanks for sharing your question, references, and plans. It has been quite interesting.
The major benefit to the approach described in the video is that a huge hash can be resized while other updates continue to happen. But both approaches involve full copies of the hash existing at the same time.
I get that; and hope I'll be able to utilise the technique for resizing.
(I'm not sure yet, because my hash uses nested hashes for the (collision) buckets, which in turn can contain nested hashes... So I can already be resizing at multiple levels, and I haven't yet wrapped my brain around doing that lock-free and on-the-fly!)
The problem I have with using this in conjunction with iterators is that the user might create an iterator, iterate the first few keys, and then simply abandon it, leaving it pointing into an old copy of the hash. Either I have to keep the old hash around until the iterator completes or is explicitly freed, or I risk leaving dangling iterators when a resized hash finally frees up the previous generation.
I have real problems seeing a clean way (or actually any way) of continuing an iterator, taken on the previous generation, once a resize has occurred. At some point the old generation will be done with -- by the on-the-fly resize operation -- and become eligible for freeing; but if there are active-but-abandoned (or just slow) iterators still traversing that old generation, freeing it will leave dangling pointers.
Deferring the free until all existing iterators complete or are explicitly closed could mean the old copies never go away; and I see no way to logically continue an iterator from a previous generation over to the later generation. Hence my decision to abort an iterator if a resize occurs, unless the user explicitly requests otherwise -- and be it on his head if he does so.
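Concretely, what I'm planning amounts to the iterator carrying the generation/version it was created under, with its 'next' refusing to continue once that no longer matches the hash. A rough sketch, with names and layout that are purely illustrative rather than my actual structures:

typedef unsigned __int64 U64;   /* stand-ins for the real mytypes.h types */
typedef unsigned int     U32;

/* Hypothetical shapes only: the hash bumps 'version' on every resize, and
   each iterator remembers the version it was created under. */
typedef struct { U64 volatile version; /* ...buckets etc... */ } lfhash_t;
typedef struct { lfhash_t *h; U64 version; U32 bucket, slot; } lfiter_t;

void lfiter_init( lfiter_t *it, lfhash_t *h ) {
    it->h = h; it->version = h->version; it->bucket = it->slot = 0;
}

/* Returns 0 and fills key/val while the hash is unchanged; returns -1 the
   moment a resize has happened, abandoning the iteration. */
int lfiter_next( lfiter_t *it, void **key, void **val ) {
    if( it->version != it->h->version ) return -1;   /* resized: abort     */
    /* ...otherwise advance it->bucket / it->slot and fill *key / *val...  */
    (void)key; (void)val;
    return 0;
}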
Not that I'm trying to argue that you should use RCU, just pushing back lightly on one reason you stated for avoiding it.
Thank you. It is exactly for this that I posted the OP. It is all too easy to arrive at a conclusion without really having explored the possibilities, and, without challenge, to go on ignoring them. For me, the very act of trying to write down my justifications for, and defence of, my decisions is the single best guarantee that I've really thought things through.
the risk with this is that updates happen so frequently that you have to retry your CAS over and over, even an unbounded number of times
Understood. I've tried very hard to generate a test case where this happens and found it impossible on modern hardware. See below.
With regard to people basing their fears of interlocking on benchmarks: the problem is that they all too often compare apples with oranges by running interlocked and non-interlocked tests in single-threaded test cases, and on unshared, uncontended data. I hope my benchmark below avoids those failings.
This risk I've heard of with this is that, at least on some architectures, an interlocked increment is accomplished via a bus lock and the bus lock can have (I've heard it claimed) a bigger negative impact on concurrency than a write lock would
Yes. This was definitely a problem on early multi-socket NUMA setups and even earlier generations of multi-core CPUs.
However, on more modern chips -- including my now dated (2007) Intel Q6600 -- an exclusive lock is applied to the cache-line containing the target of the interlocked operation, thus ensuring residency (and fast access) for the duration of the read-modify-write operations; whilst leaving access to the rest of memory unhindered by that lock.
(This also deals with your cache coherency problem above. Once the cache line has been populated the first time, it becomes locked in core, and the coherency is guaranteed until it is unlocked; and for highly active lines, the hardware's LRU caching algorithms ensure that it stays around even when it is unlocked.)
By way of getting some feel for the cost of interlocked increment vs. unlocked increment of shared state vs. increments of thread-local state, I wrote this test app:
#pragma warning(disable:4996)
#include <windows.h>
#include <winhttp.h>
#include <stdio.h>
#include <stdlib.h>
#include <process.h>
#include "mytypes.h"
#include "tpool.h"
#include "bitlock.h"
typedef struct {
U64 sem;
U32 n;
union {
U64 version;
struct { U16 counts[4]; };
};
} ARGS;
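/* worker0: baseline -- bumps one of 4 thread-local volatile U16s chosen at
   random; no sharing, no locking. */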
void worker0( void *arg ) {
ARGS *v = (ARGS*)arg;
volatile U16 dummy[ 4 ] = { 0 };
U32 i;
wait( &v->sem, 0, 1 );
for( i = 0; i < v->n; ++i ) ++dummy[ rand() % 4 ];
return;
}
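/* worker1: interlocked increment of one of 4 shared U16s chosen at random,
   all within the same cache line. */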
void worker1( void *arg ) {
ARGS *v = (ARGS*)arg;
U32 i;
wait( &v->sem, 0, 1 );
for( i = 0; i < v->n; ++i ) _InterlockedIncrement16( (short volatile *)&v->counts[ rand() % 4 ] );
return;
}
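/* worker2: plain (unsynchronised) C increment of the same 4 shared U16s. */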
void worker2( void *arg ) {
ARGS *v = (ARGS*)arg;
U32 i;
wait( &v->sem, 0, 1 );
for( i = 0; i < v->n; ++i ) ++v->counts[ rand() % 4 ];
return;
}
int main( int argc, char **argv ) {
U32 workers = argc > 1 ? atoi( argv[ 1 ] ) : 4;
U32 n = argc > 2 ? atoi( argv[ 2 ] ) : 100000000;
ARGS a;
HANDLE pool;
LARGE_INTEGER freq, start, stop;
QueryPerformanceFrequency( &freq );
a.sem = 0ull; a.n = n; a.version = 0ull;
lock( &a.sem, 0, 0 );
pool = tpool( &worker0, workers, &a );
QueryPerformanceCounter( &start );
unlock( &a.sem, 0 );
WaitForSingleObject( pool, INFINITE );
QueryPerformanceCounter( &stop );
printf( "%4u threads x %10u iterations of normal increment non-sha
+red took %14.9f seconds\n",
workers, n, (double)( stop.QuadPart - start.QuadPart ) / freq.
+QuadPart
);
a.sem = 0ull; a.n = n; a.version = 0ull;
lock( &a.sem, 0, 0 );
pool = tpool( &worker1, workers, &a );
QueryPerformanceCounter( &start );
unlock( &a.sem, 0 );
WaitForSingleObject( pool, INFINITE );
QueryPerformanceCounter( &stop );
printf( "%4u threads x %10u iterations of interlockedIncrement1
+6() took %14.9f seconds\n",
workers, n, (double)( stop.QuadPart - start.QuadPart ) / freq.
+QuadPart
);
a.sem = 0ull; a.n = n; a.version = 0ull;
lock( &a.sem, 0, 0 );
pool = tpool( &worker2, workers, &a );
QueryPerformanceCounter( &start );
unlock( &a.sem, 0 );
WaitForSingleObject( pool, INFINITE );
QueryPerformanceCounter( &stop );
printf( "%4u threads x %10u iterations of normal increment shared
+ took %14.9f seconds\n",
workers, n, (double)( stop.QuadPart - start.QuadPart ) / freq.
+QuadPart
);
return 0;
}
That spawns a number of threads and has them all block on a semaphore until they are all up and running; then the main thread clears the semaphore and they all iterate a large number of times incrementing their target (one of 4 U16s chosen at random) and then end. The main thread times from the clearing of the semaphore to the point all the threads have ended.
The three different cases are:
- An 'idle' loop:
Each thread simply increments one of 4 thread-local U16s chosen at random (volatile, to prevent it being optimised away).
This gives an uncontended baseline.
- An interlocked, shared-state loop:
Each thread interlocked-increments one of 4 shared U16s chosen at random (all in the same cache line).
- A free-running (unsafe) shared-state loop:
Each thread performs a normal C-style increment of one of 4 shared U16s chosen at random (all in the same cache line).
The results show that whilst unlocked increments (of shared state) are slightly faster, the difference is very marginal. And even with large numbers of threads contending, the timings remain very nearly linear in the number of threads:
C:\test>for %t in (1,2,4,16,63,255) do @interlock %t 10000000
   1 threads x   10000000 iterations of normal increment non-shared took    0.269178904 seconds
   1 threads x   10000000 iterations of interlockedIncrement16()    took    0.501895073 seconds
   1 threads x   10000000 iterations of normal increment shared     took    0.327297743 seconds
   2 threads x   10000000 iterations of normal increment non-shared took    0.268091336 seconds
   2 threads x   10000000 iterations of interlockedIncrement16()    took    1.407993195 seconds
   2 threads x   10000000 iterations of normal increment shared     took    1.072541343 seconds
   4 threads x   10000000 iterations of normal increment non-shared took    0.331791610 seconds
   4 threads x   10000000 iterations of interlockedIncrement16()    took    4.877205343 seconds
   4 threads x   10000000 iterations of normal increment shared     took    4.260821138 seconds
  16 threads x   10000000 iterations of normal increment non-shared took    1.172022981 seconds
  16 threads x   10000000 iterations of interlockedIncrement16()    took   19.987548697 seconds
  16 threads x   10000000 iterations of normal increment shared     took   17.639230964 seconds
  63 threads x   10000000 iterations of normal increment non-shared took    4.423643228 seconds
  63 threads x   10000000 iterations of interlockedIncrement16()    took   78.821976536 seconds
  63 threads x   10000000 iterations of normal increment shared     took   69.604602540 seconds
 255 threads x   10000000 iterations of normal increment non-shared took   17.773377901 seconds
 255 threads x   10000000 iterations of interlockedIncrement16()    took  320.197832965 seconds
 255 threads x   10000000 iterations of normal increment shared     took  282.370355450 seconds
C:\test>interlock 1023 1000000
1023 threads x    1000000 iterations of normal increment non-shared took    7.115383938 seconds
1023 threads x    1000000 iterations of interlockedIncrement16()    took  127.386793294 seconds
1023 threads x    1000000 iterations of normal increment shared     took  112.850957873 seconds
Note that the last set of numbers is for 1/10th the number of iterations (I didn't want to wait the 40-odd minutes 'cos my meal's almost ready); but the (slightly sub-)linearity still holds. Of course, that is in part because I only have 4 cores; the additional contention from a greater number of concurrent cores might change things, but I don't have free access to bigger hardware that I could use for personal stuff at the moment. Maybe later in the week.
Thanks for sharing your question, references, and plans. It has been quite interesting.
Thanks for the feedback and pushback.