The major benefit to the approach described in the video is that a huge hash can be resized while other updates continue to happen. But both approaches involve full copies of the hash existing at the same time.
I get that; and hope I'll be able to utilise the technique for resizing.
(I'm not sure yet, because my hash uses nested hashes for the collision buckets, which in turn can contain nested hashes... So I can already be resizing at multiple levels, and I haven't yet wrapped my brain around doing that lock-free and on-the-fly!)
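To illustrate the two-copies point, here is a much-simplified sketch (table_t, live and publish_resized() are names I've made up purely for illustration, not my actual code): during a resize a second, larger table has to be built alongside the live one, and only then is the live pointer swapped over atomically; the old copy cannot be freed immediately because other threads may still be inside it.
#include <windows.h>

/* Sketch only: table_t, 'live' and publish_resized() are illustrative names.
   While a resize is in progress, both the old and the new table exist in full. */
typedef struct table_t {
    size_t  nbuckets;
    void  **buckets;
} table_t;

table_t * volatile live;    /* readers and writers always go through this pointer */

void publish_resized( table_t *bigger ) {
    /* Atomically swap the live pointer; the return value is the previous (old) table. */
    table_t *old = (table_t*)InterlockedExchangePointer( (PVOID volatile *)&live, bigger );
    /* 'old' cannot be freed here: other threads (and iterators) may still be using it. */
    (void)old;
}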
The problem I have with using this in conjunction with iterators is that the user might create an iterator, iterate the first few keys, and then simply abandon it, leaving it pointing into an old copy of the hash. Either I have to keep the old hash around until the iterator completes or is explicitly freed, or I risk leaving dangling iterators when a resized hash finally frees up the previous generation.
I have real problems seeing a clean way (or indeed any way) of continuing an iterator, taken on the previous generation, once a resize has occurred. At some point the old generation will be done with -- by the on-the-fly resize operation -- and become eligible for freeing; but if there are active-but-abandoned (or just slow) iterators still traversing that old generation, freeing it will leave those iterators dangling.
Deferring the free until all existing iterators complete or are explicitly closed could mean the old generation never goes away; and I see no way to logically carry an iterator from a previous generation over to the later one. Hence my decision to abort an iterator if a resize occurs, unless the user explicitly requests otherwise -- and be it on his head if he does.
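For what it's worth, the shape of the check I have in mind is roughly this (a sketch only; hash_t, iter_t and the 'resizes' counter are illustrative names, not my actual implementation): each iterator snapshots a resize counter when it is created, and every attempt to advance it compares that snapshot against the live counter and aborts if a resize has completed in the meantime.
#include <windows.h>
#include <stddef.h>

/* Sketch only: hash_t, iter_t and the 'resizes' counter are illustrative
   names for the idea, not the real implementation. */
typedef struct {
    volatile LONG resizes;   /* bumped (interlocked) each time a resize completes */
    /* ... buckets, counts, etc. ... */
} hash_t;

typedef struct {
    hash_t *h;
    LONG    gen;             /* snapshot of h->resizes at iterator creation */
    size_t  pos;             /* current position in the bucket array        */
} iter_t;

void iter_init( iter_t *it, hash_t *h ) {
    it->h   = h;
    it->gen = h->resizes;    /* remember which generation we started on */
    it->pos = 0;
}

/* Returns 0 if a resize has occurred since the iterator was created
   (the iterator is then dead); otherwise advances and returns 1.
   Actually walking the buckets is elided here. */
int iter_next( iter_t *it, void **key, void **val ) {
    if( it->h->resizes != it->gen ) return 0;   /* stale generation: abort */
    /* ... find next occupied slot, fill *key and *val, advance it->pos ... */
    (void)key; (void)val;
    return 1;
}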
Not that I'm trying to argue that you should use RCU, just pushing back lightly on one reason you stated for avoiding it.
Thank you. It is for exactly this reason that I posted the OP. It is all too easy to arrive at a conclusion without really having explored the possibilities, and, without challenge, go on to ignore them. For me, the very act of trying to write down my justifications/defense of my decisions is the single best guarantee that I've really thought things through.
the risk with this is that updates happen so frequently that you have to retry your CAS over and over, even an unbounded number of times
Understood. I've tried very hard to generate a test case where this happens and found it impossible on modern hardware. See below.
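For anyone reading along, the failure mode being warned about is the classic CAS retry loop, something like this (a sketch only; add_via_cas() and its attempt counter are mine, purely for illustration, and not part of the hash): the new value is recomputed and the compare-exchange retried until it finds the location unchanged, and in theory that loop can spin an unbounded number of times under heavy contention.
#include <windows.h>
#include <intrin.h>

/* Sketch only: add_via_cas() is an illustrative helper, not part of the hash.
   Returns the number of attempts needed; 1 means no interference was seen. */
unsigned add_via_cas( volatile LONG *target, LONG delta ) {
    LONG seen;
    unsigned attempts = 0;
    do {
        seen = *target;        /* read the current value                     */
        ++attempts;            /* count the attempts, purely out of interest */
    } while( _InterlockedCompareExchange( target, seen + delta, seen ) != seen );
    return attempts;           /* unbounded in theory                        */
}
Counting the attempts like this is one straightforward way of watching for that runaway-retry behaviour.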
With regard to people basing their fears of interlocked operations on benchmarks: the problem is that they all too often compare apples with oranges, by running interlocked and non-interlocked tests in single-threaded testcases, and on unshared, uncontended data. I hope my benchmark below avoids those failings.
The risk I've heard of with this is that, at least on some architectures, an interlocked increment is accomplished via a bus lock, and the bus lock can have (I've heard it claimed) a bigger negative impact on concurrency than a write lock would.
Yes. This was definitely a problem on early multi-socket NUMA setups, and on even earlier generations of multi-core CPUs.
However, on more modern chips -- including my now dated (2007) Intel Q6600 -- an exclusive lock is applied to the cache-line containing the target of the interlocked operation, thus ensuring residency (and fast access) for the duration of the read-modify-write operations; whilst leaving access to the rest of memory unhindered by that lock.
(This also deals with your cache coherency problem above. Once the cache line has been populated the first time, it becomes locked in core, and the coherency is guaranteed until it is unlocked; and for highly active lines, the hardware's LRU caching algorithms ensure that it stays around even when it is unlocked.)
By way of getting some feel for the cost of interlocked increment vs. unlocked increment to shared state vs. increments to thread-local state, I wrote this test app:
#pragma warning(disable:4996)

#include <windows.h>
#include <winhttp.h>
#include <stdio.h>
#include <stdlib.h>
#include <process.h>

#include "mytypes.h"
#include "tpool.h"
#include "bitlock.h"

typedef struct {
    U64 sem;                            /* semaphore the workers block on until main() releases it */
    U32 n;                              /* iterations per worker                                    */
    union {
        U64 version;
        struct { U16 counts[ 4 ]; };    /* 4 shared counters, all in one cache line                 */
    };
} ARGS;

/* Baseline: increments one of 4 thread-local (volatile) U16s chosen at random. */
void worker0( void *arg ) {
    ARGS *v = (ARGS*)arg;
    volatile U16 dummy[ 4 ] = { 0 };
    U32 i;
    wait( &v->sem, 0, 1 );
    for( i = 0; i < v->n; ++i ) ++dummy[ rand() % 4 ];
    return;
}

/* Interlocked increments of one of 4 shared U16s (all in the same cache line). */
void worker1( void *arg ) {
    ARGS *v = (ARGS*)arg;
    U32 i;
    wait( &v->sem, 0, 1 );
    for( i = 0; i < v->n; ++i ) _InterlockedIncrement16( &v->counts[ rand() % 4 ] );
    return;
}

/* Plain (unsafe) increments of one of 4 shared U16s (all in the same cache line). */
void worker2( void *arg ) {
    ARGS *v = (ARGS*)arg;
    U32 i;
    wait( &v->sem, 0, 1 );
    for( i = 0; i < v->n; ++i ) ++v->counts[ rand() % 4 ];
    return;
}

int main( int argc, char **argv ) {
    U32 workers = argc > 1 ? atoi( argv[ 1 ] ) : 4;
    U32 n       = argc > 2 ? atoi( argv[ 2 ] ) : 100000000;
    ARGS a;
    HANDLE pool;
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency( &freq );

    /* Case 1: thread-local increments (uncontended baseline). */
    a.sem = 0ull; a.n = n; a.version = 0ull;
    lock( &a.sem, 0, 0 );
    pool = tpool( &worker0, workers, &a );
    QueryPerformanceCounter( &start );
    unlock( &a.sem, 0 );
    WaitForSingleObject( pool, INFINITE );
    QueryPerformanceCounter( &stop );
    printf( "%4u threads x %10u iterations of normal increment non-shared took %14.9f seconds\n",
        workers, n, (double)( stop.QuadPart - start.QuadPart ) / freq.QuadPart
    );

    /* Case 2: interlocked increments of shared state. */
    a.sem = 0ull; a.n = n; a.version = 0ull;
    lock( &a.sem, 0, 0 );
    pool = tpool( &worker1, workers, &a );
    QueryPerformanceCounter( &start );
    unlock( &a.sem, 0 );
    WaitForSingleObject( pool, INFINITE );
    QueryPerformanceCounter( &stop );
    printf( "%4u threads x %10u iterations of interlockedIncrement16()    took %14.9f seconds\n",
        workers, n, (double)( stop.QuadPart - start.QuadPart ) / freq.QuadPart
    );

    /* Case 3: plain, unsynchronised increments of shared state. */
    a.sem = 0ull; a.n = n; a.version = 0ull;
    lock( &a.sem, 0, 0 );
    pool = tpool( &worker2, workers, &a );
    QueryPerformanceCounter( &start );
    unlock( &a.sem, 0 );
    WaitForSingleObject( pool, INFINITE );
    QueryPerformanceCounter( &stop );
    printf( "%4u threads x %10u iterations of normal increment shared     took %14.9f seconds\n",
        workers, n, (double)( stop.QuadPart - start.QuadPart ) / freq.QuadPart
    );
    return 0;
}
That spawns a number of threads and has them all block on a semaphore until they are all up and running; then the main thread clears the semaphore, each thread iterates a large number of times incrementing its target (one of 4 U16s chosen at random), and then ends. The main thread times from the clearing of the semaphore to the point at which all the threads have ended.
The three different cases are:
- An 'idle' loop (worker0):
  Each thread simply increments one of 4 thread-local U16s, chosen at random (declared volatile to prevent the loop being optimised away).
  This gives an uncontended baseline.
- An interlocked, shared-state loop (worker1):
  Each thread applies _InterlockedIncrement16() to one of 4 shared U16s, chosen at random (all in the same cache line).
- A free-running (unsafe) shared-state loop (worker2):
  Each thread performs a normal C-style increment of one of 4 shared U16s, chosen at random (all in the same cache line).
The results show that whilst unlocked increments (of shared state) are slightly faster, the difference is very marginal. And even with large numbers of threads contending, the timings remain very close to linear in the number of threads:
C:\test>for %t in (1,2,4,16,63,255) do @interlock %t 10000000
   1 threads x   10000000 iterations of normal increment non-shared took    0.269178904 seconds
   1 threads x   10000000 iterations of interlockedIncrement16()    took    0.501895073 seconds
   1 threads x   10000000 iterations of normal increment shared     took    0.327297743 seconds
   2 threads x   10000000 iterations of normal increment non-shared took    0.268091336 seconds
   2 threads x   10000000 iterations of interlockedIncrement16()    took    1.407993195 seconds
   2 threads x   10000000 iterations of normal increment shared     took    1.072541343 seconds
   4 threads x   10000000 iterations of normal increment non-shared took    0.331791610 seconds
   4 threads x   10000000 iterations of interlockedIncrement16()    took    4.877205343 seconds
   4 threads x   10000000 iterations of normal increment shared     took    4.260821138 seconds
  16 threads x   10000000 iterations of normal increment non-shared took    1.172022981 seconds
  16 threads x   10000000 iterations of interlockedIncrement16()    took   19.987548697 seconds
  16 threads x   10000000 iterations of normal increment shared     took   17.639230964 seconds
  63 threads x   10000000 iterations of normal increment non-shared took    4.423643228 seconds
  63 threads x   10000000 iterations of interlockedIncrement16()    took   78.821976536 seconds
  63 threads x   10000000 iterations of normal increment shared     took   69.604602540 seconds
 255 threads x   10000000 iterations of normal increment non-shared took   17.773377901 seconds
 255 threads x   10000000 iterations of interlockedIncrement16()    took  320.197832965 seconds
 255 threads x   10000000 iterations of normal increment shared     took  282.370355450 seconds
C:\test>interlock 1023 1000000
1023 threads x    1000000 iterations of normal increment non-shared took    7.115383938 seconds
1023 threads x    1000000 iterations of interlockedIncrement16()    took  127.386793294 seconds
1023 threads x    1000000 iterations of normal increment shared     took  112.850957873 seconds
Note that the last set of numbers is for 1/10th the number of iterations (I didn't want to wait the 40-odd minutes 'cos my meal's almost ready); but still the (slightly sub-)linearity holds. Of course, that is in part because I only have 4 cores; the additional contention from a greater number of concurrent cores might change things, but I don't have free access to bigger hardware that I could use for personal stuff at the moment. Maybe later in the week.
Thanks for sharing your question, references, and plans. It has been quite interesting.
Thanks for the feedback and pushback.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.