in reply to Simple arithmetic?

By an almost unbelievable margin of 300 million times faster, oiskuu's implementation beats Anonymonk's implementation:

C:\test\C>gcm 3221225472
anonyM: gcm for s=3221225472 & r=1 to 1610612736 took:13.076620219
oiskuu: gcm for s=3221225472 & r=1 to 1610612736 took: 0.000000041

(And yes. I've verified the results. Otherwise I would have posted this hours ago.)

It all comes down to the extreme optimisability of the tail-recursive Euclidean gcd algorithm, which converges very, very fast.

The benchmark code (uncomment the commented-out lines for verification):

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>     /* for __rdtsc() under MSVC */
#include "mytypes.h"

U64 gcd( U64 a, U64 b ) { return !b ? a : gcd( b, a % b ); }
U64 lcm( U64 a, U64 b ) { return a * b / gcd( a, b ); }

U64 gcm( U64 max, U64 n ) {
    U64 c, lcm;
    for( c = 1; ( ~n & 1 ) && ( c < 4096 ); c <<= 1 ) n >>= 1;
    lcm = n * 4096;
    return ( max / lcm ) * lcm;
}

#define GB ( 1024ull * 1024ull * 1024ull )
#define PAGE 4096

void main( int argc, char **argv ) {
    U64 s = argc > 1 ? _atoi64( argv[ 1 ] ) : 2*GB;
    U64 r, start, end, gsm;
//  U64 *res = malloc( sizeof(U64) * (s>>1) );

    start = __rdtsc();
    for( r = 1; r < ( s >> 1 ); ++r ) {
//      res[ r ] =
            gsm = gcm( s, r );
    }
    end = __rdtsc();
    printf( "anonyM: gcm for s=%I64u & r=1 to %I64u took:%.9f\n", s, s>>1, (double)( end - start ) / 2394046359.0 );

    start = __rdtsc();
    for( r = 1; r < ( s >> 1 ); ++r ) {
//      if( res[ r ] !=
            ( gsm = s - ( s % lcm( r, 4096 ) ) )
//      ) printf( "%I64u != %I64u\n", gsm, res[ r ] )
        ;
    }
    end = __rdtsc();
    printf( "oiskuu: gcm for s=%I64u & r=1 to %I64u took:%.9f\n", s, s>>1, (double)( end - start ) / 2394046359.0 );
    return;
}

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Re^2: Simple arithmetic? (And the winner is ... )
by oiskuu (Hermit) on Mar 09, 2015 at 11:13 UTC

    "I wuz in ur um-puter, eating ur loopz."

    I don't know how MSVC handles things, but generally there are other ways to prevent unwanted benchmark optimizations. Using volatile is almost never necessary, and it very often has implications different from what was intended. Just one example, quoting gcc.info:

    Accesses to non-volatile objects are not ordered with respect to volatile accesses. You cannot use a volatile object as a memory barrier to order a sequence of writes to non-volatile memory.

    Usually, a barrier of some sort is indicated. With gcc/icc/clang, the following optimization barrier should work:

    #define barrier() asm volatile("": : :"memory")
    (Here the volatile may be needed due to some compiler bugs.)

    The ways to prevent a compiler from being too smart:

    • Put the test routine(s) in separate files ("compilation units"), link together. But nowadays we can have link time optimization (-flto), so care must be taken (incremental linking for those parts that want lto).
    • Mark your test routines with __attribute__((__noinline__)), so they don't get inlined. Result should be similar to having them in separate units. However, this does NOT prevent e.g. clang determining that the function is a nop, and optimizing it out!
    • Use optimization barrier(s) in the loop and/or test routine. This will make sure the loop stays, even if the body is a nop. Barrier in the test routine might be necessary, lest the compiler discover and hoist a "const" function.
    • Have the main loop calculate some sort of result. In other words, the program as a whole needs to have a side effect. E.g. if the test routine returns int, one might compute a checksum and print that in the end; will also serve as verification.
    So, it's probably best to use both the noinline attribute and an optimization barrier.
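    For illustration, here is a minimal sketch (not from the thread; gcc/clang syntax, and the routine name and checksum are choices of my own) of how those points combine: noinline keeps the call visible in the loop, the barrier pins each iteration, and the printed checksum gives the program a side effect.

    #include <stdint.h>
    #include <stdio.h>

    /* Optimization barrier: the compiler must assume memory may have changed. */
    #define barrier() asm volatile("" : : : "memory")

    /* noinline keeps the routine out of the calling loop, as per point 2. */
    __attribute__((__noinline__))
    static uint64_t gcm_under_test( uint64_t max, uint64_t n )
    {
        uint64_t c, l;
        for( c = 1; ( ~n & 1 ) && ( c < 4096 ); c <<= 1 ) n >>= 1;
        l = n * 4096;
        return ( max / l ) * l;
    }

    int main( void )
    {
        uint64_t s = 1ull << 31, sum = 0, r;

        for( r = 1; r < ( s >> 1 ); ++r ) {
            sum += gcm_under_test( s, r );   /* result feeds a checksum...            */
            barrier();                       /* ...and the barrier keeps each iteration */
        }
        printf( "checksum: %llu\n", (unsigned long long)sum );   /* whole-program side effect */
        return 0;
    }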

    Now, getting back to the original topic. The lcm with 4096 simply means having twelve zero bits at the end. I'd code it like this:

    $n <<= 1 while $n & 0xfff;
    (Counting trailing zeroes can also be optimized. Some x86 CPUs (Haswell onward) have an instruction for that. Wikipedia links to Bit Twiddling Hacks; there are other sites and books on the topic.)
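    For example, a sketch of my own (not from the post): the usual intrinsics for a trailing-zero count, wrapped so the same code builds under either MSVC or gcc/clang. The wrapper name ctz64 is just an illustrative choice.

    #include <stdint.h>

    #if defined(_MSC_VER)
    #  include <intrin.h>
    static unsigned ctz64( uint64_t v )            /* v must be non-zero */
    {
        unsigned long index;
        _BitScanForward64( &index, v );            /* index of the lowest set bit */
        return (unsigned)index;
    }
    #else
    static unsigned ctz64( uint64_t v )            /* v must be non-zero */
    {
        return (unsigned)__builtin_ctzll( v );     /* gcc/clang builtin */
    }
    #endif

    With that, the shift loop collapses to n <<= 12 - ctz64( n | 0x1000 ); the | 0x1000 caps the count at twelve, so an n that is already a multiple of 4096 is left alone.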

    Update: added the 4th clause to above list.

      So, it's probably best to use both the noinline attribute and an optimization barrier.

      The problem is that nothing I do inside the function, whether inlined or not, will prevent the compiler optimising the loop away. The decision is (or appears to be) that because the variable to which the result of the function is assigned is never used outside the loop, and the function has no side effects, the loop is redundant. (Though I haven't worked out why the second loop isn't optimised away for similar reasons.)

      The use of volatile on that variable works because the compiler cannot decide that the value is never used. Whilst volatile can have other effects upon the code generated, these mostly relate to multi-threaded code, which is not in play here. Also, the MS compiler has extended guarantees with regard to volatile (on non-ARM systems):

      /volatile:ms

      Selects Microsoft extended volatile semantics, which add memory ordering guarantees beyond the ISO-standard C++ language. Acquire/release semantics are guaranteed on volatile accesses. However, this option also forces the compiler to generate hardware memory barriers, which might add significant overhead on ARM and other weak memory-ordering architectures. If the compiler targets any platform except ARM, this is the default interpretation of volatile.

      Conversely, the _ReadBarrier/_WriteBarrier/_ReadWriteBarrier intrinsics are now deprecated:

      The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics and the MemoryBarrier macro are all deprecated and should not be used. For inter-thread communication, use mechanisms such as atomic_thread_fence and std::atomic<T>, which are defined in the C++ Standard Library Reference. For hardware access, use the /volatile:iso compiler option together with the volatile (C++) keyword.

      The volatile keyword also seems to impose lesser, but sufficient, constraints. (That's my interpretation, rather than an MS-stated fact.)

      Now, getting back to the original topic. Lcm with 4096 means simply to have twelve zeroes at the end. I'd code it like this:
      $n <<= 1 while $n & 0xfff;

      Update:

      Um. That code doubles $n whilst it's lower than 4096; the original code calls for halving it; at most 12 times?

      And if I just switched the shift direction, an input $n of 0xffffffffff would be reduced to 0 before the loop terminated.

      Ignore the above; I left the lcm = n * 4096; in, where (I assume) you meant to replace:

      for( c = 1; ( ~n & 1 ) && ( c < 4096 ); c <<= 1 ) n >>= 1; lcm = n * 4096;

      With:

      while( n & 0xfff ) n <<= 1; lcm = n;

      Which works, but takes twice as long as the original version:

      C:\test\C>gcm
      gcm : 2132901888
      gcm2: 2132901888
      gcm3: 2132901888
      anonyM: gcm for s=2147483648 & r=1 to 1073741824 took:33.850023994460
      anonyM: gcm2 for s=2147483648 & r=1 to 1073741824 took:46.293298113614
      anonyM: gcm3 for s=2147483648 & r=1 to 1073741824 took:64.208097030422

      /Update:

      (Counting trailing zeroes can also be optimized.)

      I thought about that last night and tried using the _BitScanForward64() intrinsic:

      U64 gcm2( U64 max, U64 n ) {
          U32 b;
          U64 c, lcm;
          _BitScanForward64( &b, n );
          n >>= min( b, 12 );
          lcm = n * 4096;
          return ( max / lcm ) * lcm;
      }

      Which looked like it should be more efficient, compiling to this:

      PUBLIC gcm2
      ; Function compile flags: /Ogtpy
      _TEXT SEGMENT
      max$ = 8
      n$ = 16
      gcm2 PROC
      ; 24   :     U32 b;
      ; 25   :     U64 c, lcm;
      ; 26   :
      ; 27   :     _BitScanForward64( &b, n );
              bsf     rax, rdx
              mov     r8, rcx
              mov     r9, rdx
      ; 28   :     n >>= min( b, 12 );
              mov     ecx, 12
              cmp     eax, ecx
              cmovb   ecx, eax
      ; 29   :     lcm = n * 4096;
      ; 30   :     return ( max / lcm ) * lcm;
              xor     edx, edx
              mov     rax, r8
              shr     r9, cl
              shl     r9, 12
              div     r9
              imul    rax, r9
      ; 31   : }
              ret     0
      gcm2 ENDP

      Rather than this:

      PUBLIC gcm
      ; Function compile flags: /Ogtpy
      _TEXT SEGMENT
      max$ = 8
      n$ = 16
      gcm PROC
      ; 16   :     U64 c, lcm;
      ; 17   :
      ; 18   :     for( c = 1; ( ~n & 1 ) && ( c < 4096 ); c <<= 1 ) n >>= 1;
              movzx   eax, dl
              mov     r8d, 1
              mov     r9, rdx
              not     al
              test    al, r8b
              je      SHORT $LN1@gcm
      $LL3@gcm:
              cmp     r8, 4096                ; 00001000H
              jae     SHORT $LN1@gcm
              shr     r9, 1
              add     r8, r8
              movzx   eax, r9b
              not     al
              test    al, 1
              jne     SHORT $LL3@gcm
      $LN1@gcm:
      ; 19   :     lcm = n * 4096;
              shl     r9, 12
      ; 20   :     return ( max / lcm ) * lcm;
              xor     edx, edx
              mov     rax, rcx
              div     r9
              imul    rax, r9
      ; 21   : }
              ret     0
      gcm ENDP

      But the reality turns out to be disappointing: it is roughly a third slower:

      C:\test\C>gcm
      anonyM: gcm for s=2147483648 & r=1 to 1073741824 took: 33.920637491715
      anonyM: gcm2 for s=2147483648 & r=1 to 1073741824 took: 46.151947659089
      oiskuu: gcm for s=2147483648 & r=1 to 1073741824 took:330.492011773110


      And no sooner have I said that than I think of a way to use _BitScanForward64() that improves upon the original by 70+%:

      C:\test\C>gcm
      anonyM: gcm for s=2147483648 & r=1 to 1073741824 took:8.535316994670
      anonyM: gcm2 for s=2147483648 & r=1 to 1073741824 took:2.271266720696

      Pre-masking the scanned var avoids the later subtraction being conditional:

      U64 gcm2( U64 max, U64 lcm ) {
          I32 b;
          _BitScanForward64( &b, lcm & 0xfff );
          lcm <<= ( 12 - b );
          return ( max / lcm ) * lcm;
      }


        Fixed version with a quick demonstration of the problem:

        #include <stdio.h>
        #include <stdint.h>
        #include <stdlib.h>

        uint64_t gcm2( uint64_t max, uint64_t lcm )
        {
            int b = __builtin_ctzl( lcm & 0xfff );
            lcm <<= ( 12 - b );
            return ( max / lcm ) * lcm;
        }

        uint64_t gcm(uint64_t max, uint64_t lcm)
        {
            lcm <<= 12 - __builtin_ctzl(lcm | 0x1000);
            return max - (max % lcm);
        }

        int main(int argc, char *argv[])
        {
            unsigned long max = 12345 << 10;

            while (argc-- > 1) {
                unsigned long v = strtoul(argv[argc], NULL, 10);
                printf("gcm2(%lu,%lu) = %lu\n", max, v, gcm2(max, v));
                printf("gcm(%lu,%lu) = %lu\n", max, v, gcm(max, v));
            }
            return 0;
        }
        $ ./a.out 12 77 4096
        gcm2(12641280,4096) = 0
        gcm(12641280,4096) = 12640256
        gcm2(12641280,77) = 12615680
        gcm(12641280,77) = 12615680
        gcm2(12641280,12) = 12632064
        gcm(12641280,12) = 12632064
        

      FWIW: Of many things I've tried to improve on Anonymous Monk's gcm(), the only thing that has worked is this:

      U64 gcm4( U64 max, U64 n ) {
          U64 c, lcm;
          for( c = 1; ( ~n & 1 ) && ( c < 12 ); ++c ) n >>= 1;
          lcm = n * 4096;
          return ( max / lcm ) * lcm;
      }

      Incrementing seems to be fractionally faster than shifting:

      anonyM: gcm for s=2147483648 & r=1 to 1073741824 took:33.850345514550    ### original
      anonyM: gcm4 for s=2147483648 & r=1 to 1073741824 took:33.590549831120   ### my tweak

      But the difference, a quarter of a second over more than a billion calls (about 0.2 nanoseconds per call), is so minimal, and the shifting better captures the semantics, that I've stuck with the original.


Re^2: Simple arithmetic? (And the winner is ... )
by Anonymous Monk on Mar 08, 2015 at 20:46 UTC
    Well that's a weird result BUK :) So I actually compiled your code (with some changes to make it suitable for gcc on linux). I was getting strange results too: 1500 nanoseconds for the anonM algorithm, no matter what s was (at -O2). I'm curious why that happened. With some checksums to verify results the execution time became more realistic (with 1:10 bias towards bitshift stuff, but perhaps your compiler recognizes Euclid's algorithm? gcc 4.7 doesn't, it seems :) I don't know C very well, though (so here it is)

      Anonymonk's implementation wins, by an order of magnitude over oiskuu's:

      C:\test\C>gcm
      anonyM: gcm for s=2147483648 & r=1 to 1073741824 took: 33.607916866993
      oiskuu: gcm for s=2147483648 & r=1 to 1073741824 took:329.551997991197

      Thanks for the sanity check AnonyMonk!

      And the difference between this benchmark and the previous (unbelievable) one? One keyword:

      void main( int argc, char **argv ) {
          U64 s = argc > 1 ? _atoi64( argv[ 1 ] ) : 2*GB;
          U64 r, start, end;
          U64 volatile gsm;
          //  ^^^^^^^^  *** Prevents the compiler from optimising the loop away in its entirety ***  D'oh!

          start = __rdtsc();
          for( r = 1; r < ( s >> 1 ); ++r ) {
              gsm = gcm( s, r );
          }
          end = __rdtsc();
          printf( "anonyM: gcm for s=%I64u & r=1 to %I64u took:%.12f [%I64u]\n", s, s>>1, (double)( end - start ) / 2394046359.0, gsm );

          start = __rdtsc();
          for( r = 1; r < ( s >> 1 ); ++r ) {
              gsm = s - ( s % lcm( r, 4096 ) );
          }
          end = __rdtsc();
          printf( "oiskuu: gcm for s=%I64u & r=1 to %I64u took:%.12f [%I64u]\n", s, s>>1, (double)( end - start ) / 2394046359.0, gsm );
          return;
      }

        Wow. At one point I almost suspected something like that, and even tried to change the order of the algorithms, but... that didn't change anything (for some reason), and why the hell would the compiler optimize away just one loop? That's really obscure (**** compilers, how do they work? :)
      Well that's a weird result BUK :) So I actually compiled your code (with some changes to make it suitable for gcc on linux). I was getting strange results too. 1500 nanoseconds for anonM algorithm, no matter what s was (at -O2). I'm curious why that happened. With some checksums to verify results the execution time became more realistic (with 1:10 bias towards bitshift stuff,

      I understand your reaction. I had the same. I spent 4 hours trying to find something amiss. I failed.

      If you look at the commented-out verification code in what I posted above (with extra comments):

      void main( int argc, char **argv ) {
          U64 s = argc > 1 ? _atoi64( argv[ 1 ] ) : 2*GB;
          U64 r, start, end, gsm;
      //  U64 *res = malloc( sizeof(U64) * (s>>1) );         //// An array to store every result from loop 1 (anonyM)

          start = __rdtsc();
          for( r = 1; r < ( s >> 1 ); ++r ) {
      //      res[ r ] =                                      //// Assign the result to the array
                  gsm = gcm( s, r );
          }
          end = __rdtsc();
          printf( "anonyM: gcm for s=%I64u & r=1 to %I64u took:%.9f\n", s, s>>1, (double)( end - start ) / 2394046359.0 );

          start = __rdtsc();
          for( r = 1; r < ( s >> 1 ); ++r ) {
      //      if( res[ r ] !=                                 //// If the result from the first loop doesn't match the result from this loop
                  ( gsm = s - ( s % lcm( r, 4096 ) ) )
      //      ) printf( "%I64u != %I64u\n", gsm, res[ r ] )   ///// Yell about it.
              ;
          }
          end = __rdtsc();
          printf( "oiskuu: gcm for s=%I64u & r=1 to %I64u took:%.9f\n", s, s>>1, (double)( end - start ) / 2394046359.0 );
          return;
      }

      Of course, the additional code slows things down and changes the timing completely, but not a single mismatch is found.

      I can't think of a more comprehensive way to verify the algorithms than to directly compare their results for a full range of inputs?
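      For what it's worth, the checksum approach Anonymous Monk mentioned gives a lighter-weight check than the res[] array: sum both implementations over the full range and compare the totals. It's weaker than the element-by-element comparison, but cheap, and it keeps both loops alive. A minimal, self-contained sketch of my own arrangement (not code from the thread), reusing the routines posted above:

      #include <stdio.h>
      #include <stdint.h>

      static uint64_t gcd( uint64_t a, uint64_t b ) { return !b ? a : gcd( b, a % b ); }
      static uint64_t lcm( uint64_t a, uint64_t b ) { return a * b / gcd( a, b ); }

      /* Anonymous Monk's shift-based routine, as posted above. */
      static uint64_t gcm( uint64_t max, uint64_t n )
      {
          uint64_t c, l;
          for( c = 1; ( ~n & 1 ) && ( c < 4096 ); c <<= 1 ) n >>= 1;
          l = n * 4096;
          return ( max / l ) * l;
      }

      int main( void )
      {
          uint64_t s = 1ull << 31, sum1 = 0, sum2 = 0, r;

          for( r = 1; r < ( s >> 1 ); ++r ) {
              sum1 += gcm( s, r );                   /* shift-based version   */
              sum2 += s - ( s % lcm( r, 4096 ) );    /* lcm/gcd-based version */
          }
          /* Printing the sums gives the program a side effect (so neither loop can
             be optimised away), and unequal sums flag a disagreement somewhere. */
          printf( "checksums: %llu %llu (%s)\n",
                  (unsigned long long)sum1, (unsigned long long)sum2,
                  sum1 == sum2 ? "match" : "MISMATCH" );
          return 0;
      }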

      but perhaps your compiler recognizes Euclid's algorithm?

      I'll admit that I do not quite understand what the optimiser has done to the code.

      The Euclidean gcd() function is pretty easy to recognise in the assembler output:

      _TEXT SEGMENT
      a$ = 48
      b$ = 56
      gcd PROC
      ; 5    : U64 gcd( U64 a, U64 b ) {         ; a in rcx, b in rdx
      $LN5:
              sub     rsp, 40                    ; reserve some stack space
              mov     r8, rdx                    ; put a copy of b in r8
      ; 6    :     return !b ? a : gcd( b, a % b );
              mov     rax, rcx                   ; put a in rax (result register)
              test    rdx, rdx                   ; test b
              je      SHORT $LN4@gcd             ; jump to return if it is 0; ie. if b=0 return a
              xor     edx, edx                   ; empty rdx ready for div
              mov     rcx, r8                    ; move b into rcx ( ie. b becomes a for recursive call )
              div     r8                         ; divide rdx:rax (0:a) by r8 (b); leaves quotient in rax, the remainder (modulus) in rdx (thus b in the recursive call)
              call    gcd                        ; recurse
      $LN4@gcd:                                  ; result is in rax.
      ; 7    : }
              add     rsp, 40                    ; clean up the stack
              ret     0                          ; return popping 0 bytes from the stack
      gcd ENDP
      _TEXT ENDS

      Which is well and good, but the code in the loop that (should be) calling this function is very confusing...

      Realisation dawns!

      Without the verification code, the compiler recognises that the value calculated within the loop is never used outside of it, and simply optimises the entire loop away:

      ; 35   :     start = __rdtsc();
              rdtsc
              shl     rdx, 32                    ; 00000020H
              or      rax, rdx
              mov     rcx, rax
      ; 36   :     for( r = 1; r < ( s >> 1 ); ++r ) {
      ; 37   :         ( gsm = s - ( s % lcm( r, 4096 ) ) );
      ; 38   :     }
      ; 39   :     end = __rdtsc();
              rdtsc
              shl     rdx, 32                    ; 00000020H

      In other words, the times output are simply the time it takes to make two calls to __rdtsc().
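      A quick way to see that baseline is to time two back-to-back reads directly. A sketch of my own (not from the thread), assuming an x86 compiler that exposes __rdtsc via <intrin.h> on MSVC or <x86intrin.h> on gcc/clang:

      #include <stdio.h>

      #if defined(_MSC_VER)
      #  include <intrin.h>
      #else
      #  include <x86intrin.h>
      #endif

      int main( void )
      {
          /* Two back-to-back timestamp reads: roughly what the "optimised away"
             benchmark loop was actually measuring. */
          unsigned long long start = __rdtsc();
          unsigned long long end   = __rdtsc();

          printf( "back-to-back __rdtsc delta: %llu cycles\n", end - start );
          return 0;
      }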

