in reply to Re: Simple arithmetic? (And the winner is ... )
in thread Simple arithmetic?

"I wuz in ur um-puter, eating ur loopz."

I don't know how MSVC handles things, but generally there are other ways to prevent unwanted benchmark optimizations. Using volatile is almost never necessary, and it very often has implications different from what was intended. Just one example, quoting gcc.info:

Accesses to non-volatile objects are not ordered with respect to volatile accesses. You cannot use a volatile object as a memory barrier to order a sequence of writes to non-volatile memory.

Usually, a barrier of some sort is indicated. With gcc/icc/clang, the following optimization barrier should work:

#define barrier() asm volatile("": : :"memory")
(The volatile here may be needed to work around some compiler bugs.)

The ways to prevent a compiler from being too smart:

So, it's probably best to use both the noinline attribute and an optimization barrier.

Now, getting back to the original topic. An lcm with 4096 simply means having twelve zero bits at the end. I'd code it like this:

$n <<= 1 while $n & 0xfff;
(Counting trailing zeroes can also be optimized. Some x86 CPUs (Haswell and later, with BMI1) have an instruction, TZCNT, for that. Wikipedia links to Bit Twiddling Hacks; there are other sites and books on the topic.)

Update: added the 4th clause to the above list.

Replies are listed 'Best First'.
Re^3: Simple arithmetic? (And the winner is ... )
by BrowserUk (Patriarch) on Mar 09, 2015 at 12:06 UTC
    So, it's probably best to use both the noinline attribute and an optimization barrier.

    The problem is that nothing I do inside the function, whether inlined or not, will prevent the compiler optimising the loop away. The decision, it appears, is that because the variable to which the result of the function is assigned is never used outside the loop, and the function has no side effects, the loop is redundant. (Though I haven't worked out why the second loop isn't optimised away for similar reasons.)

    The use of volatile on that variable works because the compiler can no longer prove that the value is never used. Whilst volatile can have other effects upon the code generated, these mostly relate to multi-threaded code, which is not in play here. Also, the MS compiler has extended guarantees with regard to volatile (on non-ARM systems):

    /volatile:ms

    Selects Microsoft extended volatile semantics, which add memory ordering guarantees beyond the ISO-standard C++ language. Acquire/release semantics are guaranteed on volatile accesses. However, this option also forces the compiler to generate hardware memory barriers, which might add significant overhead on ARM and other weak memory-ordering architectures. If the compiler targets any platform except ARM, this is the default interpretation of volatile.

    Conversely, the _ReadBarrier/_WriteBarrier/_ReadWriteBarrier intrinsics are now deprecated:

    The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics and the MemoryBarrier macro are all deprecated and should not be used. For inter-thread communication, use mechanisms such as atomic_thread_fence and std::atomic<T>, which are defined in the C++ Standard Library Reference. For hardware access, use the /volatile:iso compiler option together with the volatile (C++) keyword.

    The volatile keyword also seems to impose lesser, but sufficient, constraints. (That's an interpretation, rather than an MS-stated fact.)

    Now, getting back to the original topic. Lcm with 4096 means simply to have twelve zeroes at the end. I'd code it like this:
    $n <<= 1 while $n & 0xfff;

    Update:

    Um. That code doubles $n whilst it's lower than 4096; the original code calls for halving it; at most 12 times?

    And if I just switched the shift direction, an input $n of 0xffffffffff would be reduced to 0 before the loop terminated.

    Ignore the above; I left the lcm = n * 4096; in, where (I assume) you meant to replace:

    for( c = 1; ( ~n & 1 ) && ( c < 4096 ); c <<= 1 ) n >>= 1;
    lcm = n * 4096;

    With:

    while( n & 0xfff ) n <<= 1;
    lcm = n;

    Which works, but takes twice as long as the original version:

    C:\test\C>gcm
    gcm : 2132901888
    gcm2: 2132901888
    gcm3: 2132901888
    anonyM: gcm  for s=2147483648 & r=1 to 1073741824 took:33.850023994460
    anonyM: gcm2 for s=2147483648 & r=1 to 1073741824 took:46.293298113614
    anonyM: gcm3 for s=2147483648 & r=1 to 1073741824 took:64.208097030422

    /Update:

    (Counting trailing zeroes can also be optimized.)

    I thought about that last night and tried using the _BitScanForward64() intrinsic:

    U64 gcm2( U64 max, U64 n ) {
        U32 b;
        U64 c, lcm;
        _BitScanForward64( &b, n );
        n >>= min( b, 12 );
        lcm = n * 4096;
        return ( max / lcm ) * lcm;
    }

    Which looked like it should be more efficient, compiling to this:

    PUBLIC gcm2
    ; Function compile flags: /Ogtpy
    _TEXT SEGMENT
    max$ = 8
    n$ = 16
    gcm2 PROC
    ; 24 :     U32 b;
    ; 25 :     U64 c, lcm;
    ; 26 :
    ; 27 :     _BitScanForward64( &b, n );
        bsf   rax, rdx
        mov   r8, rcx
        mov   r9, rdx
    ; 28 :     n >>= min( b, 12 );
        mov   ecx, 12
        cmp   eax, ecx
        cmovb ecx, eax
    ; 29 :     lcm = n * 4096;
    ; 30 :     return ( max / lcm ) * lcm;
        xor   edx, edx
        mov   rax, r8
        shr   r9, cl
        shl   r9, 12
        div   r9
        imul  rax, r9
    ; 31 : }
        ret   0
    gcm2 ENDP

    Rather than this:

    PUBLIC gcm
    ; Function compile flags: /Ogtpy
    _TEXT SEGMENT
    max$ = 8
    n$ = 16
    gcm PROC
    ; 16 :     U64 c, lcm;
    ; 17 :
    ; 18 :     for( c = 1; ( ~n & 1 ) && ( c < 4096 ); c <<= 1 ) n >>= 1;
        movzx eax, dl
        mov   r8d, 1
        mov   r9, rdx
        not   al
        test  al, r8b
        je    SHORT $LN1@gcm
    $LL3@gcm:
        cmp   r8, 4096            ; 00001000H
        jae   SHORT $LN1@gcm
        shr   r9, 1
        add   r8, r8
        movzx eax, r9b
        not   al
        test  al, 1
        jne   SHORT $LL3@gcm
    $LN1@gcm:
    ; 19 :     lcm = n * 4096;
        shl   r9, 12
    ; 20 :     return ( max / lcm ) * lcm;
        xor   edx, edx
        mov   rax, rcx
        div   r9
        imul  rax, r9
    ; 21 : }
        ret   0
    gcm ENDP

    But the reality turns out to be disappointing: it's about 50% slower:

    C:\test\C>gcm
    anonyM: gcm  for s=2147483648 & r=1 to 1073741824 took: 33.920637491715
    anonyM: gcm2 for s=2147483648 & r=1 to 1073741824 took: 46.151947659089
    oiskuu: gcm  for s=2147483648 & r=1 to 1073741824 took:330.492011773110

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^3: Simple arithmetic? (And the winner is ... )
by BrowserUk (Patriarch) on Mar 09, 2015 at 13:51 UTC

    FWIW: Of the many things I've tried to improve on Anonymous Monk's gcm(), the only one that has worked is this:

    U64 gcm4( U64 max, U64 n ) {
        U64 c, lcm;
        for( c = 1; ( ~n & 1 ) && ( c < 12 ); ++c ) n >>= 1;
        lcm = n * 4096;
        return ( max / lcm ) * lcm;
    }

    Incrementing seems to be fractionally faster than shifting:

    anonyM: gcm  for s=2147483648 & r=1 to 1073741824 took:33.850345514550   ### original
    anonyM: gcm4 for s=2147483648 & r=1 to 1073741824 took:33.590549831120   ### my tweak

    But the difference, 1/4 second on more than 1 billion calls, is so minimal (0.2 nanoseconds per call), and the shifting better captures the semantics, that I've stuck with the original.


Re^3: Simple arithmetic? (And the winner is ... )
by BrowserUk (Patriarch) on Mar 09, 2015 at 14:36 UTC

    And no sooner have I said that than I think of a way to use _BitScanForward64() that improves upon the original by 70+%:

    C:\test\C>gcm
    anonyM: gcm  for s=2147483648 & r=1 to 1073741824 took:8.535316994670
    anonyM: gcm2 for s=2147483648 & r=1 to 1073741824 took:2.271266720696

    Pre-masking the scanned variable avoids making the later subtraction conditional:

    U64 gcm2( U64 max, U64 lcm ) {
        I32 b;
        _BitScanForward64( &b, lcm & 0xfff );
        lcm <<= ( 12 - b );
        return ( max / lcm ) * lcm;
    }


      Fixed version with a quick demonstration of the problem:

      #include <stdio.h>
      #include <stdint.h>
      #include <stdlib.h>

      uint64_t gcm2( uint64_t max, uint64_t lcm ) {
          int b = __builtin_ctzl( lcm & 0xfff );
          lcm <<= ( 12 - b );
          return ( max / lcm ) * lcm;
      }

      uint64_t gcm(uint64_t max, uint64_t lcm) {
          lcm <<= 12 - __builtin_ctzl(lcm | 0x1000);
          return max - (max % lcm);
      }

      int main(int argc, char *argv[]) {
          unsigned long max = 12345 << 10;
          while (argc-- > 1) {
              unsigned long v = strtoul(argv[argc], NULL, 10);
              printf("gcm2(%lu,%lu) = %lu\n", max, v, gcm2(max, v));
              printf("gcm(%lu,%lu) = %lu\n", max, v, gcm(max, v));
          }
          return 0;
      }
      $ ./a.out 12 77 4096
      gcm2(12641280,4096) = 0
      gcm(12641280,4096) = 12640256
      gcm2(12641280,77) = 12615680
      gcm(12641280,77) = 12615680
      gcm2(12641280,12) = 12632064
      gcm(12641280,12) = 12632064