
So, probably it's best to use both the noinline attribute and an optimization barrier.

The problem is that nothing I do inside the function, whether it is inlined or not, will prevent the compiler optimising the loop away. The decision appears to be that because the variable to which the result of the function is assigned is never used outside the loop, and the function has no side effects, the loop is redundant. (Though I haven't worked out why the second loop isn't optimised away for similar reasons.)
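For illustration, here's a minimal sketch of that effect, not taken from my actual benchmark (the work() function, loop bound and constant are invented for the example). With optimisation enabled, a plain sum lets the compiler collapse the loop to its final iteration, or remove it entirely if sum is never read afterwards; volatile forces every store to be performed:

#include <stdio.h>

static unsigned work( unsigned n ) {    /* pure function: no side effects */
    return n * 2654435761u;
}

int main( void ) {
    volatile unsigned sum = 0;          /* as plain 'unsigned', the loop may be folded away */
    unsigned i;
    for( i = 0; i < 100000000u; ++i )
        sum = work( i );                /* volatile: every one of these stores must happen */
    printf( "%u\n", sum );
    return 0;
}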

The use of volatile on that variable works because the compiler then cannot decide that the value is never used. Whilst volatile can have other effects upon the generated code, these mostly relate to multi-threaded code, which is not in play here. Also, the MS compiler has extended guarantees with regard to volatile (on non-ARM systems):

/volatile:ms

Selects Microsoft extended volatile semantics, which add memory ordering guarantees beyond the ISO-standard C++ language. Acquire/release semantics are guaranteed on volatile accesses. However, this option also forces the compiler to generate hardware memory barriers, which might add significant overhead on ARM and other weak memory-ordering architectures. If the compiler targets any platform except ARM, this is the default interpretation of volatile.

Conversely, the _ReadBarrier/_WriteBarrier/_ReadWriteBarrier intrinsics are now deprecated:

The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics and the MemoryBarrier macro are all deprecated and should not be used. For inter-thread communication, use mechanisms such as atomic_thread_fence and std::atomic<T>, which are defined in the C++ Standard Library Reference. For hardware access, use the /volatile:iso compiler option together with the volatile (C++) keyword.

The volatile keyword also seems to impose lesser, but sufficient, constraints. (That's my interpretation, rather than an MS-stated fact.)
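As a rough illustration of what that deprecation notice points at, here is a minimal C11 publish/consume sketch using atomic_thread_fence in place of the old _ReadWriteBarrier; the scenario and names are invented and it has nothing to do with the benchmark above:

#include <stdatomic.h>
#include <stdio.h>

static int payload;                     /* data published by one thread, read by another */
static atomic_int ready;

void publisher( void ) {
    payload = 42;                                   /* plain store */
    atomic_thread_fence( memory_order_release );    /* order the store above ... */
    atomic_store_explicit( &ready, 1, memory_order_relaxed );
}

void consumer( void ) {
    if( atomic_load_explicit( &ready, memory_order_relaxed ) ) {
        atomic_thread_fence( memory_order_acquire );  /* ... before this read of payload */
        printf( "%d\n", payload );
    }
}

int main( void ) {
    publisher();    /* single-threaded here, just to make the sketch runnable */
    consumer();
    return 0;
}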

Now, getting back to the original topic. Lcm with 4096 simply means having twelve zero bits at the end. I'd code it like this:
$n <<= 1 while $n & 0xfff;

Update:

Um. That code doubles $n whilst any of its low 12 bits are set; the original code calls for halving it, at most 12 times?

And if I just switched the shift direction, an input $n of 0xffffffffff would be reduced to 0 before the loop terminated.

Ignore the above: I left the lcm = n * 4096; in, where (I assume) you meant to replace:

for( c = 1; ( ~n & 1 ) && ( c < 4096 ); c <<= 1 )
    n >>= 1;
lcm = n * 4096;

With:

while( n & 0xfff ) n <<= 1;
lcm = n;

Which works, but takes twice as long as the original version:

C:\test\C>gcm
gcm : 2132901888
gcm2: 2132901888
gcm3: 2132901888
anonyM: gcm  for s=2147483648 & r=1 to 1073741824 took:33.850023994460
anonyM: gcm2 for s=2147483648 & r=1 to 1073741824 took:46.293298113614
anonyM: gcm3 for s=2147483648 & r=1 to 1073741824 took:64.208097030422
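As an aside: to convince myself that while( n & 0xfff ) n <<= 1; really does compute lcm( n, 4096 ), here's a quick throwaway check of mine (not part of the benchmark) comparing it against the textbook gcd-based lcm:

#include <stdio.h>

typedef unsigned long long U64;

static U64 gcd( U64 a, U64 b ) {        /* Euclid's algorithm */
    while( b ) { U64 t = a % b; a = b; b = t; }
    return a;
}

int main( void ) {
    U64 samples[] = { 1, 6, 12, 4096, 12288, 1000000 };
    int i;
    for( i = 0; i < 6; ++i ) {
        U64 n = samples[ i ], lcm;
        while( n & 0xfff ) n <<= 1;     /* the loop under discussion */
        lcm = n;
        printf( "n=%7llu loop=%9llu gcd-based=%9llu\n",
            samples[ i ], lcm, samples[ i ] / gcd( samples[ i ], 4096 ) * 4096 );
    }
    return 0;
}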

/Update:

(Counting trailing zeroes can also be optimized.)

I thought about that last night and tried using the _BitScanForward64() intrinsic:

U64 gcm2( U64 max, U64 n ) {
    U32 b;
    U64 lcm;
    _BitScanForward64( &b, n );     /* b = index of lowest set bit, i.e. trailing-zero count */
    n >>= min( b, 12 );             /* strip gcd( n, 4096 ) = 2^min(b,12) */
    lcm = n * 4096;                 /* lcm of the original n and 4096 */
    return ( max / lcm ) * lcm;     /* greatest multiple of lcm <= max */
}

Which looked like it should be more efficient, compiling to this:

Rather than this:

But the reality turned out to be disappointing, at about 50% slower:

C:\test\C>gcm
anonyM: gcm  for s=2147483648 & r=1 to 1073741824 took: 33.92063749171 +5
anonyM: gcm2 for s=2147483648 & r=1 to 1073741824 took: 46.15194765908 +9
oiskuu: gcm  for s=2147483648 & r=1 to 1073741824 took:330.49201177311 +0
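For anyone wanting to try gcm2 outside my harness, a self-contained sketch follows. The U64/U32 typedefs and min() are my guesses at the missing declarations, and _BitScanForward64() is MSVC-only, so on gcc/clang a stand-in built from __builtin_ctzll() is substituted:

#include <stdio.h>

typedef unsigned long long U64;
typedef unsigned long U32;

#define min( a, b ) ( (a) < (b) ? (a) : (b) )

#ifdef _MSC_VER
#include <intrin.h>                     /* provides _BitScanForward64() */
#else
/* Stand-in with the same contract: set *index to the lowest set bit of mask. */
static unsigned char _BitScanForward64( U32 *index, U64 mask ) {
    if( !mask ) return 0;
    *index = (U32)__builtin_ctzll( mask );
    return 1;
}
#endif

U64 gcm2( U64 max, U64 n ) {            /* gcm2 exactly as quoted above */
    U32 b;
    U64 lcm;
    _BitScanForward64( &b, n );
    n >>= min( b, 12 );
    lcm = n * 4096;
    return ( max / lcm ) * lcm;
}

int main( void ) {
    /* greatest multiple of lcm( 12345, 4096 ) not exceeding 2**31 */
    printf( "%llu\n", gcm2( 2147483648ULL, 12345ULL ) );
    return 0;
}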
