Well that's a weird result BUK :) So I actually compiled your code (with some changes to make it suitable for gcc on linux). I was getting strange results too. 1500 nanoseconds for anonM algorithm, no matter what s was (at -O2). I'm curious why that happened. With some checksums to verify results the execution time became more realistic (with 1:10 bias towards bitshift stuff,
I understand your reaction. I had the same. I spent 4 hours trying to find something amiss. I failed.
If you look at the commented out verification code in what I posted above (with extra comments):
void main( int argc, char **argv ) { U64 s = argc > 1 ? _atoi64( argv[ 1 ] ) : 2*GB; U64 r, start, end, gsm; // U64 *res = malloc( sizeof(U64) * (s>>1) ); + //// An array to store every result from loop 1 (anonyM) start = __rdtsc(); for( r = 1; r < ( s >> 1 ); ++r ) { // res[ r ] = + //// Assign the result to the array gsm = gcm( s, r ); } end = __rdtsc(); printf( "anonyM: gcm for s=%I64u & r=1 to %I64u took:%.9f\n", s, s +>>1, (double)( end - start ) / 2394046359.0 ); start = __rdtsc(); for( r = 1; r < ( s >> 1 ); ++r ) { // if( res[ r ] != + //// If the result from the first loop, doesn't match the result fro +m this loop ( gsm = s - ( s % lcm( r, 4096 ) ) ) // ) printf( "%I64u != %I64u\n", gsm, res[ r ] ) + ///// Yell about it. ; } end = __rdtsc(); printf( "oiskuu: gcm for s=%I64u & r=1 to %I64u took:%.9f\n", s, s +>>1, (double)( end - start ) / 2394046359.0 ); return; }
Of course, the additional code slows things down and changes the timing completely, but not a single mismatch is found.
I can't think of a more comprehensive way to verify the algorithms than to directly compare their results for a full range of inputs?
but perhaps your compiler recognizes Euclid's algorithm?
I'll admit that I do not understand quite what the optimiser has done to the code.
The Euclidean gcd() function is pretty easy to recognise in the assembler output:
_TEXT SEGMENT a$ = 48 b$ = 56 gcd PROC ; 5 : U64 gcd( U64 a, U64 b ) { ; a in rcx, b +in rdx $LN5: sub rsp, 40 ; reserve some stack space mov r8, rdx ; put a copy of b i +n r8 ; 6 : return !b ? a : gcd( b, a % b ); mov rax, rcx ; put a in rax (res +ult register) test rdx, rdx ; test b je SHORT $LN4@gcd ; jump to return if +it is 0; ie. if b=0 return a xor edx, edx ; empty rdx ready f +or div mov rcx, r8 ; move b into rcx ( + ie. b becomes a for recursive call ) div r8 ; divide rdx:rax (0 +:a) by r8 (b); leaves quotient in rax, the remainder (modulus) in rdx + (thus b in the recursive call) call gcd ; recurse $LN4@gcd: ; result is in + rax. ; 7 : } add rsp, 40 ; clean up the stack ret 0 ; return popping 0 +bytes from the stack gcd ENDP _TEXT ENDS
Which is well and good, but the the code in the loop that (should be) calling this function is very confusing
Without the verification code, the compiler recognises that the value calculated within the loop is never used outside of it, and simply optimises the entire loop away:
; 35 : start = __rdtsc(); rdtsc shl rdx, 32 ; 00000020H or rax, rdx mov rcx, rax ; 36 : for( r = 1; r < ( s >> 1 ); ++r ) { ; 37 : ( gsm = s - ( s % lcm( r, 4096 ) ) ); ; 38 : } ; 39 : end = __rdtsc(); rdtsc shl rdx, 32 ; 00000020H
In other words, the times output are the time it takes to make two calls to _rdtsc().
In reply to Re^3: Simple arithmetic? (And the winner is ... )
by BrowserUk
in thread Simple arithmetic?
by BrowserUk
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |