in reply to Re: Simple arithmetic? (And the winner is ... )
in thread Simple arithmetic?

Well that's a weird result BUK :) So I actually compiled your code (with some changes to make it suitable for gcc on linux). I was getting strange results too. 1500 nanoseconds for anonM algorithm, no matter what s was (at -O2). I'm curious why that happened. With some checksums to verify results the execution time became more realistic (with 1:10 bias towards bitshift stuff, but perhaps your compiler recognizes Euclid's algorithm? gcc 4.7 doesn't, it seems :) I don't know C very well, though (so here it is)
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <inttypes.h> #include <time.h> typedef uint64_t U64; struct timespec diff_time( struct timespec end, struct timespec start ) { struct timespec res; res.tv_sec = end.tv_sec - start.tv_sec; res.tv_nsec = end.tv_nsec - start.tv_nsec; if ( res.tv_nsec < 0 ) { res.tv_nsec += 1000 * 1000 * 1000; res.tv_sec -= 1; } return res; } U64 gcd( U64 a, U64 b ) { return !b ? a : gcd( b, a % b ); } U64 lcm( U64 a, U64 b ) { return a * b / gcd( a, b ); } U64 gcm( U64 max, U64 n ) { U64 c, lcm; for( c = 1; ( ~n & 1 ) && ( c < 4096 ); c <<= 1 ) n >>= 1; lcm = n * 4096; return ( max / lcm ) * lcm; } #define GB ( 1024ull * 1024ull * 1024ull ) #define PAGE 4096 int main(int argc, char **argv) { U64 r, chk; U64 s = argc > 1 ? strtoull( argv[1], NULL, 10 ) : GB; struct timespec start, end, difft; clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &start ); for( chk = 0, r = 1; r < ( s >> 1 ); ++r ) { chk += gcm( s, r ); } clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &end ); difft = diff_time( end, start ); printf( "anonyM: gcm for s=%" PRIu64 " & r=1 to %" PRIu64 " took:%lld secs %ld nsecs\n" "chk = %" PRIu64 "\n", s, s>>1, (long long) difft.tv_sec, difft.tv_nsec, chk ); clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &start ); for( chk = 0, r = 1; r < ( s >> 1 ); ++r ) { chk += s - ( s % lcm( r, 4096 ) ); } clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &end ); difft = diff_time( end, start ); printf( "oiskuu: gcm for s=%" PRIu64 " & r=1 to %" PRIu64 " took:%lld secs %ld nsecs\n" "chk = %" PRIu64 "\n", s, s>>1, (long long) difft.tv_sec, difft.tv_nsec, chk ); return 0; }

Replies are listed 'Best First'.
Re^3: Simple arithmetic? (And the (real) winner is ... )
by BrowserUk (Patriarch) on Mar 08, 2015 at 23:00 UTC

    Anonymonk's implementation by an order of magnitude over oiskuu's:

    C:\test\C>gcm anonyM: gcm for s=2147483648 & r=1 to 1073741824 took: 33.607916866993 + oiskuu: gcm for s=2147483648 & r=1 to 1073741824 took:329.551997991197

    Thanks for the sanity check AnonyMonk!

    And the difference between this benchmark and the previous (unbelievable) one? One keyword:

    void main( int argc, char **argv ) { U64 s = argc > 1 ? _atoi64( argv[ 1 ] ) : 2*GB; U64 r, start, end; U64 volatile gsm; // ^^^^^^^^ *** Prevent the co +mpiler from optimising the loop away in its entirety *** D'oh! start = __rdtsc(); for( r = 1; r < ( s >> 1 ); ++r ) { gsm = gcm( s, r ); } end = __rdtsc(); printf( "anonyM: gcm for s=%I64u & r=1 to %I64u took:%.12f [%I64u] +\n", s, s>>1, (double)( end - start ) / 2394046359.0 ); start = __rdtsc(); for( r = 1; r < ( s >> 1 ); ++r ) { gsm = s - ( s % lcm( r, 4096 ) ); } end = __rdtsc(); printf( "oiskuu: gcm for s=%I64u & r=1 to %I64u took:%.12f [%I64u] +\n", s, s>>1, (double)( end - start ) / 2394046359.0 ); return; }

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      Wow. At one point I almost suspected something like that, and even tried to change the order of algorithms, but... that didn't change anything (for some reason), and why the hell the compiler would optimize away just one loop? That's really obscure (**** compilers, how do they work? :)
        why the hell the compiler would optimize away just one loop? That's really obscure

        I have absolutely no idea. None. Zip. Nada!

        Whilst x64 asm is still fairly new to me, and the syntax and opcodes are sufficiently different from x86 to make it difficult to read at times -- especially with the interleaving of opcodes to keep the pipelines busy -- but I've been inspecting the asm output from compilers long enough now that I can usually postulate a reason why they optimise things in a particular way; but this has me stumped.

        Why for two essentially similar loops, one would get optimised away and the other not...


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^3: Simple arithmetic? (And the winner is ... )
by BrowserUk (Patriarch) on Mar 08, 2015 at 22:47 UTC
    Well that's a weird result BUK :) So I actually compiled your code (with some changes to make it suitable for gcc on linux). I was getting strange results too. 1500 nanoseconds for anonM algorithm, no matter what s was (at -O2). I'm curious why that happened. With some checksums to verify results the execution time became more realistic (with 1:10 bias towards bitshift stuff,

    I understand your reaction. I had the same. I spent 4 hours trying to find something amiss. I failed.

    If you look at the commented out verification code in what I posted above (with extra comments):

    void main( int argc, char **argv ) { U64 s = argc > 1 ? _atoi64( argv[ 1 ] ) : 2*GB; U64 r, start, end, gsm; // U64 *res = malloc( sizeof(U64) * (s>>1) ); + //// An array to store every result from loop 1 (anonyM) start = __rdtsc(); for( r = 1; r < ( s >> 1 ); ++r ) { // res[ r ] = + //// Assign the result to the array gsm = gcm( s, r ); } end = __rdtsc(); printf( "anonyM: gcm for s=%I64u & r=1 to %I64u took:%.9f\n", s, s +>>1, (double)( end - start ) / 2394046359.0 ); start = __rdtsc(); for( r = 1; r < ( s >> 1 ); ++r ) { // if( res[ r ] != + //// If the result from the first loop, doesn't match the result fro +m this loop ( gsm = s - ( s % lcm( r, 4096 ) ) ) // ) printf( "%I64u != %I64u\n", gsm, res[ r ] ) + ///// Yell about it. ; } end = __rdtsc(); printf( "oiskuu: gcm for s=%I64u & r=1 to %I64u took:%.9f\n", s, s +>>1, (double)( end - start ) / 2394046359.0 ); return; }

    Of course, the additional code slows things down and changes the timing completely, but not a single mismatch is found.

    I can't think of a more comprehensive way to verify the algorithms than to directly compare their results for a full range of inputs?

    but perhaps your compiler recognizes Euclid's algorithm?

    I'll admit that I do not understand quite what the optimiser has done to the code.

    The Euclidean gcd() function is pretty easy to recognise in the assembler output:

    _TEXT SEGMENT a$ = 48 b$ = 56 gcd PROC ; 5 : U64 gcd( U64 a, U64 b ) { ; a in rcx, b +in rdx $LN5: sub rsp, 40 ; reserve some stack space mov r8, rdx ; put a copy of b i +n r8 ; 6 : return !b ? a : gcd( b, a % b ); mov rax, rcx ; put a in rax (res +ult register) test rdx, rdx ; test b je SHORT $LN4@gcd ; jump to return if +it is 0; ie. if b=0 return a xor edx, edx ; empty rdx ready f +or div mov rcx, r8 ; move b into rcx ( + ie. b becomes a for recursive call ) div r8 ; divide rdx:rax (0 +:a) by r8 (b); leaves quotient in rax, the remainder (modulus) in rdx + (thus b in the recursive call) call gcd ; recurse $LN4@gcd: ; result is in + rax. ; 7 : } add rsp, 40 ; clean up the stack ret 0 ; return popping 0 +bytes from the stack gcd ENDP _TEXT ENDS

    Which is well and good, but the the code in the loop that (should be) calling this function is very confusing

    Realisation dawns!

    Without the verification code, the compiler recognises that the value calculated within the loop is never used outside of it, and simply optimises the entire loop away:

    ; 35 : start = __rdtsc(); rdtsc shl rdx, 32 ; 00000020H or rax, rdx mov rcx, rax ; 36 : for( r = 1; r < ( s >> 1 ); ++r ) { ; 37 : ( gsm = s - ( s % lcm( r, 4096 ) ) ); ; 38 : } ; 39 : end = __rdtsc(); rdtsc shl rdx, 32 ; 00000020H

    In other words, the times output are the time it takes to make two calls to _rdtsc().


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
    _