in reply to Re^3: The 10**21 Problem (Part 3)
in thread The 10**21 Problem (Part 3)
I ran your code as is and it took 47 seconds, ten seconds slower. I then changed:

    _mm_prefetch(&bytevecM[i], _MM_HINT_T0);
    _mm_prefetch(&bytevecM[i^64], _MM_HINT_T0);

back to my original:

    _mm_prefetch(&bytevecM[(unsigned int)(i) & 0xffffff80], _MM_HINT_T0);
    _mm_prefetch(&bytevecM[64+((unsigned int)(i) & 0xffffff80)], _MM_HINT_T0);

and it ran in 38 seconds, only one second slower. Note that the &0xffffff80 aligns on a 64 byte boundary while ensuring we get the two 64 byte cache lines required for the inner loop.
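For reference, here is a minimal, self-contained sketch of that aligned two-line prefetch, assuming the layout above (the wrapper function name is mine; bytevecM, the mask and the intrinsics are from the code shown). Masking with 0xffffff80 clears the low seven bits, rounding i down to a 128-byte block boundary (hence also 64-byte aligned), so the two prefetches pull in both cache lines the inner loop will touch:

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

    /* Sketch only: prefetch the two 64-byte cache lines covering
       bytevecM[i & ~0x7f] .. bytevecM[(i & ~0x7f) + 127]. */
    static void prefetch_two_lines(const unsigned char *bytevecM, unsigned int i)
    {
        unsigned int base = i & 0xffffff80;   /* round down to 128-byte block */
        _mm_prefetch((const char *)&bytevecM[base],      _MM_HINT_T0);  /* first cache line  */
        _mm_prefetch((const char *)&bytevecM[base + 64], _MM_HINT_T0);  /* second cache line */
    }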
I profiled with VTune and both my (37 second) and your (38 second) solutions showed up as having two (seven second) hotspots -- presumably due to memory latency -- in the same places, namely here:

    ; 100 :                 UNROLL(q8)
    0x1400028e0  Block 178:
    0x1400028e0    mov eax, r9d                            7.217s
    0x1400028e3    xor rax, rdi                            0.060s
    0x1400028e6    movzx r10d, byte ptr [rax+rsi*1]        0.100s
    0x1400028eb    test r10d, r10d                         2.508s
    0x1400028ee    jz 0x140002a0b <Block 192>

and here:

    ; 99 :              for (q8 = 14; q8 < 128; ++q8) {
    0x140002a0b  Block 192:
    0x140002a0b    inc r9d                                 7.008s
    0x140002a0e    cmp r9d, 0x80                           0.690s
    0x140002a15    jl 0x1400028e0 <Block 178>
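To relate those hotspots back to the C source, here is a hedged sketch of the loop shape -- my reconstruction from the "; 99"/"; 100" annotations and the mov/xor/movzx/test sequence, not the original program. Here partial is a hypothetical name for the value held in rdi, rsi holds the base of bytevecM, and the deeper unrolled checks are elided:

    /* Reconstruction, not the original source: Block 178 is the body of
       UNROLL(q8) -- XOR the loop counter into a partial key, load one byte
       of bytevecM, and skip ahead when it is zero; Block 192 is the loop
       increment and bound check. */
    #define UNROLL(q) do {                                                  \
            unsigned long long idx = (unsigned long long)(q) ^ partial;     \
            if (bytevecM[idx]) {          /* movzx + test; jz skips zero */ \
                /* ... deeper unrolled checks elided ... */                 \
            }                                                               \
        } while (0)

    static void scan_q8(const unsigned char *bytevecM, unsigned long long partial)
    {
        for (int q8 = 14; q8 < 128; ++q8) {   /* source line 99  */
            UNROLL(q8);                       /* source line 100 */
        }
    }
    #undef UNROLL

On this reading, the two seven-second entries sit on the byte load from bytevecM and on the loop increment right after it, which is what you would expect if those loads are missing cache -- consistent with the memory-latency explanation above.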