The x86 `loop` instruction is slow; never use it. See this Stack Overflow post about LOOP being slow: https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently
Write your function in C, using GNU C's __builtin_popcountll, and compile with -march=native (so the compiler can use the popcnt instruction if your CPU supports it).
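A minimal C sketch of that approach (the function name and signature here are my own, just for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Count the set bits in a buffer of 64-bit words.
 * Build with e.g.:  gcc -O3 -march=native -c popcount.c
 * so __builtin_popcountll can compile to a hardware popcnt instruction. */
uint64_t popcount_buf(const uint64_t *buf, size_t nwords)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t)__builtin_popcountll(buf[i]);
    return total;
}
```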
Or, if you really want NASM: your function only works on buffers that are a whole number of 64-bit chunks anyway, so I changed the signature to int64_t*. The count is still a byte count, though.
; Untested (check for off-by-one array indexing)
; Probably about 3x as fast as the above for large arrays
; on Intel Sandybridge-family.  Unrolling by 2 or so would help.
; prototype: unsigned long popcount(int64_t* rdi, int64_t rsi);
; requires byte count in rsi >= 8, and a multiple of 8.
ALIGN 16
global popcount
popcount:
    add     rdi, rsi            ; rdi = end of buffer
    neg     rsi                 ; negative index counts up to zero
    xor     rax, rax
;do {
.cnt:
    mov     rdx, [rdi+rsi]      ; mov-load breaks the dependency
    popcnt  rdx, rdx            ; on rdx from last iteration.
    add     rax, rdx
    add     rsi, 8              ; index += 8
    jl      .cnt                ; }while(index<0)
    ret
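For reference, a hypothetical C caller for that routine might look like this (the file names and build commands are assumptions, e.g. nasm -f elf64 popcount.asm, then gcc main.c popcount.o):

```c
#include <stdint.h>
#include <stdio.h>

/* Prototype matching the NASM routine above: pointer in rdi, byte count in rsi. */
unsigned long popcount(int64_t *buf, int64_t bytes);

int main(void)
{
    int64_t data[4] = { 1, 3, 7, 15 };              /* 4 * 8 = 32 bytes */
    /* byte count must be >= 8 and a multiple of 8 */
    printf("%lu\n", popcount(data, sizeof(data)));  /* prints 10 */
    return 0;
}
```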
popcnt on Intel Sandybridge-family has a false dependency on the destination. Using a separate load avoids creating a loop-carried dependency on the 3-cycle latency of popcnt.
Using an indexed addressing mode saves an instruction inside the loop. Checking the size and using an unrolled loop for large buffers would gain more speed, especially on AMD Ryzen, where popcnt has 4-per-clock throughput (so this loop will bottleneck on the front end instead of on 2 loads per clock).
It's 4 fused-domain uops on Intel Sandybridge-family, though (since add / jl can macro-fuse), so it can run at about 1 iteration per clock. It should be at least 3x faster than the old version, which bottlenecks on the popcnt false dependency on Sandybridge-family, or on the LOOP instruction's front-end throughput. (The old version would actually run OK on AMD Bulldozer-family, though, where LOOP is fast and POPCNT has no false dependency.)
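A C-level sketch of the unrolling idea, using two separate accumulators so the popcnt dependency chains can overlap; whether the compiler actually keeps two independent chains depends on the compiler and flags, so treat this as an assumption:

```c
#include <stdint.h>
#include <stddef.h>

/* Unroll by 2 with separate accumulators; assumes nwords is a multiple of 2. */
uint64_t popcount_buf_unrolled(const uint64_t *buf, size_t nwords)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < nwords; i += 2) {
        a += (uint64_t)__builtin_popcountll(buf[i]);
        b += (uint64_t)__builtin_popcountll(buf[i + 1]);
    }
    return a + b;
}
```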
If you didn't know all this stuff, you should have written your function in C and let the compiler get it right. Or read Agner Fog's optimization guide, instruction tables, and microarch guide (http://agner.org/optimize/). Also, Clang would auto-vectorize for you with AVX2, using vpshufb to count faster (on Haswell and later) than you can with popcnt. By all means look at the C compiler output to make sure you got what you wanted, though.
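For illustration, here is a hand-written AVX2 version of that vpshufb nibble-LUT trick. It is roughly the shape of what an auto-vectorizer produces, not Clang's exact output; the function name is hypothetical, and it assumes -mavx2 and a byte count that is a multiple of 32:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Popcount via a 16-entry nibble lookup table in a vector register (vpshufb),
 * with per-iteration horizontal byte sums via vpsadbw.  Assumes len % 32 == 0. */
uint64_t popcount_avx2(const uint8_t *buf, size_t len)
{
    const __m256i lut = _mm256_setr_epi8(
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i acc = _mm256_setzero_si256();

    for (size_t i = 0; i < len; i += 32) {
        __m256i v   = _mm256_loadu_si256((const __m256i *)(buf + i));
        __m256i lo  = _mm256_and_si256(v, low_mask);
        __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                      _mm256_shuffle_epi8(lut, hi));
        /* Sum the per-byte counts within each 64-bit lane and accumulate. */
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
    }

    /* Sum the four 64-bit lanes of the accumulator. */
    __m128i lo128 = _mm256_castsi256_si128(acc);
    __m128i hi128 = _mm256_extracti128_si256(acc, 1);
    __m128i sum   = _mm_add_epi64(lo128, hi128);
    return (uint64_t)_mm_cvtsi128_si64(sum) + (uint64_t)_mm_extract_epi64(sum, 1);
}
```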
Peter Cordes (https://stackoverflow.com/users/224132/peter-cordes)
In reply to Re^2: Bitwise operators on binary objects by PeterCordes, in thread Bitwise operators on binary objects by albert