The x86 `loop` instruction is slow; never use it. See this Stack Overflow post about LOOP being slow: https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently
Write your function in C, using GNU C's __builtin_popcountll, and compile with -march=native (so the compiler can use the popcnt instruction if your CPU supports it).
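A minimal C sketch of that approach (the function name and signature here are my own, just for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Count the set bits in a buffer of 64-bit words.
 * Build with e.g.:  gcc -O3 -march=native -c popcount.c
 * so __builtin_popcountll can compile to a hardware popcnt instruction. */
uint64_t popcount_buf(const uint64_t *buf, size_t nwords)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t)__builtin_popcountll(buf[i]);
    return total;
}
```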
Or, if you really want NASM: your function only works on buffers that are a whole number of 64-bit chunks anyway, so I changed the signature to int64_t*. The count is still a byte count, though.
; Untested (check for off-by-one array indexing)
; Probably about 3x as fast as the above for large arrays
; on Intel Sandybridge-family.  Unrolling by 2 or so would help.
; prototype: unsigned long popcount(int64_t* rdi, int64_t rsi);
; requires byte count in rsi >= 8, and a multiple of 8.
ALIGN 16
global popcount
popcount:
    add     rdi, rsi            ; rdi = end of buffer
    neg     rsi                 ; negative index counts up to zero
    xor     rax, rax
;do {
.cnt:
    mov     rdx, [rdi+rsi]      ; mov-load breaks the dependency
    popcnt  rdx, rdx            ; on rdx from last iteration.
    add     rax, rdx
    add     rsi, 8              ; index += 8
    jl      .cnt                ; }while(index<0)
    ret
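For reference, a hypothetical C caller for that routine might look like this (the file names and build commands are assumptions, e.g. nasm -f elf64 popcount.asm, then gcc main.c popcount.o):

```c
#include <stdint.h>
#include <stdio.h>

/* Prototype matching the NASM routine above: pointer in rdi, byte count in rsi. */
unsigned long popcount(int64_t *buf, int64_t bytes);

int main(void)
{
    int64_t data[4] = { 1, 3, 7, 15 };              /* 4 * 8 = 32 bytes */
    /* byte count must be >= 8 and a multiple of 8 */
    printf("%lu\n", popcount(data, sizeof(data)));  /* prints 10 */
    return 0;
}
```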
popcnt on Intel Sandybridge-family has a false dependency on the destination. Using a separate load avoids creating a loop-carried dependency on the 3-cycle latency of popcnt.
Using an indexed addressing mode saves an instruction inside the loop. Checking the size and using an unrolled loop for large buffers would gain more speed, especially on AMD Ryzen, where popcnt has 4-per-clock throughput (so this loop will bottleneck on the front end instead of on 2 loads per clock).
It's 4 fused-domain uops on Intel Sandybridge-family, though (since add / jl can macro-fuse), so it can run at about 1 iteration per clock. It should be at least 3x faster than the old version, which bottlenecks on the popcnt false dependency on Sandybridge-family, or on the LOOP instruction's front-end throughput. (The old version would actually run OK on AMD Bulldozer-family, though, where LOOP is fast and POPCNT has no false dependency.)
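A C-level sketch of the unrolling idea, using two separate accumulators so the popcnt dependency chains can overlap; whether the compiler actually keeps two independent chains depends on the compiler and flags, so treat this as an assumption:

```c
#include <stdint.h>
#include <stddef.h>

/* Unroll by 2 with separate accumulators; assumes nwords is a multiple of 2. */
uint64_t popcount_buf_unrolled(const uint64_t *buf, size_t nwords)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < nwords; i += 2) {
        a += (uint64_t)__builtin_popcountll(buf[i]);
        b += (uint64_t)__builtin_popcountll(buf[i + 1]);
    }
    return a + b;
}
```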
If you didn't know all this stuff, you should have written your function in C and let the compiler get it right. Or read Agner Fog's optimization guide, instruction tables, and microarch guide (http://agner.org/optimize/). Also, Clang would auto-vectorize for you with AVX2, using vpshufb to count faster (on Haswell and later) than you can with popcnt. By all means look at the C compiler output to make sure you got what you wanted, though.
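For illustration, here is a hand-written AVX2 version of that vpshufb nibble-LUT trick. It is roughly the shape of what an auto-vectorizer produces, not Clang's exact output; the function name is hypothetical, and it assumes -mavx2 and a byte count that is a multiple of 32:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Popcount via a 16-entry nibble lookup table in a vector register (vpshufb),
 * with per-iteration horizontal byte sums via vpsadbw.  Assumes len % 32 == 0. */
uint64_t popcount_avx2(const uint8_t *buf, size_t len)
{
    const __m256i lut = _mm256_setr_epi8(
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
        0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);
    __m256i acc = _mm256_setzero_si256();

    for (size_t i = 0; i < len; i += 32) {
        __m256i v   = _mm256_loadu_si256((const __m256i *)(buf + i));
        __m256i lo  = _mm256_and_si256(v, low_mask);
        __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                      _mm256_shuffle_epi8(lut, hi));
        /* Sum the per-byte counts within each 64-bit lane and accumulate. */
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
    }

    /* Sum the four 64-bit lanes of the accumulator. */
    __m128i lo128 = _mm256_castsi256_si128(acc);
    __m128i hi128 = _mm256_extracti128_si256(acc, 1);
    __m128i sum   = _mm_add_epi64(lo128, hi128);
    return (uint64_t)_mm_cvtsi128_si64(sum) + (uint64_t)_mm_extract_epi64(sum, 1);
}
```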
Peter Cordes (https://stackoverflow.com/users/224132/peter-cordes)
In reply to Re^2: Bitwise operators on binary objects by PeterCordes, in thread Bitwise operators on binary objects by albert