in reply to Bitwise operators on binary objects

Newer processors have a popcnt ("population count") instruction, which counts the 1-bits in up to 64 bits with a single fast CPU instruction.  So if you want real speed at the cost of portability, you can hack up a little assembly1,2.

A quick test shows that this is roughly 10 times as fast as the above-mentioned unpack "%32b*".

#!/usr/bin/perl -wl
use strict;

use Inline C => Config => LDDLFLAGS => '-shared ../../../popcount.o';
use Inline "C";

use Benchmark "cmpthese";

my $s = pack "C*", map { int rand 256 } 1 .. 2**18;    # 256k

#print unpack q/%32b*/, $s;
#print count_bits($s);

cmpthese (-2, {
    unpack => sub { my $n = unpack q/%32b*/, $s; },
    popcnt => sub { my $n = count_bits($s); },
});

__END__
__C__
extern unsigned long popcount(char *, int64_t);    // the asm routine

unsigned long count_bits (SV* bitstr)              // XS wrapper
{
    STRLEN l;
    char *s = SvPV(bitstr, l);
    return popcount(s, l);
}
          Rate unpack popcnt
unpack  1252/s     --   -91%
popcnt 13581/s   984%     --

Tested with

$ cat /proc/cpuinfo | grep name | head -1
model name      : AMD Phenom(tm) 9600 Quad-Core Processor

___

1 The assembly code (NASM syntax; 64-bit; the register calling convention used by gcc on x86-64; compile with nasm -f elf64 popcount.asm):

bits 64
global popcount
section .text

; prototype: unsigned long popcount(char*, int64_t);
popcount:
        push    rbp
        mov     rbp, rsp
        push    rbx               ; rbx is callee-saved in the x86-64 ABI
        mov     rcx, rsi          ; byte count ...
        shr     rcx, 3            ; ... /8 = number of 64-bit words
        mov     rsi, rdi          ; pointer to the bit string
        xor     rax, rax          ; running total
.cnt:
        popcnt  rbx, [rsi]        ; count the 1-bits in the next qword
        add     rax, rbx
        add     rsi, 8
        loop    .cnt
        pop     rbx
        mov     rsp, rbp
        pop     rbp
        ret

(Caveat: as this is just proof-of-concept code, the string length is assumed to be a multiple of 8 bytes (64 bits), so you might need to zero-pad it.)
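If zero-padding the Perl string is inconvenient, the XS wrapper could instead count the aligned prefix with the asm routine and finish the trailing bytes in plain C. This is only a sketch of that idea, not code from the post; count_bits_any_len and nibble_bits are made-up names:

extern unsigned long popcount(char *, int64_t);     /* the asm routine above */

/* number of 1-bits in each possible nibble value */
static const unsigned char nibble_bits[16] =
    { 0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4 };

unsigned long count_bits_any_len (char *s, unsigned long l)
{
    unsigned long full = l & ~7UL;             /* largest multiple of 8 <= l */
    unsigned long n    = full ? popcount(s, full) : 0;
    unsigned long i;
    for (i = full; i < l; i++) {               /* the remaining 0..7 tail bytes */
        unsigned char b = (unsigned char)s[i];
        n += nibble_bits[b & 0xf] + nibble_bits[b >> 4];
    }
    return n;
}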

2 In theory, there is also the gcc __builtin_popcount function, but in practice it turns out to be less portable than a piece of assembly code.

Re^2: Bitwise operators on binary objects
by PeterCordes (Initiate) on Sep 22, 2017 at 09:31 UTC

    The x86 `loop` instruction is slow; never use it. See this Stack Overflow post about LOOP being slow: https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently

    Write your function in C, using GNU C __builtin_popcountll, and compile with -march=native (to use the popcnt instruction if your CPU supports it).
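    A minimal sketch of that portable-C version (the function name and signature are mine, not from this thread); with -march=native or -mpopcnt, gcc and clang turn the builtin into a single popcnt instruction, and without it they fall back to portable code:

        #include <stdint.h>
        #include <stddef.h>

        /* sum the set bits of nwords 64-bit words */
        unsigned long popcount_words (const uint64_t *buf, size_t nwords)
        {
            unsigned long total = 0;
            size_t i;
            for (i = 0; i < nwords; i++)
                total += __builtin_popcountll(buf[i]);
            return total;
        }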

    Or if you really want NASM, your function only works if used with a whole number of 64-bit chunks, so I changed the signature to int64_t*. The count is still a byte-count, though.

    ; Untested (check for off-by-one array indexing)
    ; Probably about 3x as fast as the above for large arrays
    ; on Intel Sandybridge-family.  Unrolling by 2 or so would help.
    ; prototype: unsigned long popcount(int64_t* rdi, int64_t rsi);
    ; requires byte count in rsi >= 8, and a multiple of 8.
    ALIGN 16
    global popcount
    popcount:
            add     rdi, rsi          ; rdi = end of buffer
            neg     rsi               ; negative index counts up to zero
            xor     rax, rax
    .cnt:                             ; do {
            mov     rdx, [rdi+rsi]    ;   mov-load breaks the dependency
            popcnt  rdx, rdx          ;   on rdx from last iteration.
            add     rax, rdx
            add     rsi, 8            ;   index += 8
            jl      .cnt              ; } while (index < 0)
            ret

    popcnt on Intel Sandybridge-family has a false dependency on the destination. Using a separate load avoids creating a loop-carried dependency on the 3-cycle latency of popcnt.

    Using an indexed addressing mode saves an instruction inside the loop. Checking the count and using an unrolled loop would gain speed, especially on AMD Ryzen where popcnt has 4 per clock throughput (so this will bottleneck on the front-end instead of on 2 loads per clock).

    It's 4 fused-domain uops on Intel Sandybridge-family, though (since add / jl can macro-fuse), so it can run at about 1 per clock. It should be at least 3x faster than the old version, which bottlenecks on the popcnt false dependency on Sandybridge-family, or on the LOOP instruction front-end throughput. (Although the old version would actually run ok on AMD Bulldozer-family, where LOOP is fast and POPCNT has no false dependency).

    If you didn't know all this stuff, you should have written your function in C and let the compiler get it right. Or read Agner Fog's optimization guide, instruction tables, and microarch guide (http://agner.org/optimize/). Also, Clang would auto-vectorize for you with AVX2, using vpshufb to count faster (on Haswell and later) than you can with popcnt. By all means look at the C compiler output to make sure you got what you wanted, though.

    Peter Cordes (https://stackoverflow.com/users/224132/peter-cordes)

Re^2: Bitwise operators on binary objects
by albert (Monk) on Feb 01, 2011 at 11:35 UTC
    Assembly is beyond me. This is a cool suggestion, but the processor on my machine isn't new enough to have this instruction. Plus, I'm fairly sure I have other significant bottlenecks in my code, so I'm not sure I'd benefit nearly as much from moving part of it to assembly. Still, a very nice idea.