in reply to Weird performance issue with Strawberries and Inline::C

Here's more on compiler voodoo, though it involves a very trivial problem and primitive code; it was either publish it (it is Perl-related, after all) or throw it away. PWC 342-2 was all about finding a maximum (or an extremum, depending on your point of view) over cumulative sums. In my pure-Perl solution (below) this is camouflaged by the use of a regex; it is the fastest (?) entry over the PWC GitHub directory and reasonably readable. At the same time I wrote a PDL solution (rather, a Perl/PDL combo), which is of course faster and arguably no less readable.

Then (two weeks ago?), because I was confused about OOK and 4-argument substr (not used below; the confusion has since cleared, thanks to the answers to another SoPW question), I wrote a "straightforward" C solution, followed by a version tweaked through simple arithmetic -- "c_better".

Further yet, I tried a solution in yet another language -- Fortran -- pluggable directly into Perl's Benchmark harness thanks to an Inline::C::Cookbook recipe. The c2f subroutine is an almost exact, line-by-line translation of c_better to Fortran, and yet it's _faster_ -- I don't know why -- by the same margin whether compiled with -O2 or -O3, and despite the C wrapper used to call it. (For the record, though, the "f" subroutines require -O3 to be fastest.)

Then, inspired by this success (?), I tried a direct, if perhaps mechanical, translation of the PDL solution to Fortran. The latter has no cumulative-sums intrinsic, so that part had to be an explicit loop. Unlike my C and "c2f" solutions, it performs several passes over the input data (create a mask, count, transliterate, compute cumulative sums, find a minimum) and stores intermediate arrays. And yet it is the _fastest_ for 1e4 and 1e5 input.

use strict; use warnings;
use feature 'say';
use Config;
use Benchmark 'cmpthese';
use lib '.';
use Max_score_subs ':all';

say "$^V / $Config{ archname } / $Config{ gccversion }";

my $str;
my %tests = (
    perl     => sub { mscore_perl( $str )},
    pdl      => sub { mscore_pdl( $str )},
    c        => sub { mscore_c( $str )},
    c_better => sub { mscore_c_better( $str )},
    c2f      => sub { mscore_c2f( $str )},
    f        => sub { mscore_f( $str )},
);

for my $L ( 1e3, 1e4, 1e5, 1e6, 1e7, 1e8 ) {
    say "\nString length: " . $L =~ s/(\d)(?=(\d{3})+$)/$1,/gr;
    $str = '1' x $L;
    substr $str, rand $L, 1, '0' for 1 .. $L;
    delete $tests{ perl } if $L > 1e6;
    delete $tests{ pdl }  if $L > 1e7;
    cmpthese -1, \%tests;
}

__END__
v5.42.0 / MSWin32-x64-multi-thread / 13.2.0

String length: 1,000
               Rate    perl    pdl      c c_better      f    c2f
perl         6637/s      --   -90%   -99%     -99%   -99%   -99%
pdl         67389/s    915%     --   -91%     -92%   -92%   -93%
c          764229/s  11415%  1034%     --      -8%   -13%   -16%
c_better   830711/s  12416%  1133%     9%       --    -5%    -9%
f          876596/s  13108%  1201%    15%       6%     --    -4%
c2f        911910/s  13640%  1253%    19%      10%     4%     --

String length: 10,000
               Rate    perl    pdl      c c_better    c2f      f
perl          635/s      --   -94%   -99%     -99%   -99%   -99%
pdl         11113/s   1651%     --   -87%     -88%   -89%   -90%
c           86164/s  13476%   675%     --      -9%   -18%   -26%
c_better    94730/s  14826%   752%    10%       --    -9%   -19%
c2f        104655/s  16390%   842%    21%      10%     --   -10%
f          116531/s  18261%   949%    35%      23%    11%     --

String length: 100,000
               Rate    perl    pdl      c c_better    c2f      f
perl         63.1/s      --   -95%   -99%     -99%   -99%   -99%
pdl          1161/s   1741%     --   -87%     -88%   -89%   -91%
c            8934/s  14061%   669%     --      -8%   -18%   -28%
c_better     9692/s  15262%   734%     8%       --   -11%   -22%
c2f         10925/s  17216%   841%    22%      13%     --   -12%
f           12350/s  19475%   963%    38%      27%    13%     --

String length: 1,000,000
               Rate    perl    pdl      f      c c_better    c2f
perl         6.31/s      --   -94%   -99%   -99%     -99%   -99%
pdl           104/s   1554%     --   -80%   -88%     -89%   -90%
f             535/s   8370%   412%     --   -38%     -45%   -51%
c             865/s  13600%   728%    62%     --     -11%   -20%
c_better      968/s  15243%   828%    81%    12%       --   -11%
c2f          1085/s  17095%   940%   103%    26%      12%     --

String length: 10,000,000
               Rate    pdl      f      c c_better    c2f
pdl          10.1/s     --   -76%   -89%     -90%   -90%
f            41.5/s   311%     --   -53%     -57%   -61%
c            88.7/s   778%   114%     --      -8%   -16%
c_better     96.4/s   854%   132%     9%       --    -9%
c2f           106/s   947%   155%    19%      10%     --

String length: 100,000,000
               Rate      f      c c_better    c2f
f            4.39/s     --   -51%     -54%   -58%
c            8.87/s   102%     --      -7%   -14%
c_better     9.55/s   118%     8%       --    -8%
c2f          10.3/s   136%    17%       8%     --

Max_score_subs.pm:

package Max_score_subs;
use strict; use warnings;
use feature 'say';
use PDL;
use Exporter 'import';

our %EXPORT_TAGS = ( all => [ qw/
    mscore_perl mscore_pdl mscore_c
    mscore_c_better mscore_c2f mscore_f
/ ], );
Exporter::export_ok_tags( 'all' );

sub mscore_perl {
    my $s = shift;
    my $max = '0' eq substr $s, 0, 1;
    $s = substr $s, 1;
    $max += $s =~ tr/1//;
    chop $s;
    my $score = $max;
    while ( $s =~ /(0*)(1*)/g ) {
        $score += length $1;
        $max = $score if $score > $max;
        $score -= length $2
    }
    return $max
}

sub mscore_pdl {
    my $s = shift;
    my $count = $s =~ tr/1//;
    chop $s;
    $s =~ tr/01/\xff\1/;
    my $p = zeroes sbyte, length $s;
    ${ $p-> get_dataref } = $s;
    $p-> upd_data;
    $count - $p-> cumusumover-> minover-> sclr
}

use FindBin;
use Inline Config => force_build => 1;
use Inline C => Config =>
    libs => "-L$FindBin::Bin -lmscore_f -lgfortran -lquadmath";
use Inline C => << 'END_OF_C';

extern void c2f_( void* n, void* s, void* ret );
extern void f_( void* n, void* s, void* ret );

int mscore_c( SV* str ) {
    int i, acc, score, max, one;
    STRLEN len;
    char* buf = SvPV( str, len );
    len --;
    acc = 0; score = 0; max = -1;
    for( i = 0; i < len; i ++ ) {
        one = buf[ i ] - '0';
        acc += one;
        score += one ? -1 : 1;
        max = score > max ? score : max;
    }
    return max + acc + ( buf[ i ] == '1' );
}

int mscore_c_better( SV* str ) {
    int i, acc, tmp, max;
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    len --;
    acc = 0; max = -2;
    for( i = 0; i < len; i ++ ) {
        acc += buf[ i ] - '0';
        tmp = i - 2 * acc;
        max = tmp > max ? tmp : max;
    }
    return max + acc + ( buf[ len ] == '1' ) + 1;
}

int mscore_c2f( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    int ret;
    c2f_( &len, buf, &ret );
    return ret;
}

int mscore_f( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    int ret;
    f_( &len, buf, &ret );
    return ret;
}
END_OF_C

1;

mscore_f.f90:

subroutine c2f( n, s, ret )
    integer, intent( in ) :: n
    integer * 1, intent( in ) :: s( n )
    integer, intent( out ) :: ret
    integer :: i, acc, mx, tmp

    acc = 0
    mx = -2
    do i = 1, n - 1
        acc = acc + ( s( i ) - iachar( '0', 1 ))
        tmp = i - 1 - 2 * acc
        mx = max( mx, tmp )
    end do
    ret = acc + mx + ( s( n ) - iachar( '0', 1 )) + 1
end subroutine c2f

subroutine f( n, s, ret )
    integer, intent( in ) :: n
    integer * 1, intent( in ) :: s( n )
    integer, intent( out ) :: ret
    integer, parameter :: A97 = 2 * iachar( '0', 1 ) + 1 ! 97
    integer :: i
    integer * 1 :: a( n )   ! temporary pad
    integer :: cum( n )     ! cumulative sums

    ret = count( s == iachar( '1', 1 ))  ! tr/1//
    a = 2 * s - A97                      ! tr/01/\xff\1/r
    a( n ) = 0                           ! "chop", kind of
    cum( 1 ) = a( 1 )
    do i = 2, n
        cum( i ) = cum( i - 1 ) + a( i )
    end do
    ret = ret - minval( cum )
end subroutine f

The jitter among the fast contestants for very short (1e3) input is perhaps within the margin of error; but what about "f" getting slower at 1e6 and above? I can only assume that memory for arrays declared within a subroutine is allocated per call (?? I don't know); this of course doesn't apply to dummy arrays, i.e. subroutine arguments, which are passed by reference. The scratch-pads, i.e. "a" and "cum", were declared with size "n" rather than an explicit constant; does that matter at all?

Anyway, it then occurred to me that the scratch-pads can be much smaller and re-used many times -- both within a single call (the cumulative sums and minimums can be done in quite a few steps, following the sequential nature of the problem all the same) and between calls, by declaring them as module variables, i.e. re-writing "mscore_f.f90" as a module, per modern fashion:

module mscore_f
    implicit none
    private
    integer, parameter :: A97 = 2 * iachar( '0', 1 ) + 1 ! 97
    integer, parameter :: CHUNK = 25000
    integer * 1 :: a( CHUNK )  ! scratch-pad #1
    integer :: cum( CHUNK )    ! scratch-pad #2
    public :: f, c2f
contains
    subroutine c2f( n, s, ret )
        ! skipped, the same code
    end subroutine c2f

    subroutine f( n, s, ret )
        integer, intent( in ) :: n
        integer * 1, intent( in ) :: s( n )
        integer, intent( out ) :: ret
        integer :: i, j, m, pre, L
        integer :: rprev  ! "running previous"
        integer :: rmin   ! "running minimum"

        ret = count( s == iachar( '1', 1 ))
        m = n / CHUNK
        if ( mod( n, CHUNK ) .ne. 0 ) then
            m = m + 1
        end if
        rprev = 0
        rmin = 2
        do j = 1, m
            pre = ( j - 1 ) * CHUNK
            L = min( CHUNK, ubound( s, 1 ) - pre - 1 )
            a( 1 : L ) = 2 * s( pre + 1 : pre + L ) - A97
            cum( 1 ) = rprev + a( 1 )
            do i = 2, L
                cum( i ) = cum( i - 1 ) + a( i )
            end do
            rmin = min( rmin, minval( cum( 1 : L )))
            rprev = cum( L )
        end do
        ret = ret - rmin
    end subroutine f
end module mscore_f

And within Max_score_subs.pm:

extern void __mscore_f_MOD_c2f( void* n, void* s, void* ret );
extern void __mscore_f_MOD_f( void* n, void* s, void* ret );

Then, with the relevant contestants, the results were quite a wow moment for me:

v5.42.0 / MSWin32-x64-multi-thread / 13.2.0

String length: 1,000
               Rate c_better    c2f      f
c_better   829929/s       --    -8%   -17%
c2f        905443/s       9%     --   -10%
f         1003704/s      21%    11%     --

String length: 10,000
               Rate c_better    c2f      f
c_better    94179/s       --   -11%   -22%
c2f        105326/s      12%     --   -13%
f          120618/s      28%    15%     --

String length: 100,000
               Rate c_better    c2f      f
c_better     9683/s       --   -12%   -22%
c2f         10983/s      13%     --   -12%
f           12418/s      28%    13%     --

String length: 1,000,000
               Rate c_better    c2f      f
c_better      969/s       --   -10%   -22%
c2f          1082/s      12%     --   -13%
f            1246/s      28%    15%     --

String length: 10,000,000
               Rate c_better    c2f      f
c_better     95.1/s       --    -9%   -20%
c2f           104/s      10%     --   -12%
f             119/s      25%    14%     --

String length: 100,000,000
               Rate c_better    c2f      f
c_better     9.55/s       --    -9%   -18%
c2f          10.5/s      10%     --   -10%
f            11.6/s      22%    11%     --

Re^2: Weird performance issue with Strawberries and Inline::C
by Anonymous Monk on Oct 27, 2025 at 11:46 UTC
    I can only assume that memory for arrays declared within a subroutine is allocated per call (?? I don't know); this of course doesn't apply to dummy arrays, i.e. subroutine arguments, which are passed by reference. The scratch-pads, i.e. "a" and "cum", were declared with size "n" rather than an explicit constant; does that matter at all?

    At least it's easy to observe that memory is returned to the OS upon return from a subroutine in which a VLA (variable-length array) is declared (and used) -- _if_ the array gets large enough, around ~130K 64-bit integers. It is _not_ returned to the OS for smaller arrays. I don't know if or when other factors come into play, and I doubt this factoid has much value, but it explains why the original "f" subroutine was slower for 1e6 and longer input.

    Let's aim higher than a ~40% gain through tweaks and compiler voodoo -- let's go parallel!

    use strict; use warnings;
    use feature 'say';
    use Config;
    use Benchmark qw/ timeit :hireswallclock /;
    use lib '.';
    use Max_score_subs ':all';

    say "$^V / $Config{ archname } / $Config{ gccversion }";

    my $str;
    my %tests = (
        # C:
        c        => sub { mscore_c( $str )},        # - dumb
        c_better => sub { mscore_c_better( $str )}, # - tweaked
        c_omp    => sub { mscore_c_omp( $str )},    # - "c", parallel
        #
        # Fortran:
        c2f      => sub { mscore_c2f( $str )},      # - "c_better" (in F)
        f        => sub { mscore_f( $str )},        # - "PDL" (in F)
        f_omp    => sub { mscore_f_omp( $str )},    # - "f", parallel
    );

    my $iters = 2e5;
    my $fmt = '%8.0f';

    for my $L ( 1e4, 1e5, 1e6, 1e7, 1e8 ) {
        say "\nString length: " . $L =~ s/(\d)(?=(\d{3})+$)/$1,/gr;
        $str = '1' x $L;
        substr $str, rand $L, 1, '0' for 1 .. $L;
        my $any;
        my %ret;
        for my $key ( keys %tests ) {
            my $result = $tests{ $key }-> ();
            die "<<< $key!!! >>>" unless $result == ( $any //= $result );
            my $t = timeit( $iters, $tests{ $key });
            $ret{ $key } = $t-> iters / $t-> real
        }
        print "             Rate/s     %\n";
        $fmt = '%8.1f' if $L > 1e6;
        for my $key ( sort { $ret{ $a } <=> $ret{ $b }} keys %ret ) {
            printf " %-9s $fmt %5.0f\n", $key,
                $ret{ $key }, 100 * $ret{ $key } / $ret{ c }
        }
        $iters /= 10
    }

    __END__
    v5.42.0 / MSWin32-x64-multi-thread / 13.2.0

    String length: 10,000
                 Rate/s     %
     f_omp        23804    28
     c_omp        25362    29
     c            86100   100
     c_better     91470   106
     c2f         105239   122
     f           120134   140

    String length: 100,000
                 Rate/s     %
     c             8724   100
     c_better      9324   107
     c2f          10686   122
     f            12139   139
     c_omp        15369   176
     f_omp        15964   183

    String length: 1,000,000
                 Rate/s     %
     c              875   100
     c_better       931   106
     c2f           1068   122
     f             1213   139
     c_omp         2306   264
     f_omp         2560   293

    String length: 10,000,000
                 Rate/s     %
     c             86.8   100
     c_better      92.2   106
     c2f          104.5   120
     f            115.6   133
     c_omp        349.6   403
     f_omp        426.9   492

    String length: 100,000,000
                 Rate/s     %
     c              8.6   100
     c_better       9.2   106
     c2f           10.4   120
     f             11.4   131
     c_omp         34.7   401
     f_omp         42.4   491

    What a pity that I only have 4 cores and can't achieve more than a 4x increase; more would be fun to watch. In hindsight, the improved "c_better" and its translation "c2f" were a trap: with their loops, it's not obvious how to split the work into independent parts. OTOH, the "dumb" C version is not too difficult to split:

    int mscore_c_omp( SV* str ) {
        STRLEN len;
        char* buf = SvPVbyte( str, len );
        len --;

        int nparts = omp_get_num_procs();
        if ( nparts > 4 ) nparts = 4;
        int psize = len / nparts;
        int acc_a[ nparts ];
        int cum_a[ nparts ];
        int max_a[ nparts ];

        #pragma omp parallel for schedule( static, 1 ) \
            num_threads( nparts )
        for ( int j = 0; j < nparts; j ++ ) {
            int lo = j * psize;
            int hi = ( j == nparts - 1 ) ? len : lo + psize;
            int acc = 0;
            int cum = 0;
            int max = -1;
            for ( int i = lo; i < hi; i ++ ) {
                int one = buf[ i ] - '0';
                acc += one;
                cum += one ? -1 : 1;
                if ( cum > max ) max = cum;
            }
            acc_a[ j ] = acc;
            cum_a[ j ] = cum;
            max_a[ j ] = max;
        }

        int acc = acc_a[ 0 ];
        int max = max_a[ 0 ];
        for ( int j = 1; j < nparts; j ++ ) {
            acc += acc_a[ j ];
            cum_a[ j ] += cum_a[ j - 1 ];
            max_a[ j ] += cum_a[ j - 1 ];
            if ( max_a[ j ] > max ) max = max_a[ j ];
        }
        return acc + max + ( buf[ len ] == '1' );
    }

    In my tests, a string length of ~70k is the cut-off above which the OMP versions begin to outrun the originals. And below that it's better to simply switch to the originals, rather than add an "if" clause to the omp pragma: the function above is somewhat cumbersome and slow. Speaking of cumbersome, the OMP-enabled Fortran sub is a long way from the original (elegant) PDL version. At least there remains a ~25% gain over "c_omp", through the best of both worlds: parallelism with OMP, and compiler voodoo with vectorization.

    subroutine f_omp( n, s, ret )
        integer, intent( in ) :: n
        integer * 1, intent( in ) :: s( n )
        integer, intent( out ) :: ret
        integer :: j, nparts, psize
        integer, allocatable :: acc_a( : ), cum_a( : ), min_a( : )

        nparts = omp_get_num_procs()
        if ( nparts > 4 ) nparts = 4
        psize = n / nparts
        allocate( acc_a( nparts ))
        allocate( cum_a( nparts ))
        allocate( min_a( nparts ))

        !$omp parallel do schedule( static, 1 ) num_threads( nparts )
        do j = 1, nparts ; block
            integer * 1, allocatable :: a( : )
            integer, allocatable :: cum( : )
            integer :: lo, hi, d, nchunks
            integer :: pre, L, i, k
            integer :: acc_, cum_, min_

            lo = ( j - 1 ) * psize
            if ( j == nparts ) then
                hi = n - 1
            else
                hi = j * psize
            end if
            d = hi - lo
            nchunks = d / CHUNK
            if ( mod( d, CHUNK ) .ne. 0 ) nchunks = nchunks + 1
            acc_ = 0; cum_ = 0; min_ = 2
            allocate( a( CHUNK ))
            allocate( cum( CHUNK ))
            do k = 1, nchunks
                pre = lo + ( k - 1 ) * CHUNK
                L = min( CHUNK, hi - pre )
                a( 1 : L ) = 2 * s( pre + 1 : pre + L ) - A97
                cum( 1 ) = cum_ + a( 1 )
                do i = 2, L
                    cum( i ) = cum( i - 1 ) + a( i )
                end do
                acc_ = acc_ + count( s( pre + 1 : pre + L ) &
                    == iachar( '1', 1 ))
                cum_ = cum( L )
                min_ = min( min_, minval( cum( 1 : L )))
            end do
            acc_a( j ) = acc_
            cum_a( j ) = cum_
            min_a( j ) = min_
        end block ; end do
        !$omp end parallel do

        do j = 2, nparts
            cum_a( j ) = cum_a( j ) + cum_a( j - 1 )
            min_a( j ) = min_a( j ) + cum_a( j - 1 )
        end do
        ret = sum( acc_a ) + ( s( n ) - iachar( '0', 1 )) &
            - minval( min_a )
    end subroutine f_omp

      (OP here again, with the long-forgotten PWC 342-2, exploring the depths of muddy waters around 4-5 lines of trivial C)

      What a pity that I only have 4 cores and can't achieve more than a 4x increase; more would be fun to watch

      Fun should never be denied:

      v5.42.0 / MSWin32-x64-multi-thread / 13.2.0

      String length: 10,000
               Rate/s     %
       c        87663   100
       va      622005   710

      String length: 100,000
               Rate/s     %
       c         8875   100
       c_omp    15439   174
       va       71646   807

      String length: 1,000,000
               Rate/s     %
       c          891   100
       c_omp     2307   259
       va       12770  1434

      String length: 10,000,000
               Rate/s     %
       c         88.3   100
       c_omp    361.2   409
       va      1615.6  1829

      String length: 100,000,000
               Rate/s     %
       c          8.8   100
       c_omp     36.2   410
       va       167.7  1900

      'c' and 'c_omp' are 'mscore_c' (the straightforward, i.e. "dumb", version) and 'mscore_c_omp' found in this thread. 'va' stands for 'vectors, assembly'; OMP (I have 4 cores) is switched on "automatically" for strings longer than 4e5. The CPU is assumed to support AVX2, i.e. to be 2017+.

      There's a "cheat": sequential runs of either '0's or '1's in the target string must not be longer than 2**31, because the current (running) scores are maintained (dragged along) as 8 x 32-bit values per 256-bit register. Perhaps there is a more optimal way to find the (horizontal) cumulative sums along 16 bytes than 4 shifts/shuffles and 4 additions. What's really funny is that to find (rather, maintain) the running maximum over 32 numbers I'm only doing 3 comparisons. But maybe all of the above (i.e. below) looks not funny but ugly to people who know how.

      use Inline C => Config =>
          ccflagsex => '-fopenmp',
          libs => '-lgomp -ldl';
      use Inline C => << 'END_OF_C';

      #include <omp.h>

      #define MANY 400000
      #define MAX_CORES 4

      void _helper_a( char* buf, size_t len,
          int32_t* acc, int32_t* cum, int32_t* max );
      void _helper_c( char* buf, size_t len,
          int32_t* acc, int32_t* cum, int32_t* max );

      int mscore_va( SV* str ) {
          STRLEN len;
          char* buf = SvPVbyte( str, len );

          int ncores = 1;
          if ( len >= MANY ) {
              int n = omp_get_num_procs();
              ncores = n > MAX_CORES ? MAX_CORES : n;
          }
          int nparts = 2 + ncores;
          int32_t acc_a[ nparts ];
          int32_t cum_a[ nparts ];
          int32_t max_a[ nparts ];
          memset( acc_a, 0x00, sizeof( acc_a ));
          memset( cum_a, 0x00, sizeof( cum_a ));
          memset( max_a, 0xFF, sizeof( max_a ));

          len --;
          int32_t prefix_len = ( size_t ) buf % 32;
          if ( len < prefix_len ) prefix_len = len;
          int32_t suffix_len = ( len - prefix_len ) % 32;
          int32_t aligned_len = len - prefix_len - suffix_len;
          char* prefix_start = buf;
          char* aligned_start = buf + prefix_len;
          char* suffix_start = aligned_start + aligned_len;

          if ( prefix_len > 0 ) {
              _helper_c( prefix_start, prefix_len,
                  acc_a, cum_a, max_a );
          }
          if ( suffix_len > 0 ) {
              _helper_c( suffix_start, suffix_len,
                  &( acc_a[ nparts - 1 ]),
                  &( cum_a[ nparts - 1 ]),
                  &( max_a[ nparts - 1 ]));
          }
          if ( aligned_len > 0 ) {
              if ( ncores == 1 ) {
                  _helper_a( aligned_start, aligned_len,
                      &( acc_a[ 1 ]), &( cum_a[ 1 ]), &( max_a[ 1 ]));
              }
              else {
                  size_t psize = len / 32 / ncores * 32;
                  #pragma omp parallel for schedule( static, 1 ) \
                      num_threads( ncores )
                  for ( int j = 0; j < ncores; j ++ ) {
                      char* start = aligned_start + j * psize;
                      size_t len_ = ( j < ncores - 1 ) ?
                          psize : aligned_len - j * psize;
                      _helper_a( start, len_,
                          &( acc_a[ j + 1 ]),
                          &( cum_a[ j + 1 ]),
                          &( max_a[ j + 1 ]));
                  }
              }
          }

          int32_t acc = acc_a[ 0 ];
          int32_t max = max_a[ 0 ];
          for ( int j = 1; j < nparts; j ++ ) {
              acc += acc_a[ j ];
              cum_a[ j ] += cum_a[ j - 1 ];
              max_a[ j ] += cum_a[ j - 1 ];
              if ( max_a[ j ] > max ) max = max_a[ j ];
          }
          return acc + max + ( buf[ len ] == '1' );
      }

      void _helper_c( char* buf, size_t len,              // in
          int32_t* acc, int32_t* cum, int32_t* max ) {    // out
          for( int i = 0; i < len; i ++ ) {
              int one = buf[ i ] - '0';
              *acc += one;
              *cum += one ? -1 : 1;
              if ( *cum > *max ) *max = *cum;
          }
      }

      void _helper_a( char* buf, size_t len,              // in
          int32_t* acc, int32_t* cum, int32_t* max ) {    // out
          void* fin = buf + len;
          int32_t acc_, cum_, max_;

          __asm__ (
          //  " IDDQD                                  \n"
              " XOR          %[acc], %[acc]            \n"

              " MOV          $0x00, %%r11              \n"
              " MOVQ         %%r11, %%xmm0             \n"
              " VPBROADCASTB %%xmm0, %%ymm0            \n" // ymm0 = cum
              " MOV          $0xFF, %%r11              \n"
              " MOVQ         %%r11, %%xmm1             \n"
              " VPBROADCASTB %%xmm1, %%ymm1            \n" // ymm1 = max
              " MOV          $0x61, %%r11              \n"
              " MOVQ         %%r11, %%xmm2             \n"
              " VPBROADCASTB %%xmm2, %%ymm2            \n" // ymm2 = 97 x 32
              " MOV          $0x0F, %%r11              \n"
              " MOVQ         %%r11, %%xmm6             \n"
              " VPBROADCASTB %%xmm6, %%ymm6            \n" // ymm6 = 15 x 32

              " MOV          %[buf], %%r8              \n"
              " MOV          %[fin], %%r9              \n"
              " MOV          $32, %%r10                \n"

              ".L2_:                                   \n"
              " VMOVDQA      (%%r8), %%ymm3            \n"
              " VPADDB       %%ymm3, %%ymm3, %%ymm3    \n"
              " VPSUBB       %%ymm3, %%ymm2, %%ymm3    \n"
              " VPMOVMSKB    %%ymm3, %%ecx             \n"
              " POPCNT       %%ecx, %%ecx              \n"
              " ADD          %%ecx, %[acc]             \n"
              " VPSLLDQ      $1, %%ymm3, %%ymm4        \n"
              " VPADDSB      %%ymm3, %%ymm4, %%ymm3    \n"
              " VPSLLDQ      $2, %%ymm3, %%ymm4        \n"
              " VPADDSB      %%ymm3, %%ymm4, %%ymm3    \n"
              " VPSLLDQ      $4, %%ymm3, %%ymm4        \n"
              " VPADDSB      %%ymm3, %%ymm4, %%ymm3    \n"
              " VPSLLDQ      $8, %%ymm3, %%ymm4        \n"
              " VPADDSB      %%ymm3, %%ymm4, %%ymm3    \n"
              " VPSHUFB      %%ymm6, %%ymm3, %%ymm4    \n"
              " VPSHUFD      $0b01001110, %%ymm3, %%ymm5 \n"
              " VPMAXSB      %%ymm3, %%ymm5, %%ymm3    \n"
              " VPMOVSXBD    %%xmm3, %%ymm5            \n"
              " VPADDD       %%ymm0, %%ymm5, %%ymm5    \n"
              " VPMAXSD      %%ymm1, %%ymm5, %%ymm1    \n"
              " VPMOVSXBD    %%xmm4, %%ymm5            \n"
              " VPADDD       %%ymm0, %%ymm5, %%ymm0    \n"
              " VPERMQ       $0b01001110, %%ymm3, %%ymm3 \n"
              " VPERMQ       $0b01001110, %%ymm4, %%ymm4 \n"
              " VPMOVSXBD    %%xmm3, %%ymm5            \n"
              " VPADDD       %%ymm0, %%ymm5, %%ymm5    \n"
              " VPMAXSD      %%ymm1, %%ymm5, %%ymm1    \n"
              " VPMOVSXBD    %%xmm4, %%ymm5            \n"
              " VPADDD       %%ymm0, %%ymm5, %%ymm0    \n"
              " ADD          %%r10, %%r8               \n"
              " CMP          %%r9, %%r8                \n"
              " JL           .L2_                      \n"

              " VPERMQ       $0b00111001, %%ymm1, %%ymm4 \n"
              " VPMAXSD      %%ymm1, %%ymm4, %%ymm1    \n"
              " VPERMQ       $0b01001110, %%ymm1, %%ymm4 \n"
              " VPMAXSD      %%ymm1, %%ymm4, %%ymm1    \n"
              " PSHUFD       $0b00111001, %%xmm1, %%xmm4 \n"
              " PMAXSD       %%xmm4, %%xmm1            \n"
              " MOVD         %%xmm1, %[max]            \n"
              " MOVD         %%xmm0, %[cum]            \n"

              : [acc] "=r" ( acc_ ),
                [cum] "=rm" ( cum_ ),
                [max] "=rm" ( max_ )
              : [buf] "m" ( buf ),
                [fin] "m" ( fin )
              : "cc", "rcx", "r8", "r9", "r10", "r11",
                "ymm0", "ymm1", "ymm2", "ymm3",
                "ymm4", "ymm5", "ymm6"
          );
          *acc = acc_;
          *cum = cum_;
          *max = max_;
      }
      END_OF_C

        Sorry, hopefully this will be the last instalment on this same PWC 342-2 -- mission (kind of) accomplished, with a symbolic, "all-important" order-of-magnitude speed gain over "plain" C on a single core, through a little re-arrangement, partial loop unrolling (stepping by 96 bytes instead of 32: sums of "-1"s and "1"s won't overflow and can be kept as bytes a little longer) and a different choice of some instructions (TIMTOWTDI; there's a real zoo of them). Apologies also (I can't edit the parent) for calling AVX2 "2017+" -- of course it's older -- and for using "len" in place of "aligned_len" above, once: it affects neither speed nor result, but is just unclean.

        String length: 10,000
                    Rate/s     %
         c           88080   100
         va_single  811975   922

        String length: 100,000
                    Rate/s     %
         c            8945   100
         va_single  108085  1208

        String length: 1,000,000
                    Rate/s     %
         c             894   100
         va_single   10802  1208

        String length: 10,000,000
                    Rate/s     %
         c            88.4   100
         va_single   913.5  1033

        String length: 100,000,000
                    Rate/s     %
         c             8.8   100
         va_single    83.4   946

        However, there's no gain (compared to the "unoptimised" version in the parent node) with OMP (same CPU with 4 cores):

        String length: 100,000,000
                   Rate/s     %
         c            8.8   100
         va_omp     168.7  1910

        That, and the relative decrease in advantage for very long input (no additional memory allocation occurs, only the same simple processing of string chunks, sequentially), I can only speculate is the result of throttling of some kind.

        And by the way, I also tried placing the "mscore_c" C function in a separate file and compiling it with "-O3 -march=native" (can't do that with Inline::C, can I?), which googling suggests should optimise/vectorise to the best of the compiler's "voodoo"; then linking the library and calling the wrapped function from Inline::C. Well, for a uniform '"1"s only' string it gave a ~25% gain, but for a "random" string it was 3 times slower (than "-O2", i.e. compiled directly from within Inline::C). This is ridiculous: "voodoo", "A.I.", "free" optimisations, and what not.