in reply to Re^3: Weird performance issue with Strawberries and Inline::C
in thread Weird performance issue with Strawberries and Inline::C

Sorry, hopefully this will be the last instalment on that same PWC 342-2 -- because mission (kind of) accomplished, with symbolic and "all-important" order-of-magnitude speed gain over "plain" C on a single core, through a little re-arrangement, partial loop unrolling (step in 96 bytes instead of 32: sums of "-1" and "1"s won't overflow and can be kept as bytes a little longer) and different choice for some instructions (TIMTOWTDI, there's real zoo of them). Apologies (can't edit parent), also, for calling AVX2 as "2017+", of course it's older; and using "len" in place of "aligned_len" above, once: it doesn't affect speed nor result, but is just unclean.

String length: 10,000 Rate/s % c 88080 100 va_single 811975 922 String length: 100,000 Rate/s % c 8945 100 va_single 108085 1208 String length: 1,000,000 Rate/s % c 894 100 va_single 10802 1208 String length: 10,000,000 Rate/s % c 88.4 100 va_single 913.5 1033 String length: 100,000,000 Rate/s % c 8.8 100 va_single 83.4 946

However, no gain (compared to "unoptimised" version in parent node) with OMP (same CPU with 4 cores):

String length: 100,000,000 Rate/s % c 8.8 100 va_omp 168.7 1910

That, and relative decrease in advantage for very long input (no additional memory allocation occurs, but only the same simple processing of string chunks, sequentially), I can only speculate is result of throttling of some kind.

And by the way, I also did try to place "mscore_c" C function in separate file, then compile it with "-O3 -march=native" (can't do it with Inline::C, can I?) which would optimise/vectorise to the best of compiler's "voodoo", as googling suggests; then link the library and call the wrapped function from Inline::C. Well, for uniform '"1"s only' string it gave ~25% gain, but for a "random" string it is 3 times slower (than "-O2" i.e. compiled directly from within Inline::C.) This is ridiculous, "voodoo", "A.I", "free" optimisations, and what not.

Replies are listed 'Best First'.
Re^5: Weird performance issue with Strawberries and Inline::C
by Anonymous Monk on Nov 12, 2025 at 13:53 UTC
    can't do it with Inline::C, can I?

    Yes I can, no need in separate source file; here to prove my claims from parent node:

    use strict; use warnings; use feature 'say'; use Benchmark 'cmpthese'; use Config; say "$^V / $Config{ archname } / $Config{ gccversion }"; my $L = 1e6; my $str = '1' x $L; BEGIN { our $c = << 'END_OF_C'; int %s( SV* str ) { STRLEN len; char* buf = SvPVbyte( str, len ); len --; int acc = 0; int cum = 0; int max = -1; for( int i = 0; i < len; i ++ ) { int one = buf[ i ] - '0'; acc += one; cum += one ? -1 : 1; if ( cum > max ) max = cum; } return acc + max + ( buf[ len ] == '1' ); } END_OF_C } use Inline C => ' #pragma GCC target( "avx2" ) #pragma GCC optimize( "O3" ) ' . sprintf our $c, 'mscore_optimised'; use Inline C => sprintf our $c, 'mscore_default'; my %tests = ( default => sub { mscore_default( $str )}, optimised => sub { mscore_optimised( $str )}, ); say 'uniform string:'; cmpthese -1, \%tests; substr $str, rand $L, 1 , '0' for 1 .. $L; say 'random string:'; cmpthese -1, \%tests; __END__ v5.42.0 / MSWin32-x64-multi-thread / 13.2.0 uniform string: Rate default optimised default 897/s -- -19% optimised 1111/s 24% -- random string: Rate optimised default optimised 245/s -- -73% default 897/s 266% --
Re^5: Weird performance issue with Strawberries and Inline::C
by Anonymous Monk on Nov 17, 2025 at 14:54 UTC

    OT:

    Cumulative sums are fun. Here's its application (sum "-1" and "1"s, but with "saturation": don't fall below zero) for another task, PWC 346-1 "Longest Parenthesis". There are at least half a dozen identical (i.e. verbatim) Perl solutions, doesn't matter if it's because task is trivial, or "AI" was involved, or subroutine was copied. In fact, that Perl solution is significantly faster for medium and long inputs than a few others based on nested loops or regular expressions. Below is my translation of it to C. (I'm not fluent in C, can it be written better? Hopefully, I haven't put it in much disadvantage because of that.)

    Surprisingly, my "cumulative sum" solution is faster, though it involves second sweep over input and seemingly more arithmetic. For shorter, 1000 random strings 1000 chars long:

    v5.42.0 / MSWin32-x64-multi-thread / 13.2.0 Rate pwc sum pwc 199/s -- -46% sum 365/s 84% --

    And for set of 100 of longer strings (100_000 length):

    Rate pwc sum pwc 21.4/s -- -56% sum 48.2/s 125% --

    Why C and, hence, OT? I had Perl prototype, somewhat faster than mentioned Perl PWC solution; then did PDL prototype (~ 10x .. 20x faster, of course), but had to write a small PP fragment for it; and thought why not do it in C completely then, anyway. Had no intention to translate PWC Perl to C for competition; but found C++ solution in GH PWC directory. Something to compile to machine code, hurray. Well, for a single 1e6 chars long string it took twenty seconds, no joking. (C functions, below, are in milliseconds).

    Perl code for the results above:

    use strict; use warnings; use feature 'say'; use Config; use Benchmark 'cmpthese'; say "$^V / $Config{ archname } / $Config{ gccversion }"; my $L = 1e5; my $s = '(' x $L; my @samples; for ( 0 ... 99 ) { my $s1 = $s; for ( 0 .. $L - 1 ) { substr $s1, $_, 1, ')' if int rand 2 } push @samples, $s1; die unless longest_p_pwc( $s1 ) == longest_p_sum( $s1 ) } cmpthese -3, { pwc => sub { longest_p_pwc( $_ ) for @samples }, sum => sub { longest_p_sum( $_ ) for @samples }, }; sub valid_longest_parenthesis { my $s = shift; my @stack = (-1); my $max_len = 0; for my $i (0 .. length($s) - 1) { if (substr($s, $i, 1) eq "(") { push @stack, $i; } else { pop @stack; if (@stack) { $max_len = $max_len > ($i - $stack[-1]) ? $max_len : ($i - $stack[-1]); } else { push @stack, $i; # New starting point } } } return $max_len; } use Inline C => << 'END_OF_C'; int longest_p_pwc( SV* str ) { STRLEN len; char* buf = SvPVbyte( str, len ); int* stack = malloc( len * sizeof( int )); int stack_i = 0; stack[ 0 ] = -1; int max = 0; for ( int i = 0; i < len; i ++ ) { if ( buf[ i ] == '(' ) stack[ ++ stack_i ] = i; else { if ( stack_i > 0 ) { stack_i --; int cur = i - stack[ stack_i ]; if ( cur > max ) max = cur; } else stack[ 0 ] = i; } } free( stack ); return max; } int longest_p_sum( SV* str ) { STRLEN len; char* buf = SvPVbyte( str, len ); int* pad = malloc( len * sizeof( int )); int sum = 0; for ( int i = 0; i < len; i ++ ) pad[ i ] = sum = sum + !sum + ( buf[ i ] == '(' ? 1 : -1 ); int max = 0; int cur = 0; int counting = 0; int zeroless = 1; sum = 0; for ( int i = len - 1; i >= 0; i -- ) { zeroless = zeroless && pad[ i ]; sum = zeroless ? sum + !sum + ( buf[ i ] == ')' ? 1 : -1 ) : 1; if ( pad[ i ] && sum ) { cur ++; counting = 1; } else if ( counting ) { if ( cur > max ) max = cur; counting = 0; cur = 0; } } if ( cur > max ) max = cur; free( pad ); return max; } END_OF_C