Weird performance issue with Strawberries and Inline::C

Replies are listed 'Best First'.
Re: Weird performance issue with Strawberries and Inline::C by NERDVANA (Priest) on Oct 12, 2025 at 20:03 UTC
C3 avoids a jump instruction by always performing an addition. For a super-tight loop like this, those jumps can make a big difference. I'm not sure that I would expect a C compiler to realize that C4 could be rewritten as C3. I had a longer answer, but ran it past Claude.AI and it knew a lot more about the topic than I did and my answer would have been wrong. Of course this could be inaccurate/wrong too, but it looks more accurate than what I was going to say. At the risk of offending people, here's is a portion of its answer: On x86-64, there isn't a direct conditional arithmetic instruction like you describe, but there are several approaches that can eliminate branches: CMOVcc (Conditional Move) - The most common approach `; Assuming condition result is in a flag mov eax, i lea edx, [rax + 1] ; edx = i + 1 cmovnz eax, edx ; if condition != 0, eax = edx mov i, eax` [download] or more directly: `lea edx, [i + 1] test condition, condition cmovnz i, edx ; conditionally move the incremented value` [download] SETcc (Set Byte on Condition) - Convert condition to 0/1 `xor eax, eax test condition, condition setnz al ; al = (condition != 0) ? 1 : 0 add i, eax ; i += 0 or 1` [download] This is often the most efficient for your exact use case - it turns the condition into a 0 or 1, then unconditionally adds it. ADC/SBB (Add/Subtract with Carry) - If condition is in carry flag `; If your condition sets the carry flag: adc i, 0 ; i += carry flag` [download] The SETcc + ADD pattern is typically what compilers generate for branchless if (cond) i++, and it's usually faster than a branch for unpredictable conditions.	[reply] [d/l] [select]
Re^2: Weird performance issue with Strawberries and Inline::C by tonyc (Hermit) on Oct 12, 2025 at 21:14 UTC
Strawberryperl 5.32 used gcc 8.3.0 while 5.42.0.1 uses gcc 13.2.0 which probably accounts for the difference between versions. From looking at godbolt gcc 8.3.0 optimizes that branch into the conditional expression while 13.2.0 doesn't	[reply]
Re^3: Weird performance issue with Strawberries and Inline::C by syphilis (Archbishop) on Oct 13, 2025 at 02:53 UTC
From looking at godbolt gcc 8.3.0 optimizes that branch into the conditional expression while 13.2.0 doesn't Going back through my own builds of perl on Windows, I can see that the rot goes back as far as gcc-10. I don't know if gcc-9 was afflicted with this same issue as I don't have a perl that was built with gcc-9. Incidentally, things are much better with perl-5.42.0 built using Microsoft's Visual Studio 2022: `v5.42.0 String length: 100000 Rate c4 c3 c4 13558/s -- -8% c3 14740/s 9% --` [download] (Well done, them ;-) BTW, the OP will probably be able to use gcc-8.3.0 to build the script, and reap the benefits of the better optimization capabilities provided by gcc-8.3.0. I inserted the following just prior to the "`use Inline C => << 'END_OF_C';`" in the OP's script. `use Inline C => Config => # Force recompilation FORCE_BUILD => 1, # Set CC to to path to gcc.exe version 8.3.0 CC => 'C:/sp/_64/sp-5.32.0/c/bin/gcc.exe', # View build output BUILD_NOISY => 1, ;` [download] That worked fine for me on my build of perl-5.42.0, using gcc-15.1.0: `v5.42.0 String length: 100000 Rate c3 c4 c3 14460/s -- -0% c4 14531/s 0% --` [download] Without that modification, the output was: `v5.42.0 String length: 100000 Rate c4 c3 c4 3528/s -- -76% c3 14777/s 319% --` [download] UPDATE: I tried that same hack of using gcc.exe version 8.3.0 with current blead (built using gcc-15.2.0) and it failed with: `try2_pl_62b5.c: loadable library and perl binaries are mismatched (got + first handshake key 0000000012e00080, needed 0000000012d00080)` [download] ~~Looks like us hackers have now been deprived of yet another liberty.~~ It's a bit more fickle than I thought. The hacked script works fine for my own build of perl-5.42.0, but not for Strawberry's build of perl-5.42.0. (There are small differences between those two builds of 5.42.0, but it would make better sense to me if it was the other way around. Anyway .... whilst I find this to be tantalizingly interesting, it's not massively important.) (If it's going to crash, I would prefer that they let it just do that - rather than forbid something simply because the practice is deemed to be dubious.) Cheers, Rob	[reply] [d/l] [select]
Re^3: Weird performance issue with Strawberries and Inline::C by NERDVANA (Priest) on Oct 13, 2025 at 02:51 UTC
Wow that's a cool website! Never seen that before.	[reply]
Re^2: Weird performance issue with Strawberries and Inline::C by ikegami (Patriarch) on Oct 13, 2025 at 00:35 UTC
You explained that c4 is slower than c3, but it wasn't in 5.32. The question isn't why c4 is slow; it's why it's slow now and not before.	[reply]
Re^3: Weird performance issue with Strawberries and Inline::C by NERDVANA (Priest) on Oct 13, 2025 at 02:46 UTC
I was just making the point that C4 is naturally expected to be slower, to emphasize that you should write it like C3 if you don't want to rely on compiler optimization voodoo. It's nice when compiler optimizations work, but better to not rely on them. And yes I didn't do any of the effort to track down which optimization was lost and whether Perl's tooling was responsible.	[reply]
Re: Weird performance issue with Strawberries and Inline::C by Anonymous Monk on Oct 13, 2025 at 11:55 UTC
Thank you, all. FWIW, replacing the starred line with "`acc += buf[ i ] - '0';`" gives 10% boost over C3 in both 5.32 and 5.42.	[reply] [d/l]
Re: Weird performance issue with Strawberries and Inline::C by ikegami (Patriarch) on Oct 12, 2025 at 19:33 UTC
Show the command used to compile these. The `noisy` or `info` option should do that. `use Inline qw( force noisy info );` [download] or `perl -MInline=force,noisy,info ...` [download] Additionally, the output of the following is possibly pertinent: `perl -V:"cc(flags)?"` [download] (That's an uppercase "`V`".)	[reply] [d/l] [select]
Re^2: Weird performance issue with Strawberries and Inline::C by Anonymous Monk on Oct 12, 2025 at 19:58 UTC
C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL>perl -V:"cc(flags)?" cc='gcc'; ccflags=' -DWIN32 -DWIN64 -D__USE_MINGW_ANSI_STDIO -DPERL_TEXTMODE_SCR +IPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv + -fno-strict-aliasing -mms-bitfields'; C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL>perl -MInline=force,no +isy,info test_inline.pl validate Stage <-----------------------Information Section--------------------------- +--------> Information about the processing of your Inline C code: Your source code needs to be compiled. I'll use this build directory: C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\_Inline\build\test_inl +ine_pl_62b5 and I'll install the executable as: C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\_Inline\lib\auto\test_ +inline_pl_62b5\test_inline_pl_62b5.xs.dll get_maps Stage The following Inline C function(s) have been successfully bound to Per +l: int test_c_3(SV * str) int test_c_4(SV * str) <-----------------------End of Information Section-------------------- +--------> Starting Build Preprocess Stage Finished Build Preprocess Stage Starting Build Parse Stage Finished Build Parse Stage Starting Build Glue 1 Stage Finished Build Glue 1 Stage Starting Build Glue 2 Stage Finished Build Glue 2 Stage Starting Build Glue 3 Stage Finished Build Glue 3 Stage Starting Build Compile Stage Starting "perl Makefile.PL" Stage Generating a gmake-style Makefile Writing Makefile for test_inline_pl_62b5 Writing MYMETA.yml and MYMETA.json Finished "perl Makefile.PL" Stage Starting "make" Stage Running Mkbootstrap for test_inline_pl_62b5 () "C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Command -e chmod -- 644 "test_inline_pl_62b5.bs" "C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Command::MM -e cp_nonempty -- test_inline_pl_62b5.bs blib\a +rch\auto\test_inline_pl_62b5\test_inline_pl_62b5.bs 644 "C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" "C +:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\lib\ExtUtils/xsub +pp" -typemap "C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\l +ib\ExtUtils\typemap" test_inline_pl_62b5.xs > test_inline_pl_62b5.x +sc "C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Command -e mv -- test_inline_pl_62b5.xsc test_inline_pl_62b +5.c gcc -c -iquote"C:/berrybrew/strawberry-perl-5.32.1.1-64bit-PDL" -DWIN +32 -DWIN64 -D__USE_MINGW_ANSI_STDIO -DPERL_TEXTMODE_SCRIPTS -DPERL_IM +PLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-a +liasing -mms-bitfields -s -O2 -DVERSION=\"0.00\" -DXS_VERSION=\"0.0 +0\" "-IC:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\lib\CORE +" test_inline_pl_62b5.c "C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Mksymlists \ -e "Mksymlists('NAME'=>\"test_inline_pl_62b5\", 'DLBASE' => 'test +_inline_pl_62b5', 'DL_FUNCS' => { }, 'FUNCLIST' => [], 'IMPORTS' => +{ }, 'DL_VARS' => []);" g++.exe test_inline_pl_62b5.def -o blib\arch\auto\test_inline_pl_62b5\ +test_inline_pl_62b5.xs.dll -mdll -s -L"C:\berrybrew\strawberry-perl-5 +.32.1.1-64bit-PDL\perl\lib\CORE" -L"C:\berrybrew\strawberry-perl-5.32 +.1.1-64bit-PDL\c\lib" test_inline_pl_62b5.o "C:\berrybrew\strawberr +y-perl-5.32.1.1-64bit-PDL\perl\lib\CORE\libperl532.a" -lmoldname -lke +rnel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -l +ole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion - +lodbc32 -lodbccp32 -lcomctl32 -Wl,--enable-auto-image-base "C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Command -e chmod -- 755 blib\arch\auto\test_inline_pl_62b5\ +test_inline_pl_62b5.xs.dll Finished "make" Stage Starting "make install" Stage "C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Command::MM -e cp_nonempty -- test_inline_pl_62b5.bs blib\a +rch\auto\test_inline_pl_62b5\test_inline_pl_62b5.bs 644 Files found in blib\arch: installing files in blib\lib into architectu +re dependent library tree Installing C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\_Inline\lib +\auto\test_inline_pl_62b5\test_inline_pl_62b5.xs.dll Finished "make install" Stage Starting Cleaning Up Stage Finished Cleaning Up Stage Finished Build Compile Stage v5.32.1 String length: 100000 Rate c3 c4 c3 8706/s -- -0% c4 8706/s 0% -- [download] ... C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL>perl -V:"cc(flags)?" cc='gcc'; ccflags='-std=c99 -DWIN32 -DWIN64 -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICI +TY -DPERL_IMPLICIT_SYS -DUSE_PERLIO -D__USE_MINGW_ANSI_STDIO -fwrapv +-fno-strict-aliasing -mms-bitfields'; C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL>perl -MInline=force,no +isy,info test_inline.pl validate Stage <-----------------------Information Section--------------------------- +--------> Information about the processing of your Inline C code: Your source code needs to be compiled. I'll use this build directory: C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\_Inline\build\test_inl +ine_pl_62b5 and I'll install the executable as: C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\_Inline\lib\auto\test_ +inline_pl_62b5\test_inline_pl_62b5.xs.dll get_maps Stage The following Inline C function(s) have been successfully bound to Per +l: int test_c_3(SV * str) int test_c_4(SV * str) <-----------------------End of Information Section-------------------- +--------> Starting Build Preprocess Stage Finished Build Preprocess Stage Starting Build Parse Stage Finished Build Parse Stage Starting Build Glue 1 Stage Finished Build Glue 1 Stage Starting Build Glue 2 Stage Finished Build Glue 2 Stage Starting Build Glue 3 Stage Finished Build Glue 3 Stage Starting Build Compile Stage Starting "perl Makefile.PL" Stage Generating a gmake-style Makefile Writing Makefile for test_inline_pl_62b5 Writing MYMETA.yml and MYMETA.json Finished "perl Makefile.PL" Stage Starting "make" Stage Running Mkbootstrap for test_inline_pl_62b5 () "C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Command -e chmod -- 644 "test_inline_pl_62b5.bs" "C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Command::MM -e cp_nonempty -- test_inline_pl_62b5.bs blib\a +rch\auto\test_inline_pl_62b5\test_inline_pl_62b5.bs 644 "C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" "C +:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\lib\ExtUtils/xsub +pp" -typemap "C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\l +ib\ExtUtils\typemap" test_inline_pl_62b5.xs > test_inline_pl_62b5.x +sc "C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Command -e mv -- test_inline_pl_62b5.xsc test_inline_pl_62b +5.c gcc -c -iquote"C:/berrybrew/strawberry-perl-5.42.0.1-64bit-PDL" -std= +c99 -DWIN32 -DWIN64 -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMP +LICIT_SYS -DUSE_PERLIO -D__USE_MINGW_ANSI_STDIO -fwrapv -fno-strict-a +liasing -mms-bitfields -O2 -DVERSION=\"0.00\" -DXS_VERSION=\"0.00\" + "-IC:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\lib\CORE" + test_inline_pl_62b5.c "C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Mksymlists \ -e "Mksymlists('NAME'=>\"test_inline_pl_62b5\", 'DLBASE' => 'test +_inline_pl_62b5', 'DL_FUNCS' => { }, 'FUNCLIST' => [], 'IMPORTS' => +{ }, 'DL_VARS' => []);" g++.exe test_inline_pl_62b5.def -o blib\arch\auto\test_inline_pl_62b5\ +test_inline_pl_62b5.xs.dll -mdll -s -L"C:\berrybrew\strawberry-perl-5 +.42.0.1-64bit-PDL\perl\lib\CORE" -L"C:\berrybrew\strawberry-perl-5.42 +.0.1-64bit-PDL\c\lib" test_inline_pl_62b5.o "C:\berrybrew\strawberr +y-perl-5.42.0.1-64bit-PDL\perl\lib\CORE\libperl542.a" -lmoldname -lke +rnel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -l +ole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion - +lodbc32 -lodbccp32 -lcomctl32 -Wl,--enable-auto-image-base "C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Command -e chmod -- 755 blib\arch\auto\test_inline_pl_62b5\ +test_inline_pl_62b5.xs.dll Finished "make" Stage Starting "make install" Stage "C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M +ExtUtils::Command::MM -e cp_nonempty -- test_inline_pl_62b5.bs blib\a +rch\auto\test_inline_pl_62b5\test_inline_pl_62b5.bs 644 Files found in blib\arch: installing files in blib\lib into architectu +re dependent library tree Installing C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\_Inline\lib +\auto\test_inline_pl_62b5\test_inline_pl_62b5.xs.dll Finished "make install" Stage Starting Cleaning Up Stage Finished Cleaning Up Stage Finished Build Compile Stage v5.42.0 String length: 100000 Rate c4 c3 c4 2456/s -- -72% c3 8706/s 254% -- [download]	[reply] [d/l] [select]
Re^3: Weird performance issue with Strawberries and Inline::C by ikegami (Patriarch) on Oct 13, 2025 at 00:37 UTC
I see `-O2` in both cases. It would be interesting to see the disassembly, but I'm not sure how to get that.	[reply] [d/l]
Re: Weird performance issue with Strawberries and Inline::C by Anonymous Monk on Oct 22, 2025 at 23:48 UTC
Here's more on compiler voodoo, though it involves a very trivial problem and primitive code; I either publish it (it's Perl related after all) or just throw it away. The PWC-342-2 was all about finding a maximum (an extremum, depends on POV) over cumulative sums. This is camouflaged, in my pure-Perl solution (below) by use of a regex; which is fastest (?) (over PWC GH directory) and readable OK. At the same time I wrote a PDL solution (rather, Perl/PDL combo); which is of course faster and no less (?) readable. Then (2 weeks ago?), because confused about OOK and "4-args-substr" (not used below; and the confusion cleared now, thanks to the answers to another SOPW question), I wrote a "straightforward" C solution, followed by a tweaked (through simple arithmetic) -- "c_better" -- solution. Further yet, I tried another -- Fortran -- language solution, pluggable directly into Perl's `Benchmark` harness thanks to `Inline::C::Cookbook` recipe. The `c2f` subroutine is almost exact, line-by-line translation of `c_better` to Fortran, and yet it's _faster_, don't know why, by the same margin regardless of `-O2` or `-O3` used to compile it, and despite of a C wrapper used to call it. (But, for the record, the "f" subroutine(s) require `-O3` to be the fastest.) Then, inspired by this success (?), I tried the direct, albeit perhaps mechanical, translation of PDL solution to Fortran. The latter doesn't have the cumulative sums intrinsic, so this one I had to do as an explicit loop. Unlike my C and "c2f", it performs several passes over input data (create a mask, count, transliterate, do cum sums, find a minimum) and stores intermediate arrays. However, it's _fastest_, for 1e4 and 1e5 input. use strict; use warnings; use feature 'say'; use Config; use Benchmark 'cmpthese'; use lib '.'; use Max_score_subs ':all'; say "$^V / $Config{ archname } / $Config{ gccversion }"; my $str; my %tests = ( perl => sub { mscore_perl( $str )}, pdl => sub { mscore_pdl( $str )}, c => sub { mscore_c( $str )}, c_better => sub { mscore_c_better( $str )}, c2f => sub { mscore_c2f( $str )}, f => sub { mscore_f( $str )}, ); for my $L ( 1e3, 1e4, 1e5, 1e6, 1e7, 1e8 ) { say "\nString length: " . $L =~ s/(\d)(?=(\d{3})+$)/$1,/gr; $str = '1' x $L; substr $str, rand $L, 1 , '0' for 1 .. $L; delete $tests{ perl } if $L > 1e6; delete $tests{ pdl } if $L > 1e7; cmpthese -1, \%tests; } __END__ v5.42.0 / MSWin32-x64-multi-thread / 13.2.0 String length: 1,000 Rate perl pdl c c_better f c2 +f perl 6637/s -- -90% -99% -99% -99% -99 +% pdl 67389/s 915% -- -91% -92% -92% -93 +% c 764229/s 11415% 1034% -- -8% -13% -16 +% c_better 830711/s 12416% 1133% 9% -- -5% -9 +% f 876596/s 13108% 1201% 15% 6% -- -4 +% c2f 911910/s 13640% 1253% 19% 10% 4% - +- String length: 10,000 Rate perl pdl c c_better c2f +f perl 635/s -- -94% -99% -99% -99% -99 +% pdl 11113/s 1651% -- -87% -88% -89% -90 +% c 86164/s 13476% 675% -- -9% -18% -26 +% c_better 94730/s 14826% 752% 10% -- -9% -19 +% c2f 104655/s 16390% 842% 21% 10% -- -10 +% f 116531/s 18261% 949% 35% 23% 11% - +- String length: 100,000 Rate perl pdl c c_better c2f f perl 63.1/s -- -95% -99% -99% -99% -99% pdl 1161/s 1741% -- -87% -88% -89% -91% c 8934/s 14061% 669% -- -8% -18% -28% c_better 9692/s 15262% 734% 8% -- -11% -22% c2f 10925/s 17216% 841% 22% 13% -- -12% f 12350/s 19475% 963% 38% 27% 13% -- String length: 1,000,000 Rate perl pdl f c c_better c2f perl 6.31/s -- -94% -99% -99% -99% -99% pdl 104/s 1554% -- -80% -88% -89% -90% f 535/s 8370% 412% -- -38% -45% -51% c 865/s 13600% 728% 62% -- -11% -20% c_better 968/s 15243% 828% 81% 12% -- -11% c2f 1085/s 17095% 940% 103% 26% 12% -- String length: 10,000,000 Rate pdl f c c_better c2f pdl 10.1/s -- -76% -89% -90% -90% f 41.5/s 311% -- -53% -57% -61% c 88.7/s 778% 114% -- -8% -16% c_better 96.4/s 854% 132% 9% -- -9% c2f 106/s 947% 155% 19% 10% -- String length: 100,000,000 Rate f c c_better c2f f 4.39/s -- -51% -54% -58% c 8.87/s 102% -- -7% -14% c_better 9.55/s 118% 8% -- -8% c2f 10.3/s 136% 17% 8% -- [download] `Max_score_subs.pm`: package Max_score_subs; use strict; use warnings; use feature 'say'; use PDL; use Exporter 'import'; our %EXPORT_TAGS = ( all => [ qw/ mscore_perl mscore_pdl mscore_c mscore_c_better mscore_c2f mscore_f /], ); Exporter::export_ok_tags( 'all' ); sub mscore_perl { my $s = shift; my $max = '0' eq substr $s, 0, 1; $s = substr $s, 1; $max += $s =~ tr/1//; chop $s; my $score = $max; while ( $s =~ /(0)(1)/g ) { $score += length $1; $max = $score if $score > $max; $score -= length $2 } return $max } sub mscore_pdl { my $s = shift; my $count = $s =~ tr/1//; chop $s; $s =~ tr/01/\xff\1/; my $p = zeroes sbyte, length $s; ${ $p-> get_dataref } = $s; $p-> upd_data; $count - $p-> cumusumover -> minover -> sclr } use FindBin; use Inline Config => force_build => 1; use Inline C => Config => libs => "-L$FindBin::Bin -lmscore_f -lgfortran -lquadmath"; use Inline C => << 'END_OF_C'; extern void c2f_( void* n, void* s, void* ret ); extern void f_( void* n, void* s, void* ret ); int mscore_c( SV* str ) { int i, acc, score, max, one; STRLEN len; char* buf = SvPV( str, len ); len --; acc = 0; score = 0; max = -1; for( i = 0; i < len; i ++ ) { one = buf[ i ] - '0'; acc += one; score += one ? -1 : 1; max = score > max ? score : max; } return max + acc + ( buf[ i ] == '1' ); } int mscore_c_better( SV* str ) { int i, acc, tmp, max; STRLEN len; char* buf = SvPVbyte( str, len ); len --; acc = 0; max = -2; for( i = 0; i < len; i ++ ) { acc += buf[ i ] - '0'; tmp = i - 2 * acc; max = tmp > max ? tmp : max; } return max + acc + ( buf[ len ] == '1' ) + 1; } int mscore_c2f( SV* str ) { STRLEN len; char* buf = SvPVbyte( str, len ); int ret; c2f_( &len, buf, &ret ); return ret; } int mscore_f( SV* str ) { STRLEN len; char* buf = SvPVbyte( str, len ); int ret; f_( &len, buf, &ret ); return ret; } END_OF_C 1; [download] `mscore_f.f90:` subroutine c2f( n, s, ret ) integer, intent( in ) :: n integer * 1, intent( in ) :: s( n ) integer, intent( out ) :: ret integer :: i, acc, mx, tmp acc = 0 mx = -2 do i = 1, n - 1 acc = acc + ( s( i ) - iachar( '0', 1 )) tmp = i - 1 - 2 * acc mx = max( mx, tmp ) end do ret = acc + mx + ( s( n ) - iachar( '0', 1 )) + 1 end subroutine c2f subroutine f( n, s, ret ) integer, intent( in ) :: n integer * 1, intent( in ) :: s( n ) integer, intent( out ) :: ret integer, parameter :: A97 = 2 * iachar( '0', 1 ) + 1 ! 97 integer :: i integer * 1 :: a( n ) ! temporary pad integer :: cum( n ) ! cumulative sums ret = count( s == iachar( '1', 1 )) ! tr/1// a = 2 * s - A97 ! tr/01/\xff\1/r a( n ) = 0 ! "chop", kind of cum( 1 ) = a( 1 ) do i = 2, n cum( i ) = cum( i - 1 ) + a( i ) end do ret = ret - minval( cum ) end subroutine f [download] The jitter for fast contestants for very short (1e3) input is perhaps within margin of error; but what about "f" getting slower for 1e6 and above? I can only assume, memory for arrays declared within a subroutine is allocated per call (?? I don't know); this doesn't apply of course to dummy arrays i.e. sub arguments (passed by reference). The scratch-pads, i.e. "a" and "cum", were declared to have not explicit, but "n" size; does it matter anything? Anyway, it then occurred to me, the scratch-pads can be much smaller and re-used many times, both for a single call (i.e. cumulative sums and minimums perhaps to be done in (quite) a few steps, following the consecutive nature of the problem all the same) and between calls, because declared as module variables i.e. the "`mscore_f.f90`" re-written as a module, per modern fashion: module mscore_f implicit none private integer, parameter :: A97 = 2 * iachar( '0', 1 ) + 1 ! 97 integer, parameter :: CHUNK = 25000 integer * 1 :: a( CHUNK ) ! scratch-pad #1 integer :: cum( CHUNK ) ! scratch-pad #2 public :: f, c2f contains subroutine c2f( n, s, ret ) ! skipped, the same code end subroutine c2f subroutine f( n, s, ret ) integer, intent( in ) :: n integer * 1, intent( in ) :: s( n ) integer, intent( out ) :: ret integer :: i, j, m, pre, L integer :: rprev ! "running previous" integer :: rmin ! "running minimum" ret = count( s == iachar( '1', 1 )) m = n / CHUNK if ( mod( n, CHUNK ) .ne. 0 ) then m = m + 1 end if rprev = 0; rmin = 2; do j = 1, m pre = ( j - 1 ) * CHUNK L = min( CHUNK, ubound( s, 1 ) - pre - 1 ) a( 1 : L ) = 2 * s( pre + 1 : pre + L ) - A97 cum( 1 ) = rprev + a( 1 ) do i = 2, L cum( i ) = cum( i - 1 ) + a( i ) end do rmin = min( rmin, minval( cum( 1 : L ))) rprev = cum( L ) end do ret = ret - rmin end subroutine f end module mscore_f [download] And within `Max_score_subs.pm`: `extern void __mscore_f_MOD_c2f( void* n, void* s, void* ret ); extern void __mscore_f_MOD_f( void* n, void* s, void* ret );` [download] Then, with relevant contestants, the results were quite a wow moment for me: v5.42.0 / MSWin32-x64-multi-thread / 13.2.0 String length: 1,000 Rate c_better c2f f c_better 829929/s -- -8% -17% c2f 905443/s 9% -- -10% f 1003704/s 21% 11% -- String length: 10,000 Rate c_better c2f f c_better 94179/s -- -11% -22% c2f 105326/s 12% -- -13% f 120618/s 28% 15% -- String length: 100,000 Rate c_better c2f f c_better 9683/s -- -12% -22% c2f 10983/s 13% -- -12% f 12418/s 28% 13% -- String length: 1,000,000 Rate c_better c2f f c_better 969/s -- -10% -22% c2f 1082/s 12% -- -13% f 1246/s 28% 15% -- String length: 10,000,000 Rate c_better c2f f c_better 95.1/s -- -9% -20% c2f 104/s 10% -- -12% f 119/s 25% 14% -- String length: 100,000,000 Rate c_better c2f f c_better 9.55/s -- -9% -18% c2f 10.5/s 10% -- -10% f 11.6/s 22% 11% -- [download]	[reply] [d/l] [select]
Re^2: Weird performance issue with Strawberries and Inline::C by Anonymous Monk on Oct 27, 2025 at 11:46 UTC
I can only assume, memory for arrays declared within a subroutine is allocated per call (?? I don't know); this doesn't apply of course to dummy arrays i.e. sub arguments (passed by reference). The scratch-pads, i.e. "a" and "cum", were declared to have not explicit, but "n" size; does it matter anything? At least it's easy to observe that memory is returned to OS upon return from a subroutine, where VLA (variable length array) is declared (and used) -- _if_ array gets large enough such as ~130K 64-bit integers. And _not_ returned to OS for smaller arrays. Don't know if/when other factors may be at play; and I doubt this factoid has any value. But it explains why the original "f" subroutine had been slower for 1e6 and longer input. Let's aim higher than ~40% gain through tweaks and compiler voodoo -- let's get parallel! use strict; use warnings; use feature 'say'; use Config; use Benchmark qw/ timeit :hireswallclock /; use lib '.'; use Max_score_subs ':all'; say "$^V / $Config{ archname } / $Config{ gccversion }"; my $str; my %tests = ( # C: c => sub { mscore_c( $str )}, # - dumb c_better => sub { mscore_c_better( $str )}, # - tweaked c_omp => sub { mscore_c_omp( $str )}, # - "c", parallel # # Fortran: c2f => sub { mscore_c2f( $str )}, # - "c_better" (in F) f => sub { mscore_f( $str )}, # - "PDL" (in F) f_omp => sub { mscore_f_omp( $str )}, # - "f", parallel ); my $iters = 2e5; my $fmt = '%8.0f'; for my $L ( 1e4, 1e5, 1e6, 1e7, 1e8 ) { say "\nString length: " . $L =~ s/(\d)(?=(\d{3})+$)/$1,/gr; $str = '1' x $L; substr $str, rand $L, 1 , '0' for 1 .. $L; my $any; my %ret; for my $key ( keys %tests ) { my $result = $tests{ $key }-> (); die "<<< $key!!! >>>" unless $result == ( $any //= $result ); my $t = timeit( $iters, $tests{ $key }); $ret{ $key } = $t-> iters / $t-> real } print " Rate/s %\n"; $fmt = '%8.1f' if $L > 1e6; for my $key ( sort { $ret{ $a } <=> $ret{ $b }} keys %ret ) { printf " %-9s $fmt %5.0f\n", $key, $ret{ $key }, 100 * $ret{ $key } / $ret{ c } } $iters /= 10 } __END__ v5.42.0 / MSWin32-x64-multi-thread / 13.2.0 String length: 10,000 Rate/s % f_omp 23804 28 c_omp 25362 29 c 86100 100 c_better 91470 106 c2f 105239 122 f 120134 140 String length: 100,000 Rate/s % c 8724 100 c_better 9324 107 c2f 10686 122 f 12139 139 c_omp 15369 176 f_omp 15964 183 String length: 1,000,000 Rate/s % c 875 100 c_better 931 106 c2f 1068 122 f 1213 139 c_omp 2306 264 f_omp 2560 293 String length: 10,000,000 Rate/s % c 86.8 100 c_better 92.2 106 c2f 104.5 120 f 115.6 133 c_omp 349.6 403 f_omp 426.9 492 String length: 100,000,000 Rate/s % c 8.6 100 c_better 9.2 106 c2f 10.4 120 f 11.4 131 c_omp 34.7 401 f_omp 42.4 491 [download] What a pity, I only have 4 cores and can't achieve more than 4x increase, would be fun to watch. In hindsight, the improved "c_better" and its translation "c2f" were a trap. With their loops, it's not obvious how to split them in independent parts. OTOH, for the "dumb" C version it's not too difficult to split: int mscore_c_omp( SV* str ) { STRLEN len; char* buf = SvPVbyte( str, len ); len --; int nparts = omp_get_num_procs(); if ( nparts > 4 ) nparts = 4; int psize = len / nparts; int acc_a[ nparts ]; int cum_a[ nparts ]; int max_a[ nparts ]; #pragma omp parallel for schedule( static, 1 ) \ num_threads( nparts ) for ( int j = 0; j < nparts; j ++ ) { int lo = j * psize; int hi = ( j == nparts - 1 ) ? len : lo + psize; int acc = 0; int cum = 0; int max = -1; for ( int i = lo; i < hi; i ++ ) { int one = buf[ i ] - '0'; acc += one; cum += one ? -1 : 1; if ( cum > max ) max = cum; } acc_a[ j ] = acc; cum_a[ j ] = cum; max_a[ j ] = max; } int acc = acc_a[ 0 ]; int max = max_a[ 0 ]; for ( int j = 1; j < nparts; j ++ ) { acc += acc_a[ j ]; cum_a[ j ] += cum_a[ j - 1 ]; max_a[ j ] += cum_a[ j - 1 ]; if ( max_a[ j ] > max ) max = max_a[ j ]; } return acc + max + ( buf[ len ] == '1' ); } [download] In my tests, a string length of ~70k is a cut-off value where OMP versions begin to outrun the original versions. And it's better to just switch to originals rather than adding "if" directive to omp pragma, the above function is somewhat cumbersome and slow. Speaking about cumbersome, the Fortran OMP-enabled sub is a long way from original (elegant) PDL version. At least there remains ~25% gain over "c_omp", through the best of both worlds: parallel with OMP, and compiler voodoo with vectorization. subroutine f_omp( n, s, ret ) integer, intent( in ) :: n integer * 1, intent( in ) :: s( n ) integer, intent( out ) :: ret integer :: j, nparts, psize integer, allocatable :: acc_a ( : ), cum_a( : ), min_a( : ) nparts = omp_get_num_procs() if ( nparts > 4 ) nparts = 4 psize = n / nparts; allocate( acc_a( nparts )) allocate( cum_a( nparts )) allocate( min_a( nparts )) !$omp parallel do schedule( static, 1 ) num_threads( nparts ) do j = 1, nparts ; block integer * 1, allocatable :: a( : ) integer, allocatable :: cum( : ) integer :: lo, hi, d, nchunks integer :: pre, L, i, k integer :: acc_, cum_, min_ lo = ( j - 1 ) * psize if ( j == nparts ) then hi = n - 1 else hi = j * psize end if d = hi - lo nchunks = d / CHUNK if ( mod( d, CHUNK ) .ne. 0 ) nchunks = nchunks + 1 acc_ = 0; cum_ = 0; min_ = 2; allocate( a( CHUNK )) allocate( cum( CHUNK )) do k = 1, nchunks pre = lo + ( k - 1 ) * CHUNK L = min( CHUNK, hi - pre ) a( 1 : L ) = 2 * s( pre + 1 : pre + L ) - A97 cum( 1 ) = cum_ + a( 1 ) do i = 2, L cum( i ) = cum( i - 1 ) + a( i ) end do acc_ = acc_ + count( s( pre + 1 : pre + L ) & == iachar( '1', 1 )) cum_ = cum( L ) min_ = min( min_, minval( cum( 1 : L ))) end do acc_a( j ) = acc_ cum_a( j ) = cum_ min_a( j ) = min_ end block ; end do !$omp end parallel do do j = 2, nparts cum_a( j ) = cum_a( j ) + cum_a( j - 1 ) min_a( j ) = min_a( j ) + cum_a( j - 1 ) end do ret = sum( acc_a ) + ( s( n ) - iachar( '0', 1 )) & - minval( min_a ) end subroutine f_omp [download]	[reply] [d/l] [select]
Re^3: Weird performance issue with Strawberries and Inline::C by Anonymous Monk on Nov 06, 2025 at 11:39 UTC
(OP here again, with long-forgotten PWC 342-2, exploring depths in muddy waters around 4-5 lines of trivial C) What a pity, I only have 4 cores and can't achieve more than 4x increase, would be fun to watch Fun should never be denied: `v5.42.0 / MSWin32-x64-multi-thread / 13.2.0 String length: 10,000 Rate/s % c 87663 100 va 622005 710 String length: 100,000 Rate/s % c 8875 100 c_omp 15439 174 va 71646 807 String length: 1,000,000 Rate/s % c 891 100 c_omp 2307 259 va 12770 1434 String length: 10,000,000 Rate/s % c 88.3 100 c_omp 361.2 409 va 1615.6 1829 String length: 100,000,000 Rate/s % c 8.8 100 c_omp 36.2 410 va 167.7 1900` [download] 'c' and 'c_omp' are for '`mscore_c`' (straightforward i.e. "dumb") and '`mscore_c_omp`' found in this thread. The 'va' stands for 'vectors, assembly'; where OMP (I have 4 cores) is "automatically" switched on for strings longer than 4e5. CPU is supposed to support AVX2 i.e. to be 2017+. There's a "cheat": sequential runs of either '0' or '1's, in target string, shouldn't be longer than 2*31. Because current (running) scores are maintained (dragged along) as 8 per 256-bit registers. Perhaps there is more optimal way to find (horizontal) cumulative sums along 16 bytes, other than 4 shifts/shuffles and 4 additions. What's really funny is that to find (rather, maintain) running maximum over 32 numbers I'm only doing 3 comparisons. But maybe all of the above (i.e. below) looks not funny but ugly to "people who know how". use Inline C => Config => ccflagsex => '-fopenmp', libs => '-lgomp -ldl'; use Inline C => << 'END_OF_C'; #include <omp.h> #define MANY 400000 #define MAX_CORES 4 void _helper_a( char buf, size_t len, int32_t* acc, int32_t* cum, int32_t* max ); void _helper_c( char* buf, size_t len, int32_t* acc, int32_t* cum, int32_t* max ); int mscore_va( SV* str ) { STRLEN len; char* buf = SvPVbyte( str, len ); int ncores = 1; if ( len >= MANY ) { int n = omp_get_num_procs(); ncores = n > MAX_CORES ? MAX_CORES : n; } int nparts = 2 + ncores; int32_t acc_a[ nparts ]; int32_t cum_a[ nparts ]; int32_t max_a[ nparts ]; memset( acc_a, 0x00, sizeof( acc_a )); memset( cum_a, 0x00, sizeof( cum_a )); memset( max_a, 0xFF, sizeof( max_a )); len --; int32_t prefix_len = ( size_t ) buf % 32; if ( len < prefix_len ) prefix_len = len; int32_t suffix_len = ( len - prefix_len ) % 32; int32_t aligned_len = len - prefix_len - suffix_len; char* prefix_start = buf; char* aligned_start = buf + prefix_len; char* suffix_start = aligned_start + aligned_len; if ( prefix_len > 0 ) { _helper_c( prefix_start, prefix_len, acc_a, cum_a, max_a ); } if ( suffix_len > 0 ) { _helper_c( suffix_start, suffix_len, &( acc_a[ nparts - 1 ]), &( cum_a[ nparts - 1 ]), &( max_a[ nparts - 1 ])); } if ( aligned_len > 0 ) { if ( ncores == 1 ) { _helper_a( aligned_start, aligned_len, &( acc_a[ 1 ]), &( cum_a[ 1 ]), &( max_a[ 1 ])); } else { size_t psize = len / 32 / ncores * 32; #pragma omp parallel for schedule( static, 1 ) \ num_threads( ncores ) for ( int j = 0; j < ncores; j ++ ) { char* start = aligned_start + j * psize; size_t len_ = ( j < ncores - 1 ) ? psize : aligned_len + - j * psize; _helper_a( start, len_, &( acc_a[ j + 1 ]), &( cum_a[ j + 1 ]), &( max_a[ j + 1 ])); } } } int32_t acc = acc_a[ 0 ]; int32_t max = max_a[ 0 ]; for ( int j = 1; j < nparts; j ++ ) { acc += acc_a[ j ]; cum_a[ j ] += cum_a[ j - 1 ]; max_a[ j ] += cum_a[ j - 1 ]; if ( max_a[ j ] > max ) max = max_a[ j ]; } return acc + max + ( buf[ len ] == '1' ); } void _helper_c( char* buf, size_t len, // in int32_t* acc, int32_t* cum, int32_t* max ) { // out for( int i = 0; i < len; i ++ ) { int one = buf[ i ] - '0'; acc += one; cum += one ? -1 : 1; if ( cum > max ) max = cum; } } void _helper_a( char* buf, size_t len, // in int32_t* acc, int32_t* cum, int32_t* max ) { // out void* fin = buf + len; int32_t acc_, cum_, max_; __asm__ ( // " IDDQD \n" " XOR %[acc], %[acc] \n" " MOV $0x00, %%r11 \n" " MOVQ %%r11, %%xmm0 \n" " VPBROADCASTB %%xmm0, %%ymm0 \n" // ymm0 = cum " MOV $0xFF, %%r11 \n" " MOVQ %%r11, %%xmm1 \n" " VPBROADCASTB %%xmm1, %%ymm1 \n" // ymm1 = max " MOV $0x61, %%r11 \n" " MOVQ %%r11, %%xmm2 \n" " VPBROADCASTB %%xmm2, %%ymm2 \n" // ymm2 = 97 x 32 " MOV $0x0F, %%r11 \n" " MOVQ %%r11, %%xmm6 \n" " VPBROADCASTB %%xmm6, %%ymm6 \n" // ymm6 = 15 x 32 " MOV %[buf], %%r8 \n" " MOV %[fin], %%r9 \n" " MOV $32, %%r10 \n" ".L2_: \n" " VMOVDQA (%%r8), %%ymm3 \n" " VPADDB %%ymm3, %%ymm3, %%ymm3 \n" " VPSUBB %%ymm3, %%ymm2, %%ymm3 \n" " VPMOVMSKB %%ymm3, %%ecx \n" " POPCNT %%ecx, %%ecx \n" " ADD %%ecx, %[acc] \n" " VPSLLDQ $1, %%ymm3, %%ymm4 \n" " VPADDSB %%ymm3, %%ymm4, %%ymm3 \n" " VPSLLDQ $2, %%ymm3, %%ymm4 \n" " VPADDSB %%ymm3, %%ymm4, %%ymm3 \n" " VPSLLDQ $4, %%ymm3, %%ymm4 \n" " VPADDSB %%ymm3, %%ymm4, %%ymm3 \n" " VPSLLDQ $8, %%ymm3, %%ymm4 \n" " VPADDSB %%ymm3, %%ymm4, %%ymm3 \n" " VPSHUFB %%ymm6, %%ymm3, %%ymm4 \n" " VPSHUFD $0b01001110, %%ymm3, %%ymm5 \n" " VPMAXSB %%ymm3, %%ymm5, %%ymm3 \n" " VPMOVSXBD %%xmm3, %%ymm5 \n" " VPADDD %%ymm0, %%ymm5, %%ymm5 \n" " VPMAXSD %%ymm1, %%ymm5, %%ymm1 \n" " VPMOVSXBD %%xmm4, %%ymm5 \n" " VPADDD %%ymm0, %%ymm5, %%ymm0 \n" " VPERMQ $0b01001110, %%ymm3, %%ymm3 \n" " VPERMQ $0b01001110, %%ymm4, %%ymm4 \n" " VPMOVSXBD %%xmm3, %%ymm5 \n" " VPADDD %%ymm0, %%ymm5, %%ymm5 \n" " VPMAXSD %%ymm1, %%ymm5, %%ymm1 \n" " VPMOVSXBD %%xmm4, %%ymm5 \n" " VPADDD %%ymm0, %%ymm5, %%ymm0 \n" " ADD %%r10, %%r8 \n" " CMP %%r9, %%r8 \n" " JL .L2_ \n" " VPERMQ $0b00111001, %%ymm1, %%ymm4 \n" " VPMAXSD %%ymm1, %%ymm4, %%ymm1 \n" " VPERMQ $0b01001110, %%ymm1, %%ymm4 \n" " VPMAXSD %%ymm1, %%ymm4, %%ymm1 \n" " PSHUFD $0b00111001, %%xmm1, %%xmm4 \n" " PMAXSD %%xmm4, %%xmm1 \n" " MOVD %%xmm1, %[max] \n" " MOVD %%xmm0, %[cum] \n" : [acc] "=r" ( acc_ ), [cum] "=rm" ( cum_ ), [max] "=rm" ( max_ ) : [buf] "m" ( buf ), [fin] "m" ( fin ) : "cc", "rcx", "r8", "r9", "r10", "r11", "ymm0", "ymm1", "ymm2", "ymm3", "ymm4", "ymm5", "ymm6" ); acc = acc_; cum = cum_; *max = max_; } END_OF_C [download]	[reply] [d/l] [select]
Re^4: Weird performance issue with Strawberries and Inline::C by Anonymous Monk on Nov 09, 2025 at 17:41 UTC
Re^5: Weird performance issue with Strawberries and Inline::C by Anonymous Monk on Nov 12, 2025 at 13:53 UTC
Re^5: Weird performance issue with Strawberries and Inline::C by Anonymous Monk on Nov 17, 2025 at 14:54 UTC
Re: Weird performance issue with Strawberries and Inline::C by Anonymous Monk on Oct 16, 2025 at 08:32 UTC
How about this: use strict; use warnings; use feature 'say'; use Benchmark 'cmpthese'; use constant L => 1e5; my @cases; my $str = '1' x L; push @cases, $str; substr $str, rand L, 1 , '0' for 1 .. L; push @cases, $str; push @cases, '01' x ( L / 2 ); say $^V; for my $s ( @cases ) { cmpthese -2, { count => sub { $s =~ tr/1// }, } } __END__ v5.32.1 Rate count count 21711/s -- Rate count count 21607/s -- Rate count count 21348/s -- v5.42.0 Rate count count 16271/s -- Rate count count 2422/s -- Rate count count 8136/s -- [download] String composition affects as trivial operation as counting. Looks like someone not heeded the advice to "write like C3, not C4" and relied on compiler voodoo, when writing Perl itself, not silly solution for PWC. Though of course it may be something else rather than innocent "if" with its jumps. Maybe situation can't be shrugged off after all.	[reply] [d/l]
Re^2: Weird performance issue with Strawberries and Inline::C by ikegami (Patriarch) on Oct 16, 2025 at 15:00 UTC
What's with the rand? Your results are meaningless. The results from the different versions can't be compared to each other, and the results from one version can't be compared to each other. Finally, none of your test cases uses OOK scalars, and the whole question is about OOK scalars.	[reply]

I had a longer answer, but ran it past Claude.AI and it knew a lot more about the topic than I did and my answer would have been wrong. Of course this could be inaccurate/wrong too, but it looks more accurate than what I was going to say. At the risk of offending people, here's is a portion of its answer:

On x86-64, there isn't a direct conditional arithmetic instruction like you describe, but there are several approaches that can eliminate branches:
CMOVcc (Conditional Move) - The most common approach
; Assuming condition result is in a flag
mov eax, i
lea edx, [rax + 1]    ; edx = i + 1
cmovnz eax, edx       ; if condition != 0, eax = edx
mov i, eax
[download]
or more directly:
lea edx, [i + 1]
test condition, condition
cmovnz i, edx         ; conditionally move the incremented value
[download]
SETcc (Set Byte on Condition) - Convert condition to 0/1
xor eax, eax
test condition, condition
setnz al              ; al = (condition != 0) ? 1 : 0
add i, eax            ; i += 0 or 1
[download]
This is often the most efficient for your exact use case - it turns the condition into a 0 or 1, then unconditionally adds it.
ADC/SBB (Add/Subtract with Carry) - If condition is in carry flag
; If your condition sets the carry flag:
adc i, 0              ; i += carry flag
[download]
The SETcc + ADD pattern is typically what compilers generate for branchless if (cond) i++, and it's usually faster than a branch for unpredictable conditions.

[reply]
[d/l]
[select]

Strawberryperl 5.32 used gcc 8.3.0 while 5.42.0.1 uses gcc 13.2.0 which probably accounts for the difference between versions.

From looking at godbolt gcc 8.3.0 optimizes that branch into the conditional expression while 13.2.0 doesn't

[reply]

From looking at godbolt gcc 8.3.0 optimizes that branch into the conditional expression while 13.2.0 doesn't

v5.42.0
String length: 100000
      Rate  c4  c3
c4 13558/s  -- -8%
c3 14740/s  9%  --
[download]

use Inline C => << 'END_OF_C';

use Inline C => Config =>
 # Force recompilation
 FORCE_BUILD => 1,

 # Set CC to to path to gcc.exe version 8.3.0
 CC => 'C:/sp/_64/sp-5.32.0/c/bin/gcc.exe',

 # View build output
 BUILD_NOISY => 1,
;
[download]

v5.42.0
String length: 100000
      Rate  c3  c4
c3 14460/s  -- -0%
c4 14531/s  0%  --
[download]

v5.42.0
String length: 100000
      Rate   c4   c3
c4  3528/s   -- -76%
c3 14777/s 319%   --
[download]

UPDATE:

try2_pl_62b5.c: loadable library and perl binaries are mismatched (got
+ first handshake key 0000000012e00080, needed 0000000012d00080)
[download]

~~Looks like us hackers have now been deprived of yet another liberty.~~

[reply]
[d/l]
[select]

Wow that's a cool website! Never seen that before.

[reply]

You explained that c4 is slower than c3, but it wasn't in 5.32. The question isn't why c4 is slow; it's why it's slow now and not before.

[reply]

I was just making the point that C4 is naturally expected to be slower, to emphasize that you should write it like C3 if you don't want to rely on compiler optimization voodoo. It's nice when compiler optimizations work, but better to not rely on them. And yes I didn't do any of the effort to track down which optimization was lost and whether Perl's tooling was responsible.

[reply]

Thank you, all. FWIW, replacing the starred line with "acc += buf[ i ] - '0';" gives 10% boost over C3 in both 5.32 and 5.42.

[reply]
[d/l]

Show the command used to compile these. The noisy or info option should do that.

use Inline qw( force noisy info );
[download]

perl -MInline=force,noisy,info ...
[download]

Additionally, the output of the following is possibly pertinent:

perl -V:"cc(flags)?"
[download]

(That's an uppercase "V".)

[reply]
[d/l]
[select]

C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL>perl -V:"cc(flags)?"
cc='gcc';
ccflags=' -DWIN32 -DWIN64 -D__USE_MINGW_ANSI_STDIO -DPERL_TEXTMODE_SCR
+IPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv
+ -fno-strict-aliasing -mms-bitfields';

C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL>perl -MInline=force,no
+isy,info test_inline.pl
validate Stage
<-----------------------Information Section---------------------------
+-------->

Information about the processing of your Inline C code:

Your source code needs to be compiled. I'll use this build directory:
C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\_Inline\build\test_inl
+ine_pl_62b5

and I'll install the executable as:
C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\_Inline\lib\auto\test_
+inline_pl_62b5\test_inline_pl_62b5.xs.dll

get_maps Stage
The following Inline C function(s) have been successfully bound to Per
+l:
        int test_c_3(SV * str)
        int test_c_4(SV * str)

<-----------------------End of Information Section--------------------
+-------->
Starting Build Preprocess Stage
Finished Build Preprocess Stage

Starting Build Parse Stage
Finished Build Parse Stage

Starting Build Glue 1 Stage
Finished Build Glue 1 Stage

Starting Build Glue 2 Stage
Finished Build Glue 2 Stage

Starting Build Glue 3 Stage
Finished Build Glue 3 Stage

Starting Build Compile Stage
  Starting "perl Makefile.PL" Stage
Generating a gmake-style Makefile
Writing Makefile for test_inline_pl_62b5
Writing MYMETA.yml and MYMETA.json
  Finished "perl Makefile.PL" Stage

  Starting "make" Stage
Running Mkbootstrap for test_inline_pl_62b5 ()
"C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Command -e chmod -- 644 "test_inline_pl_62b5.bs"
"C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Command::MM -e cp_nonempty -- test_inline_pl_62b5.bs blib\a
+rch\auto\test_inline_pl_62b5\test_inline_pl_62b5.bs 644
"C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" "C
+:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\lib\ExtUtils/xsub
+pp"  -typemap "C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\l
+ib\ExtUtils\typemap"   test_inline_pl_62b5.xs > test_inline_pl_62b5.x
+sc
"C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Command -e mv -- test_inline_pl_62b5.xsc test_inline_pl_62b
+5.c
gcc -c  -iquote"C:/berrybrew/strawberry-perl-5.32.1.1-64bit-PDL" -DWIN
+32 -DWIN64 -D__USE_MINGW_ANSI_STDIO -DPERL_TEXTMODE_SCRIPTS -DPERL_IM
+PLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-a
+liasing -mms-bitfields -s -O2   -DVERSION=\"0.00\" -DXS_VERSION=\"0.0
+0\"  "-IC:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\lib\CORE
+"   test_inline_pl_62b5.c
"C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Mksymlists \
     -e "Mksymlists('NAME'=>\"test_inline_pl_62b5\", 'DLBASE' => 'test
+_inline_pl_62b5', 'DL_FUNCS' => {  }, 'FUNCLIST' => [], 'IMPORTS' => 
+{  }, 'DL_VARS' => []);"
g++.exe test_inline_pl_62b5.def -o blib\arch\auto\test_inline_pl_62b5\
+test_inline_pl_62b5.xs.dll -mdll -s -L"C:\berrybrew\strawberry-perl-5
+.32.1.1-64bit-PDL\perl\lib\CORE" -L"C:\berrybrew\strawberry-perl-5.32
+.1.1-64bit-PDL\c\lib" test_inline_pl_62b5.o   "C:\berrybrew\strawberr
+y-perl-5.32.1.1-64bit-PDL\perl\lib\CORE\libperl532.a" -lmoldname -lke
+rnel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -l
+ole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -
+lodbc32 -lodbccp32 -lcomctl32 -Wl,--enable-auto-image-base
"C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Command -e chmod -- 755 blib\arch\auto\test_inline_pl_62b5\
+test_inline_pl_62b5.xs.dll
  Finished "make" Stage

  Starting "make install" Stage
"C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Command::MM -e cp_nonempty -- test_inline_pl_62b5.bs blib\a
+rch\auto\test_inline_pl_62b5\test_inline_pl_62b5.bs 644
Files found in blib\arch: installing files in blib\lib into architectu
+re dependent library tree
Installing C:\berrybrew\strawberry-perl-5.32.1.1-64bit-PDL\_Inline\lib
+\auto\test_inline_pl_62b5\test_inline_pl_62b5.xs.dll
  Finished "make install" Stage

  Starting Cleaning Up Stage
  Finished Cleaning Up Stage

Finished Build Compile Stage

v5.32.1
String length: 100000
     Rate  c3  c4
c3 8706/s  -- -0%
c4 8706/s  0%  --
[download]

...

C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL>perl -V:"cc(flags)?"
cc='gcc';
ccflags='-std=c99 -DWIN32 -DWIN64 -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICI
+TY -DPERL_IMPLICIT_SYS -DUSE_PERLIO -D__USE_MINGW_ANSI_STDIO -fwrapv 
+-fno-strict-aliasing -mms-bitfields';

C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL>perl -MInline=force,no
+isy,info test_inline.pl
validate Stage
<-----------------------Information Section---------------------------
+-------->

Information about the processing of your Inline C code:

Your source code needs to be compiled. I'll use this build directory:
C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\_Inline\build\test_inl
+ine_pl_62b5

and I'll install the executable as:
C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\_Inline\lib\auto\test_
+inline_pl_62b5\test_inline_pl_62b5.xs.dll

get_maps Stage
The following Inline C function(s) have been successfully bound to Per
+l:
        int test_c_3(SV * str)
        int test_c_4(SV * str)

<-----------------------End of Information Section--------------------
+-------->
Starting Build Preprocess Stage
Finished Build Preprocess Stage

Starting Build Parse Stage
Finished Build Parse Stage

Starting Build Glue 1 Stage
Finished Build Glue 1 Stage

Starting Build Glue 2 Stage
Finished Build Glue 2 Stage

Starting Build Glue 3 Stage
Finished Build Glue 3 Stage

Starting Build Compile Stage
  Starting "perl Makefile.PL" Stage
Generating a gmake-style Makefile
Writing Makefile for test_inline_pl_62b5
Writing MYMETA.yml and MYMETA.json
  Finished "perl Makefile.PL" Stage

  Starting "make" Stage
Running Mkbootstrap for test_inline_pl_62b5 ()
"C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Command -e chmod -- 644 "test_inline_pl_62b5.bs"
"C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Command::MM -e cp_nonempty -- test_inline_pl_62b5.bs blib\a
+rch\auto\test_inline_pl_62b5\test_inline_pl_62b5.bs 644
"C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" "C
+:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\lib\ExtUtils/xsub
+pp"  -typemap "C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\l
+ib\ExtUtils\typemap"   test_inline_pl_62b5.xs > test_inline_pl_62b5.x
+sc
"C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Command -e mv -- test_inline_pl_62b5.xsc test_inline_pl_62b
+5.c
gcc -c  -iquote"C:/berrybrew/strawberry-perl-5.42.0.1-64bit-PDL" -std=
+c99 -DWIN32 -DWIN64 -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMP
+LICIT_SYS -DUSE_PERLIO -D__USE_MINGW_ANSI_STDIO -fwrapv -fno-strict-a
+liasing -mms-bitfields -O2   -DVERSION=\"0.00\" -DXS_VERSION=\"0.00\"
+  "-IC:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\lib\CORE"  
+ test_inline_pl_62b5.c
"C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Mksymlists \
     -e "Mksymlists('NAME'=>\"test_inline_pl_62b5\", 'DLBASE' => 'test
+_inline_pl_62b5', 'DL_FUNCS' => {  }, 'FUNCLIST' => [], 'IMPORTS' => 
+{  }, 'DL_VARS' => []);"
g++.exe test_inline_pl_62b5.def -o blib\arch\auto\test_inline_pl_62b5\
+test_inline_pl_62b5.xs.dll -mdll -s -L"C:\berrybrew\strawberry-perl-5
+.42.0.1-64bit-PDL\perl\lib\CORE" -L"C:\berrybrew\strawberry-perl-5.42
+.0.1-64bit-PDL\c\lib" test_inline_pl_62b5.o   "C:\berrybrew\strawberr
+y-perl-5.42.0.1-64bit-PDL\perl\lib\CORE\libperl542.a" -lmoldname -lke
+rnel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -l
+ole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -
+lodbc32 -lodbccp32 -lcomctl32 -Wl,--enable-auto-image-base
"C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Command -e chmod -- 755 blib\arch\auto\test_inline_pl_62b5\
+test_inline_pl_62b5.xs.dll
  Finished "make" Stage

  Starting "make install" Stage
"C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\perl\bin\perl.exe" -M
+ExtUtils::Command::MM -e cp_nonempty -- test_inline_pl_62b5.bs blib\a
+rch\auto\test_inline_pl_62b5\test_inline_pl_62b5.bs 644
Files found in blib\arch: installing files in blib\lib into architectu
+re dependent library tree
Installing C:\berrybrew\strawberry-perl-5.42.0.1-64bit-PDL\_Inline\lib
+\auto\test_inline_pl_62b5\test_inline_pl_62b5.xs.dll
  Finished "make install" Stage

  Starting Cleaning Up Stage
  Finished Cleaning Up Stage

Finished Build Compile Stage

v5.42.0
String length: 100000
     Rate   c4   c3
c4 2456/s   -- -72%
c3 8706/s 254%   --
[download]

[reply]
[d/l]
[select]

I see -O2 in both cases. It would be interesting to see the disassembly, but I'm not sure how to get that.

[reply]
[d/l]

Here's more on compiler voodoo, though it involves a very trivial problem and primitive code; I either publish it (it's Perl related after all) or just throw it away. The PWC-342-2 was all about finding a maximum (an extremum, depends on POV) over cumulative sums. This is camouflaged, in my pure-Perl solution (below) by use of a regex; which is fastest (?) (over PWC GH directory) and readable OK. At the same time I wrote a PDL solution (rather, Perl/PDL combo); which is of course faster and no less (?) readable.

Then (2 weeks ago?), because confused about OOK and "4-args-substr" (not used below; and the confusion cleared now, thanks to the answers to another SOPW question), I wrote a "straightforward" C solution, followed by a tweaked (through simple arithmetic) -- "c_better" -- solution.

Further yet, I tried another -- Fortran -- language solution, pluggable directly into Perl's Benchmark harness thanks to Inline::C::Cookbook recipe. The c2f subroutine is almost exact, line-by-line translation of c_better to Fortran, and yet it's _faster_, don't know why, by the same margin regardless of -O2 or -O3 used to compile it, and despite of a C wrapper used to call it. (But, for the record, the "f" subroutine(s) require -O3 to be the fastest.)

Then, inspired by this success (?), I tried the direct, albeit perhaps mechanical, translation of PDL solution to Fortran. The latter doesn't have the cumulative sums intrinsic, so this one I had to do as an explicit loop. Unlike my C and "c2f", it performs several passes over input data (create a mask, count, transliterate, do cum sums, find a minimum) and stores intermediate arrays. However, it's _fastest_, for 1e4 and 1e5 input.

use strict;
use warnings;
use feature 'say';
use Config;
use Benchmark 'cmpthese';

use lib '.';
use Max_score_subs ':all';

say "$^V / $Config{ archname } / $Config{ gccversion }";

my $str;
my %tests = (
    perl     => sub { mscore_perl( $str )},
    pdl      => sub { mscore_pdl( $str )},
    c        => sub { mscore_c( $str )},
    c_better => sub { mscore_c_better( $str )},
    c2f      => sub { mscore_c2f( $str )},
    f        => sub { mscore_f( $str )},
);

for my $L ( 1e3, 1e4, 1e5, 1e6, 1e7, 1e8 ) {
    say "\nString length: " . $L =~ s/(\d)(?=(\d{3})+$)/$1,/gr;
    
    $str = '1' x $L;
    substr $str, rand $L, 1 , '0' for 1 .. $L;
    
    delete $tests{ perl } if $L > 1e6;
    delete $tests{ pdl }  if $L > 1e7;
    
    cmpthese -1, \%tests;
}

__END__


v5.42.0 / MSWin32-x64-multi-thread / 13.2.0

String length: 1,000
             Rate     perl      pdl        c c_better        f      c2
+f
perl       6637/s       --     -90%     -99%     -99%     -99%     -99
+%
pdl       67389/s     915%       --     -91%     -92%     -92%     -93
+%
c        764229/s   11415%    1034%       --      -8%     -13%     -16
+%
c_better 830711/s   12416%    1133%       9%       --      -5%      -9
+%
f        876596/s   13108%    1201%      15%       6%       --      -4
+%
c2f      911910/s   13640%    1253%      19%      10%       4%       -
+-

String length: 10,000
             Rate     perl      pdl        c c_better      c2f        
+f
perl        635/s       --     -94%     -99%     -99%     -99%     -99
+%
pdl       11113/s    1651%       --     -87%     -88%     -89%     -90
+%
c         86164/s   13476%     675%       --      -9%     -18%     -26
+%
c_better  94730/s   14826%     752%      10%       --      -9%     -19
+%
c2f      104655/s   16390%     842%      21%      10%       --     -10
+%
f        116531/s   18261%     949%      35%      23%      11%       -
+-

String length: 100,000
            Rate     perl      pdl        c c_better      c2f        f
perl      63.1/s       --     -95%     -99%     -99%     -99%     -99%
pdl       1161/s    1741%       --     -87%     -88%     -89%     -91%
c         8934/s   14061%     669%       --      -8%     -18%     -28%
c_better  9692/s   15262%     734%       8%       --     -11%     -22%
c2f      10925/s   17216%     841%      22%      13%       --     -12%
f        12350/s   19475%     963%      38%      27%      13%       --

String length: 1,000,000
           Rate     perl      pdl        f        c c_better      c2f
perl     6.31/s       --     -94%     -99%     -99%     -99%     -99%
pdl       104/s    1554%       --     -80%     -88%     -89%     -90%
f         535/s    8370%     412%       --     -38%     -45%     -51%
c         865/s   13600%     728%      62%       --     -11%     -20%
c_better  968/s   15243%     828%      81%      12%       --     -11%
c2f      1085/s   17095%     940%     103%      26%      12%       --

String length: 10,000,000
           Rate      pdl        f        c c_better      c2f
pdl      10.1/s       --     -76%     -89%     -90%     -90%
f        41.5/s     311%       --     -53%     -57%     -61%
c        88.7/s     778%     114%       --      -8%     -16%
c_better 96.4/s     854%     132%       9%       --      -9%
c2f       106/s     947%     155%      19%      10%       --

String length: 100,000,000
           Rate        f        c c_better      c2f
f        4.39/s       --     -51%     -54%     -58%
c        8.87/s     102%       --      -7%     -14%
c_better 9.55/s     118%       8%       --      -8%
c2f      10.3/s     136%      17%       8%       --
[download]

Max_score_subs.pm:

package Max_score_subs;

use strict;
use warnings;
use feature 'say';
use PDL;

use Exporter 'import';

our %EXPORT_TAGS = ( 
    all => [ qw/
        mscore_perl
        mscore_pdl
        mscore_c
        mscore_c_better
        mscore_c2f
        mscore_f
    /],
);

Exporter::export_ok_tags( 'all' );

sub mscore_perl {
    my $s = shift;
    
    my $max = '0' eq substr $s, 0, 1;
    $s = substr $s, 1;
    $max += $s =~ tr/1//;
    chop $s;
    
    my $score = $max;
    while ( $s =~ /(0*)(1*)/g ) {
        $score += length $1;
        $max = $score if $score > $max;
        $score -= length $2
    }
    return $max
}

sub mscore_pdl {
    my $s = shift;
    
    my $count = $s =~ tr/1//; 
    chop $s;
    $s =~ tr/01/\xff\1/;

    my $p = zeroes sbyte, length $s;
    ${ $p-> get_dataref } = $s;
    $p-> upd_data;
    
    $count - $p-> cumusumover
               -> minover
               -> sclr
}

use FindBin;
use Inline Config => force_build => 1;
use Inline C => Config =>
    libs => "-L$FindBin::Bin -lmscore_f -lgfortran -lquadmath";
use Inline C => << 'END_OF_C';

extern void c2f_( void* n, void* s, void* ret );
extern void f_( void* n, void* s, void* ret );

int mscore_c( SV* str ) {
    int i, acc, score, max, one;
    
    STRLEN len;
    char* buf = SvPV( str, len );
    len --;
    
    acc = 0;
    score = 0;
    max = -1;
    for( i = 0; i < len; i ++ ) {
        one = buf[ i ] - '0';
        acc += one;
        score += one ? -1 : 1;
        max = score > max ? score : max;
    }
    return max + acc + ( buf[ i ] == '1' );
}

int mscore_c_better( SV* str ) {
    int i, acc, tmp, max;
    
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    len --;
    
    acc = 0;
    max = -2;
    for( i = 0; i < len; i ++ ) {
        acc += buf[ i ] - '0';
        tmp = i - 2 * acc;
        max = tmp > max ? tmp : max;
    }
    return max + acc + ( buf[ len ] == '1' ) + 1;
}

int mscore_c2f( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    
    int ret;
    c2f_( &len, buf, &ret );
    return ret;
}

int mscore_f( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    
    int ret;
    f_( &len, buf, &ret );
    return ret;
}

END_OF_C

1;
[download]

mscore_f.f90:

subroutine c2f( n, s, ret )
    integer,     intent( in )  :: n
    integer * 1, intent( in )  :: s( n )
    integer,     intent( out ) :: ret
    
    integer :: i, acc, mx, tmp
    
    acc = 0
    mx = -2
    do i = 1, n - 1
        acc = acc + ( s( i ) - iachar( '0', 1 ))
        tmp = i - 1 - 2 * acc
        mx = max( mx, tmp )
    end do
    
    ret = acc + mx + ( s( n ) - iachar( '0', 1 )) + 1
end subroutine c2f

subroutine f( n, s, ret )
    integer,     intent( in )  :: n
    integer * 1, intent( in )  :: s( n )
    integer,     intent( out ) :: ret
    
    integer, parameter :: A97 = 2 * iachar( '0', 1 ) + 1    ! 97
    
    integer     :: i
    integer * 1 :: a( n )           ! temporary pad
    integer     :: cum( n )         ! cumulative sums
    
    ret = count( s == iachar( '1', 1 ))     ! tr/1//
    a = 2 * s - A97                         ! tr/01/\xff\1/r
    a( n ) = 0                              ! "chop", kind of
    
    cum( 1 ) = a( 1 )
    do i = 2, n
        cum( i ) = cum( i - 1 ) + a( i )
    end do
    
    ret = ret - minval( cum )
end subroutine f
[download]

The jitter for fast contestants for very short (1e3) input is perhaps within margin of error; but what about "f" getting slower for 1e6 and above? I can only assume, memory for arrays declared within a subroutine is allocated per call (?? I don't know); this doesn't apply of course to dummy arrays i.e. sub arguments (passed by reference). The scratch-pads, i.e. "a" and "cum", were declared to have not explicit, but "n" size; does it matter anything?

Anyway, it then occurred to me, the scratch-pads can be much smaller and re-used many times, both for a single call (i.e. cumulative sums and minimums perhaps to be done in (quite) a few steps, following the consecutive nature of the problem all the same) and between calls, because declared as module variables i.e. the "mscore_f.f90" re-written as a module, per modern fashion:

module mscore_f
implicit none

private
integer, parameter :: A97 = 2 * iachar( '0', 1 ) + 1    ! 97

integer, parameter :: CHUNK = 25000
integer * 1        :: a( CHUNK )            ! scratch-pad #1
integer            :: cum( CHUNK )          ! scratch-pad #2

public :: f, c2f

contains

subroutine c2f( n, s, ret )

    ! skipped, the same code

end subroutine c2f

subroutine f( n, s, ret )
    integer,     intent( in )  :: n
    integer * 1, intent( in )  :: s( n )
    integer,     intent( out ) :: ret
    
    integer :: i, j, m, pre, L
    integer :: rprev                    ! "running previous"
    integer :: rmin                     ! "running minimum"
    
    ret = count( s == iachar( '1', 1 ))
    
    m = n / CHUNK
    if ( mod( n, CHUNK ) .ne. 0 ) then
        m = m + 1
    end if
    
    rprev = 0;
    rmin = 2;
    do j = 1, m
        pre = ( j - 1 ) * CHUNK
        L = min( CHUNK, ubound( s, 1 ) - pre - 1 )
        
        a( 1 : L ) = 2 * s( pre + 1 : pre + L ) - A97
        
        cum( 1 ) = rprev + a( 1 )
        do i = 2, L
            cum( i ) = cum( i - 1 ) + a( i )
        end do
        
        rmin = min( rmin, minval( cum( 1 : L )))
        rprev = cum( L )
    end do
    
    ret = ret - rmin
end subroutine f

end module mscore_f
[download]

And within Max_score_subs.pm:

extern void __mscore_f_MOD_c2f( void* n, void* s, void* ret );
extern void __mscore_f_MOD_f( void* n, void* s, void* ret );
[download]

Then, with relevant contestants, the results were quite a wow moment for me:

v5.42.0 / MSWin32-x64-multi-thread / 13.2.0

String length: 1,000
              Rate c_better      c2f        f
c_better  829929/s       --      -8%     -17%
c2f       905443/s       9%       --     -10%
f        1003704/s      21%      11%       --

String length: 10,000
             Rate c_better      c2f        f
c_better  94179/s       --     -11%     -22%
c2f      105326/s      12%       --     -13%
f        120618/s      28%      15%       --

String length: 100,000
            Rate c_better      c2f        f
c_better  9683/s       --     -12%     -22%
c2f      10983/s      13%       --     -12%
f        12418/s      28%      13%       --

String length: 1,000,000
           Rate c_better      c2f        f
c_better  969/s       --     -10%     -22%
c2f      1082/s      12%       --     -13%
f        1246/s      28%      15%       --

String length: 10,000,000
           Rate c_better      c2f        f
c_better 95.1/s       --      -9%     -20%
c2f       104/s      10%       --     -12%
f         119/s      25%      14%       --

String length: 100,000,000
           Rate c_better      c2f        f
c_better 9.55/s       --      -9%     -18%
c2f      10.5/s      10%       --     -10%
f        11.6/s      22%      11%       --
[download]

[reply]
[d/l]
[select]

I can only assume, memory for arrays declared within a subroutine is allocated per call (?? I don't know); this doesn't apply of course to dummy arrays i.e. sub arguments (passed by reference). The scratch-pads, i.e. "a" and "cum", were declared to have not explicit, but "n" size; does it matter anything?

At least it's easy to observe that memory is returned to OS upon return from a subroutine, where VLA (variable length array) is declared (and used) -- _if_ array gets large enough such as ~130K 64-bit integers. And _not_ returned to OS for smaller arrays. Don't know if/when other factors may be at play; and I doubt this factoid has any value. But it explains why the original "f" subroutine had been slower for 1e6 and longer input.

Let's aim higher than ~40% gain through tweaks and compiler voodoo -- let's get parallel!

use strict;
use warnings;
use feature 'say';
use Config;
use Benchmark qw/ timeit :hireswallclock /;

use lib '.';
use Max_score_subs ':all';

say "$^V / $Config{ archname } / $Config{ gccversion }";

my $str;
my %tests = (
                                                #   C:
    c        => sub { mscore_c( $str )},        # - dumb
    c_better => sub { mscore_c_better( $str )}, # - tweaked
    c_omp    => sub { mscore_c_omp( $str )},    # - "c", parallel
                                                #
                                                #   Fortran:
    c2f      => sub { mscore_c2f( $str )},      # - "c_better" (in F)
    f        => sub { mscore_f( $str )},        # - "PDL" (in F)
    f_omp    => sub { mscore_f_omp( $str )},    # - "f", parallel
);

my $iters = 2e5;
my $fmt = '%8.0f';

for my $L ( 1e4, 1e5, 1e6, 1e7, 1e8 ) {
    say "\nString length: " . $L =~ s/(\d)(?=(\d{3})+$)/$1,/gr;
    
    $str = '1' x $L;
    substr $str, rand $L, 1 , '0' for 1 .. $L;

    my $any;
    my %ret;
    for my $key ( keys %tests ) {
        my $result = $tests{ $key }-> ();
        die "<<< $key!!! >>>" unless $result == ( $any //= $result );
        
        my $t = timeit( $iters, $tests{ $key });
        $ret{ $key } = $t-> iters / $t-> real
    }
    
    print "             Rate/s    %\n";
    $fmt = '%8.1f' if $L > 1e6;
    for my $key ( sort { $ret{ $a } <=> $ret{ $b }} keys %ret ) {
        printf " %-9s $fmt %5.0f\n", 
            $key, $ret{ $key }, 100 * $ret{ $key } / $ret{ c }
    }
    $iters /= 10
}

__END__

v5.42.0 / MSWin32-x64-multi-thread / 13.2.0

String length: 10,000
             Rate/s    %
 f_omp        23804    28
 c_omp        25362    29
 c            86100   100
 c_better     91470   106
 c2f         105239   122
 f           120134   140

String length: 100,000
             Rate/s    %
 c             8724   100
 c_better      9324   107
 c2f          10686   122
 f            12139   139
 c_omp        15369   176
 f_omp        15964   183

String length: 1,000,000
             Rate/s    %
 c              875   100
 c_better       931   106
 c2f           1068   122
 f             1213   139
 c_omp         2306   264
 f_omp         2560   293

String length: 10,000,000
             Rate/s    %
 c             86.8   100
 c_better      92.2   106
 c2f          104.5   120
 f            115.6   133
 c_omp        349.6   403
 f_omp        426.9   492

String length: 100,000,000
             Rate/s    %
 c              8.6   100
 c_better       9.2   106
 c2f           10.4   120
 f             11.4   131
 c_omp         34.7   401
 f_omp         42.4   491
[download]

What a pity, I only have 4 cores and can't achieve more than 4x increase, would be fun to watch. In hindsight, the improved "c_better" and its translation "c2f" were a trap. With their loops, it's not obvious how to split them in independent parts. OTOH, for the "dumb" C version it's not too difficult to split:

int mscore_c_omp( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );
    len --;
    
    int nparts = omp_get_num_procs();
    if ( nparts > 4 ) nparts = 4;
    int psize = len / nparts;
    
    int acc_a[ nparts ];
    int cum_a[ nparts ];
    int max_a[ nparts ];
    
    #pragma omp parallel for schedule( static, 1 ) \
                             num_threads( nparts )
    for ( int j = 0; j < nparts; j ++ ) {
        int lo = j * psize;
        int hi = ( j == nparts - 1 ) ? len : lo + psize;
        
        int acc = 0;
        int cum = 0;
        int max = -1;
        
        for ( int i = lo; i < hi; i ++ ) {
            int one = buf[ i ] - '0';
            acc += one;
            cum += one ? -1 : 1;
            if ( cum > max ) max = cum;
        }
        acc_a[ j ] = acc;
        cum_a[ j ] = cum;
        max_a[ j ] = max;
    }
    
    int acc = acc_a[ 0 ]; 
    int max = max_a[ 0 ];
    
    for ( int j = 1; j < nparts; j ++ ) {
        acc += acc_a[ j ];
        
        cum_a[ j ] += cum_a[ j - 1 ];
        max_a[ j ] += cum_a[ j - 1 ];
        
        if ( max_a[ j ] > max ) max = max_a[ j ];
    }
    
    return acc + max + ( buf[ len ] == '1' );
}
[download]

In my tests, a string length of ~70k is a cut-off value where OMP versions begin to outrun the original versions. And it's better to just switch to originals rather than adding "if" directive to omp pragma, the above function is somewhat cumbersome and slow. Speaking about cumbersome, the Fortran OMP-enabled sub is a long way from original (elegant) PDL version. At least there remains ~25% gain over "c_omp", through the best of both worlds: parallel with OMP, and compiler voodoo with vectorization.

subroutine f_omp( n, s, ret )
    integer,     intent( in )  :: n
    integer * 1, intent( in )  :: s( n )
    integer,     intent( out ) :: ret
    
    integer              :: j, nparts, psize
    integer, allocatable :: acc_a ( : ), cum_a( : ), min_a( : )
    
    nparts = omp_get_num_procs()
    if ( nparts > 4 ) nparts = 4
    psize = n / nparts;
    
    allocate( acc_a( nparts ))
    allocate( cum_a( nparts ))
    allocate( min_a( nparts ))
    
    !$omp parallel do schedule( static, 1 ) num_threads( nparts )
    do j = 1, nparts ; block
        integer * 1, allocatable :: a( : )
        integer, allocatable     :: cum( : )
        
        integer :: lo, hi, d, nchunks
        integer :: pre, L, i, k
        integer :: acc_, cum_, min_
        
        lo = ( j - 1 ) * psize
        if ( j == nparts ) then
            hi = n - 1
        else
            hi = j * psize
        end if
        d = hi - lo
        nchunks = d / CHUNK
        if ( mod( d, CHUNK ) .ne. 0 ) nchunks = nchunks + 1
        
        acc_ = 0;
        cum_ = 0;
        min_ = 2;
        
        allocate( a( CHUNK ))
        allocate( cum( CHUNK ))
        
        do k = 1, nchunks
            pre = lo + ( k - 1 ) * CHUNK
            L = min( CHUNK, hi - pre )
            
            a( 1 : L ) = 2 * s( pre + 1 : pre + L ) - A97
            cum( 1 ) = cum_ + a( 1 )
            do i = 2, L
                cum( i ) = cum( i - 1 ) + a( i )
            end do
            
            acc_ = acc_ + count( s( pre + 1 : pre + L ) &
                                == iachar( '1', 1 ))
            cum_ = cum( L )
            min_ = min( min_, minval( cum( 1 : L )))
        end do
        
        acc_a( j ) = acc_
        cum_a( j ) = cum_
        min_a( j ) = min_
        
    end block ; end do
    !$omp end parallel do
    
    do j = 2, nparts
        cum_a( j ) = cum_a( j ) + cum_a( j - 1 )
        min_a( j ) = min_a( j ) + cum_a( j - 1 )
    end do
    
    ret = sum( acc_a ) + ( s( n ) - iachar( '0', 1 )) &
          - minval( min_a )
end subroutine f_omp
[download]

[reply]
[d/l]
[select]

(OP here again, with long-forgotten PWC 342-2, exploring depths in muddy waters around 4-5 lines of trivial C)

What a pity, I only have 4 cores and can't achieve more than 4x increase, would be fun to watch

Fun should never be denied:

v5.42.0 / MSWin32-x64-multi-thread / 13.2.0

String length: 10,000
             Rate/s    %
 c            87663   100
 va          622005   710

String length: 100,000
             Rate/s    %
 c             8875   100
 c_omp        15439   174
 va           71646   807

String length: 1,000,000
             Rate/s    %
 c              891   100
 c_omp         2307   259
 va           12770  1434

String length: 10,000,000
             Rate/s    %
 c             88.3   100
 c_omp        361.2   409
 va          1615.6  1829

String length: 100,000,000
             Rate/s    %
 c              8.8   100
 c_omp         36.2   410
 va           167.7  1900
[download]

'c' and 'c_omp' are for 'mscore_c' (straightforward i.e. "dumb") and 'mscore_c_omp' found in this thread. The 'va' stands for 'vectors, assembly'; where OMP (I have 4 cores) is "automatically" switched on for strings longer than 4e5. CPU is supposed to support AVX2 i.e. to be 2017+.

There's a "cheat": sequential runs of either '0' or '1's, in target string, shouldn't be longer than 2**31. Because current (running) scores are maintained (dragged along) as 8 per 256-bit registers. Perhaps there is more optimal way to find (horizontal) cumulative sums along 16 bytes, other than 4 shifts/shuffles and 4 additions. What's really funny is that to find (rather, maintain) running maximum over 32 numbers I'm only doing 3 comparisons. But maybe all of the above (i.e. below) looks not funny but ugly to "people who know how".

use Inline C => Config =>
    ccflagsex => '-fopenmp',
    libs      => '-lgomp -ldl';
use Inline C => << 'END_OF_C';

#include <omp.h>

#define MANY 400000
#define MAX_CORES 4

void _helper_a( char* buf, size_t len,
                int32_t* acc, int32_t* cum, int32_t* max );
void _helper_c( char* buf, size_t len,
                int32_t* acc, int32_t* cum, int32_t* max );

int mscore_va( SV* str ) {
    STRLEN len;
    char* buf = SvPVbyte( str, len );

    int ncores = 1;
    if ( len >= MANY ) {
        int n = omp_get_num_procs();
        ncores = n > MAX_CORES ? MAX_CORES : n;
    }
    int nparts = 2 + ncores;

    int32_t acc_a[ nparts ]; 
    int32_t cum_a[ nparts ]; 
    int32_t max_a[ nparts ]; 
    memset( acc_a, 0x00, sizeof( acc_a ));
    memset( cum_a, 0x00, sizeof( cum_a ));
    memset( max_a, 0xFF, sizeof( max_a ));
    
    len --;
    
    int32_t   prefix_len    = ( size_t ) buf % 32;
    if ( len < prefix_len ) prefix_len = len;
    int32_t   suffix_len    = ( len - prefix_len ) % 32;
    int32_t   aligned_len   = len - prefix_len - suffix_len;
    char*     prefix_start  = buf;
    char*     aligned_start = buf + prefix_len;
    char*     suffix_start  = aligned_start + aligned_len;
    
    if ( prefix_len > 0 ) {
        _helper_c( prefix_start, prefix_len,
            acc_a, cum_a, max_a );
    }
    if ( suffix_len > 0 ) {
        _helper_c( suffix_start, suffix_len,
            &( acc_a[ nparts - 1 ]),
            &( cum_a[ nparts - 1 ]),
            &( max_a[ nparts - 1 ]));
    }
    if ( aligned_len > 0 ) {
        if ( ncores == 1 ) {
            _helper_a( aligned_start, aligned_len,
                &( acc_a[ 1 ]),
                &( cum_a[ 1 ]),
                &( max_a[ 1 ]));
        }
        else {
            size_t psize = len / 32 / ncores * 32;
            
            #pragma omp parallel for schedule( static, 1 ) \
                                     num_threads( ncores )
            for ( int j = 0; j < ncores; j ++ ) {
                char* start = aligned_start + j * psize;
                size_t len_ = ( j < ncores - 1 ) ? psize : aligned_len
+ - j * psize;
                
                _helper_a( start, len_,
                    &( acc_a[ j + 1 ]),
                    &( cum_a[ j + 1 ]),
                    &( max_a[ j + 1 ]));
            }
        }
    }

    int32_t acc = acc_a[ 0 ]; 
    int32_t max = max_a[ 0 ];

    for ( int j = 1; j < nparts; j ++ ) {
        acc += acc_a[ j ];

        cum_a[ j ] += cum_a[ j - 1 ];
        max_a[ j ] += cum_a[ j - 1 ];
        
        if ( max_a[ j ] > max ) max = max_a[ j ];
    }
    
    return acc + max + ( buf[ len ] == '1' );
}

void _helper_c( char* buf, size_t len,                          // in
                int32_t* acc, int32_t* cum, int32_t* max ) {    // out
    for( int i = 0; i < len; i ++ ) {
        int one = buf[ i ] - '0';
        *acc += one;
        *cum += one ? -1 : 1;
        if ( *cum > *max ) *max = *cum;
    }
}

void _helper_a( char* buf, size_t len,                          // in
                int32_t* acc, int32_t* cum, int32_t* max ) {    // out

    void* fin = buf + len;
    int32_t acc_, cum_, max_;

    __asm__ ( 
//  "    IDDQD                          \n"
    "    XOR            %[acc], %[acc]  \n"
    "    MOV            $0x00,  %%r11   \n"
    "    MOVQ           %%r11,  %%xmm0  \n"
    "    VPBROADCASTB   %%xmm0, %%ymm0  \n"     // ymm0 = cum
    "    MOV            $0xFF,  %%r11   \n"
    "    MOVQ           %%r11,  %%xmm1  \n"
    "    VPBROADCASTB   %%xmm1, %%ymm1  \n"     // ymm1 = max
    "    MOV            $0x61,  %%r11   \n"
    "    MOVQ           %%r11,  %%xmm2  \n"
    "    VPBROADCASTB   %%xmm2, %%ymm2  \n"     // ymm2 = 97 x 32
    "    MOV            $0x0F,  %%r11   \n"
    "    MOVQ           %%r11,  %%xmm6  \n"
    "    VPBROADCASTB   %%xmm6, %%ymm6  \n"     // ymm6 = 15 x 32

    "    MOV            %[buf], %%r8    \n"
    "    MOV            %[fin], %%r9    \n"
    "    MOV            $32,    %%r10   \n"
    ".L2_:                                          \n"
    "       VMOVDQA     (%%r8),  %%ymm3             \n"
    "       VPADDB      %%ymm3,  %%ymm3,    %%ymm3  \n"
    "       VPSUBB      %%ymm3,  %%ymm2,    %%ymm3  \n"
    "       VPMOVMSKB   %%ymm3,  %%ecx              \n"
    "       POPCNT      %%ecx,   %%ecx              \n"
    "       ADD         %%ecx,   %[acc]             \n"

    "       VPSLLDQ     $1,      %%ymm3,    %%ymm4  \n"
    "       VPADDSB     %%ymm3,  %%ymm4,    %%ymm3  \n"
    "       VPSLLDQ     $2,      %%ymm3,    %%ymm4  \n"
    "       VPADDSB     %%ymm3,  %%ymm4,    %%ymm3  \n"
    "       VPSLLDQ     $4,      %%ymm3,    %%ymm4  \n"
    "       VPADDSB     %%ymm3,  %%ymm4,    %%ymm3  \n"
    "       VPSLLDQ     $8,      %%ymm3,    %%ymm4  \n"
    "       VPADDSB     %%ymm3,  %%ymm4,    %%ymm3  \n"

    "       VPSHUFB     %%ymm6,  %%ymm3,    %%ymm4  \n"

    "       VPSHUFD    $0b01001110, %%ymm3, %%ymm5  \n"
    "       VPMAXSB    %%ymm3,      %%ymm5, %%ymm3  \n"

    "       VPMOVSXBD  %%xmm3,      %%ymm5          \n"
    "       VPADDD     %%ymm0,      %%ymm5, %%ymm5  \n"
    "       VPMAXSD    %%ymm1,      %%ymm5, %%ymm1  \n"
    "       VPMOVSXBD  %%xmm4,      %%ymm5          \n"
    "       VPADDD     %%ymm0,      %%ymm5, %%ymm0  \n"

    "       VPERMQ     $0b01001110, %%ymm3, %%ymm3  \n"
    "       VPERMQ     $0b01001110, %%ymm4, %%ymm4  \n"

    "       VPMOVSXBD  %%xmm3,      %%ymm5          \n"
    "       VPADDD     %%ymm0,      %%ymm5, %%ymm5  \n"
    "       VPMAXSD    %%ymm1,      %%ymm5, %%ymm1  \n"
    "       VPMOVSXBD  %%xmm4,      %%ymm5          \n"
    "       VPADDD     %%ymm0,      %%ymm5, %%ymm0  \n"

    "       ADD        %%r10,       %%r8            \n"
    "       CMP        %%r9,        %%r8            \n"
    "       JL         .L2_                         \n"

    "    VPERMQ     $0b00111001,    %%ymm1, %%ymm4  \n"
    "    VPMAXSD    %%ymm1,         %%ymm4, %%ymm1  \n"
    "    VPERMQ     $0b01001110,    %%ymm1, %%ymm4  \n"
    "    VPMAXSD    %%ymm1,         %%ymm4, %%ymm1  \n"
    "    PSHUFD     $0b00111001,    %%xmm1, %%xmm4  \n"
    "    PMAXSD     %%xmm4,         %%xmm1          \n"

    "    MOVD       %%xmm1,         %[max]          \n"
    "    MOVD       %%xmm0,         %[cum]          \n"

        : [acc] "=r"  ( acc_ ),
          [cum] "=rm" ( cum_ ),
          [max] "=rm" ( max_ )
        : [buf] "m"   ( buf ),
          [fin] "m"   ( fin )
        : "cc", 
          "rcx", "r8", "r9", "r10", "r11",
          "ymm0", "ymm1", "ymm2", "ymm3", "ymm4", "ymm5", "ymm6"
    );
    
    *acc = acc_;
    *cum = cum_;
    *max = max_;
}

END_OF_C
[download]

[reply]
[d/l]
[select]

How about this:

use strict;
use warnings;
use feature 'say';

use Benchmark 'cmpthese';

use constant L => 1e5;

my @cases;

my $str = '1' x L;
push @cases, $str;

substr $str, rand L, 1 , '0' for 1 .. L;
push @cases, $str;
push @cases, '01' x ( L / 2 );

say $^V;

for my $s ( @cases ) {
    cmpthese -2, {
        count => sub { $s =~ tr/1// },
    }
}

__END__

v5.32.1
         Rate count
count 21711/s    --
         Rate count
count 21607/s    --
         Rate count
count 21348/s    --


v5.42.0
         Rate count
count 16271/s    --
        Rate count
count 2422/s    --
        Rate count
count 8136/s    --
[download]

String composition affects as trivial operation as counting. Looks like someone not heeded the advice to "write like C3, not C4" and relied on compiler voodoo, when writing Perl itself, not silly solution for PWC. Though of course it may be something else rather than innocent "if" with its jumps. Maybe situation can't be shrugged off after all.

[reply]
[d/l]

What's with the rand? Your results are meaningless. The results from the different versions can't be compared to each other, and the results from one version can't be compared to each other. Finally, none of your test cases uses OOK scalars, and the whole question is about OOK scalars.

[reply]