in reply to Weird performance issue with Strawberries and Inline::C
Here's more on compiler voodoo, though it involves a very trivial problem and primitive code; I either publish it (it's Perl related after all) or just throw it away. The PWC-342-2 was all about finding a maximum (an extremum, depends on POV) over cumulative sums. This is camouflaged, in my pure-Perl solution (below) by use of a regex; which is fastest (?) (over PWC GH directory) and readable OK. At the same time I wrote a PDL solution (rather, Perl/PDL combo); which is of course faster and no less (?) readable.
Then (2 weeks ago?), because confused about OOK and "4-args-substr" (not used below; and the confusion cleared now, thanks to the answers to another SOPW question), I wrote a "straightforward" C solution, followed by a tweaked (through simple arithmetic) -- "c_better" -- solution.
Further yet, I tried another -- Fortran -- language solution, pluggable directly into Perl's Benchmark harness thanks to Inline::C::Cookbook recipe. The c2f subroutine is almost exact, line-by-line translation of c_better to Fortran, and yet it's _faster_, don't know why, by the same margin regardless of -O2 or -O3 used to compile it, and despite of a C wrapper used to call it. (But, for the record, the "f" subroutine(s) require -O3 to be the fastest.)
Then, inspired by this success (?), I tried the direct, albeit perhaps mechanical, translation of PDL solution to Fortran. The latter doesn't have the cumulative sums intrinsic, so this one I had to do as an explicit loop. Unlike my C and "c2f", it performs several passes over input data (create a mask, count, transliterate, do cum sums, find a minimum) and stores intermediate arrays. However, it's _fastest_, for 1e4 and 1e5 input.
use strict; use warnings; use feature 'say'; use Config; use Benchmark 'cmpthese'; use lib '.'; use Max_score_subs ':all'; say "$^V / $Config{ archname } / $Config{ gccversion }"; my $str; my %tests = ( perl => sub { mscore_perl( $str )}, pdl => sub { mscore_pdl( $str )}, c => sub { mscore_c( $str )}, c_better => sub { mscore_c_better( $str )}, c2f => sub { mscore_c2f( $str )}, f => sub { mscore_f( $str )}, ); for my $L ( 1e3, 1e4, 1e5, 1e6, 1e7, 1e8 ) { say "\nString length: " . $L =~ s/(\d)(?=(\d{3})+$)/$1,/gr; $str = '1' x $L; substr $str, rand $L, 1 , '0' for 1 .. $L; delete $tests{ perl } if $L > 1e6; delete $tests{ pdl } if $L > 1e7; cmpthese -1, \%tests; } __END__ v5.42.0 / MSWin32-x64-multi-thread / 13.2.0 String length: 1,000 Rate perl pdl c c_better f c2 +f perl 6637/s -- -90% -99% -99% -99% -99 +% pdl 67389/s 915% -- -91% -92% -92% -93 +% c 764229/s 11415% 1034% -- -8% -13% -16 +% c_better 830711/s 12416% 1133% 9% -- -5% -9 +% f 876596/s 13108% 1201% 15% 6% -- -4 +% c2f 911910/s 13640% 1253% 19% 10% 4% - +- String length: 10,000 Rate perl pdl c c_better c2f +f perl 635/s -- -94% -99% -99% -99% -99 +% pdl 11113/s 1651% -- -87% -88% -89% -90 +% c 86164/s 13476% 675% -- -9% -18% -26 +% c_better 94730/s 14826% 752% 10% -- -9% -19 +% c2f 104655/s 16390% 842% 21% 10% -- -10 +% f 116531/s 18261% 949% 35% 23% 11% - +- String length: 100,000 Rate perl pdl c c_better c2f f perl 63.1/s -- -95% -99% -99% -99% -99% pdl 1161/s 1741% -- -87% -88% -89% -91% c 8934/s 14061% 669% -- -8% -18% -28% c_better 9692/s 15262% 734% 8% -- -11% -22% c2f 10925/s 17216% 841% 22% 13% -- -12% f 12350/s 19475% 963% 38% 27% 13% -- String length: 1,000,000 Rate perl pdl f c c_better c2f perl 6.31/s -- -94% -99% -99% -99% -99% pdl 104/s 1554% -- -80% -88% -89% -90% f 535/s 8370% 412% -- -38% -45% -51% c 865/s 13600% 728% 62% -- -11% -20% c_better 968/s 15243% 828% 81% 12% -- -11% c2f 1085/s 17095% 940% 103% 26% 12% -- String length: 10,000,000 Rate pdl f c c_better c2f pdl 10.1/s -- -76% -89% -90% -90% f 41.5/s 311% -- -53% -57% -61% c 88.7/s 778% 114% -- -8% -16% c_better 96.4/s 854% 132% 9% -- -9% c2f 106/s 947% 155% 19% 10% -- String length: 100,000,000 Rate f c c_better c2f f 4.39/s -- -51% -54% -58% c 8.87/s 102% -- -7% -14% c_better 9.55/s 118% 8% -- -8% c2f 10.3/s 136% 17% 8% --
Max_score_subs.pm:
package Max_score_subs; use strict; use warnings; use feature 'say'; use PDL; use Exporter 'import'; our %EXPORT_TAGS = ( all => [ qw/ mscore_perl mscore_pdl mscore_c mscore_c_better mscore_c2f mscore_f /], ); Exporter::export_ok_tags( 'all' ); sub mscore_perl { my $s = shift; my $max = '0' eq substr $s, 0, 1; $s = substr $s, 1; $max += $s =~ tr/1//; chop $s; my $score = $max; while ( $s =~ /(0*)(1*)/g ) { $score += length $1; $max = $score if $score > $max; $score -= length $2 } return $max } sub mscore_pdl { my $s = shift; my $count = $s =~ tr/1//; chop $s; $s =~ tr/01/\xff\1/; my $p = zeroes sbyte, length $s; ${ $p-> get_dataref } = $s; $p-> upd_data; $count - $p-> cumusumover -> minover -> sclr } use FindBin; use Inline Config => force_build => 1; use Inline C => Config => libs => "-L$FindBin::Bin -lmscore_f -lgfortran -lquadmath"; use Inline C => << 'END_OF_C'; extern void c2f_( void* n, void* s, void* ret ); extern void f_( void* n, void* s, void* ret ); int mscore_c( SV* str ) { int i, acc, score, max, one; STRLEN len; char* buf = SvPV( str, len ); len --; acc = 0; score = 0; max = -1; for( i = 0; i < len; i ++ ) { one = buf[ i ] - '0'; acc += one; score += one ? -1 : 1; max = score > max ? score : max; } return max + acc + ( buf[ i ] == '1' ); } int mscore_c_better( SV* str ) { int i, acc, tmp, max; STRLEN len; char* buf = SvPVbyte( str, len ); len --; acc = 0; max = -2; for( i = 0; i < len; i ++ ) { acc += buf[ i ] - '0'; tmp = i - 2 * acc; max = tmp > max ? tmp : max; } return max + acc + ( buf[ len ] == '1' ) + 1; } int mscore_c2f( SV* str ) { STRLEN len; char* buf = SvPVbyte( str, len ); int ret; c2f_( &len, buf, &ret ); return ret; } int mscore_f( SV* str ) { STRLEN len; char* buf = SvPVbyte( str, len ); int ret; f_( &len, buf, &ret ); return ret; } END_OF_C 1;
mscore_f.f90:
subroutine c2f( n, s, ret ) integer, intent( in ) :: n integer * 1, intent( in ) :: s( n ) integer, intent( out ) :: ret integer :: i, acc, mx, tmp acc = 0 mx = -2 do i = 1, n - 1 acc = acc + ( s( i ) - iachar( '0', 1 )) tmp = i - 1 - 2 * acc mx = max( mx, tmp ) end do ret = acc + mx + ( s( n ) - iachar( '0', 1 )) + 1 end subroutine c2f subroutine f( n, s, ret ) integer, intent( in ) :: n integer * 1, intent( in ) :: s( n ) integer, intent( out ) :: ret integer, parameter :: A97 = 2 * iachar( '0', 1 ) + 1 ! 97 integer :: i integer * 1 :: a( n ) ! temporary pad integer :: cum( n ) ! cumulative sums ret = count( s == iachar( '1', 1 )) ! tr/1// a = 2 * s - A97 ! tr/01/\xff\1/r a( n ) = 0 ! "chop", kind of cum( 1 ) = a( 1 ) do i = 2, n cum( i ) = cum( i - 1 ) + a( i ) end do ret = ret - minval( cum ) end subroutine f
The jitter for fast contestants for very short (1e3) input is perhaps within margin of error; but what about "f" getting slower for 1e6 and above? I can only assume, memory for arrays declared within a subroutine is allocated per call (?? I don't know); this doesn't apply of course to dummy arrays i.e. sub arguments (passed by reference). The scratch-pads, i.e. "a" and "cum", were declared to have not explicit, but "n" size; does it matter anything?
Anyway, it then occurred to me, the scratch-pads can be much smaller and re-used many times, both for a single call (i.e. cumulative sums and minimums perhaps to be done in (quite) a few steps, following the consecutive nature of the problem all the same) and between calls, because declared as module variables i.e. the "mscore_f.f90" re-written as a module, per modern fashion:
module mscore_f implicit none private integer, parameter :: A97 = 2 * iachar( '0', 1 ) + 1 ! 97 integer, parameter :: CHUNK = 25000 integer * 1 :: a( CHUNK ) ! scratch-pad #1 integer :: cum( CHUNK ) ! scratch-pad #2 public :: f, c2f contains subroutine c2f( n, s, ret ) ! skipped, the same code end subroutine c2f subroutine f( n, s, ret ) integer, intent( in ) :: n integer * 1, intent( in ) :: s( n ) integer, intent( out ) :: ret integer :: i, j, m, pre, L integer :: rprev ! "running previous" integer :: rmin ! "running minimum" ret = count( s == iachar( '1', 1 )) m = n / CHUNK if ( mod( n, CHUNK ) .ne. 0 ) then m = m + 1 end if rprev = 0; rmin = 2; do j = 1, m pre = ( j - 1 ) * CHUNK L = min( CHUNK, ubound( s, 1 ) - pre - 1 ) a( 1 : L ) = 2 * s( pre + 1 : pre + L ) - A97 cum( 1 ) = rprev + a( 1 ) do i = 2, L cum( i ) = cum( i - 1 ) + a( i ) end do rmin = min( rmin, minval( cum( 1 : L ))) rprev = cum( L ) end do ret = ret - rmin end subroutine f end module mscore_f
And within Max_score_subs.pm:
extern void __mscore_f_MOD_c2f( void* n, void* s, void* ret ); extern void __mscore_f_MOD_f( void* n, void* s, void* ret );
Then, with relevant contestants, the results were quite a wow moment for me:
v5.42.0 / MSWin32-x64-multi-thread / 13.2.0 String length: 1,000 Rate c_better c2f f c_better 829929/s -- -8% -17% c2f 905443/s 9% -- -10% f 1003704/s 21% 11% -- String length: 10,000 Rate c_better c2f f c_better 94179/s -- -11% -22% c2f 105326/s 12% -- -13% f 120618/s 28% 15% -- String length: 100,000 Rate c_better c2f f c_better 9683/s -- -12% -22% c2f 10983/s 13% -- -12% f 12418/s 28% 13% -- String length: 1,000,000 Rate c_better c2f f c_better 969/s -- -10% -22% c2f 1082/s 12% -- -13% f 1246/s 28% 15% -- String length: 10,000,000 Rate c_better c2f f c_better 95.1/s -- -9% -20% c2f 104/s 10% -- -12% f 119/s 25% 14% -- String length: 100,000,000 Rate c_better c2f f c_better 9.55/s -- -9% -18% c2f 10.5/s 10% -- -10% f 11.6/s 22% 11% --
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: Weird performance issue with Strawberries and Inline::C
by Anonymous Monk on Oct 27, 2025 at 11:46 UTC | |
by Anonymous Monk on Nov 06, 2025 at 11:39 UTC | |
by Anonymous Monk on Nov 09, 2025 at 17:41 UTC | |
by Anonymous Monk on Nov 12, 2025 at 13:53 UTC | |
by Anonymous Monk on Nov 17, 2025 at 14:54 UTC |