PerlMonks  

Re^5: Inline::C on Windows: how to improve performance of compiled code?

by BrowserUk (Patriarch)
on Jun 16, 2018 at 11:32 UTC ( #1216776 )


in reply to Re^4: Inline::C on Windows: how to improve performance of compiled code?
in thread Inline::C on Windows: how to improve performance of compiled code?

I backed away from this when I saw you weren't using an MS compiler, as I've no experience of gcc/mingw, but it seems to me that this will remain a mystery until you start inspecting the generated code. With MS CL, adding /link /FAs to the compiler options causes it to output a .asm file.

When I run the following:

    #! perl -slw
    use strict;
    use Config;

    print $Config{ ccflags };

    use Inline C => Config => BUILD_NOISY => 1, CCFLAGS => $Config{ ccflags } . "/link /FAs";
    use Inline C => <<'END_C', NAME => '_junk', CLEAN_AFTER_BUILD => 0;

    int i = 0;

    void test( SV *sv ) {
        ++i;
        return;
    }

    int check( SV *sv ) {
        return i;
    }

    END_C

    use Time::HiRes qw[ time ];

    our $N //= 1e6;

    my $start = time;
    my $i = 0;
    $i = test( 1 ) for 1 .. $N;
    printf "Took %fseconds\n", time() - $start;
    print check( 1 )

The assembly code produced for test() is pretty much exactly what you'd expect:

    PUBLIC  test
    ; Function compile flags: /Ogtpy
    _TEXT   SEGMENT
    sv$ = 8
    test    PROC
    ; 10   : ++i;
        inc     DWORD PTR i
    ; 11   : return;
    ; 12   : }
        ret     0
    test    ENDP
    _TEXT   ENDS

But then you have to look at the Perl callable wrapper function to see all the overhead that Perl-callability adds:

    _TEXT   SEGMENT
    my_perl$ = 48
    cv$ = 56
    XS_main_test PROC
    ; 174  : {
        mov     QWORD PTR [rsp+8], rbx
        mov     QWORD PTR [rsp+16], rsi
        push    rdi
        sub     rsp, 32                 ; 00000020H
        mov     rdi, rdx
    ; 175  : dVAR; dXSARGS;
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Istack_sp_ptr
        mov     rbx, QWORD PTR [rax]
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Imarkstack_ptr_ptr
        mov     rcx, QWORD PTR [rax]
        add     rcx, -4
        movsxd  rsi, DWORD PTR [rcx+4]
        mov     QWORD PTR [rax], rcx
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Istack_base_ptr
        mov     rax, QWORD PTR [rax]
        lea     rdx, QWORD PTR [rax+rsi*8]
        inc     esi
        sub     rbx, rdx
        sar     rbx, 3
    ; 176  : if (items != 1)
        cmp     ebx, 1
        je      SHORT $LN8@XS_main_te
    ; 177  : croak_xs_usage(cv, "sv");
        call    Perl_get_context
        lea     r8, OFFSET FLAT:??_C@_02CPGMCOJE@sv?$AA@
        mov     rdx, rdi
        mov     rcx, rax
        call    Perl_croak_xs_usage
    $LN8@XS_main_te:
    ; 178  : PERL_UNUSED_VAR(ax); /* -Wall */
    ; 179  : SP -= items;
    ; 180  : {
    ; 181  : SV * sv = ST(0)
    ; 182  : ;
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Istack_base_ptr
    ; File c:\test\_inline\build\_junk\_junk.xs
    ; 30   : temp = PL_markstack_ptr++;
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Imarkstack_ptr_ptr
    ; 31   : test(sv);
        inc     DWORD PTR i
        mov     rbx, QWORD PTR [rax]
        lea     rcx, QWORD PTR [rbx+4]
        mov     QWORD PTR [rax], rcx
    ; 32   : if (PL_markstack_ptr != temp) {
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Imarkstack_ptr_ptr
        cmp     QWORD PTR [rax], rbx
        je      SHORT $LN4@XS_main_te
    ; 33   : /* truly void, because dXSARGS not invoked */
    ; 34   : PL_markstack_ptr = temp;
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Imarkstack_ptr_ptr
        mov     QWORD PTR [rax], rbx
    ; 35   : XSRETURN_EMPTY; /* return empty stack */
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Istack_base_ptr
        movsxd  rcx, esi
        mov     rax, QWORD PTR [rax]
        lea     rbx, QWORD PTR [rax+rcx*8-8]
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Istack_sp_ptr
        mov     QWORD PTR [rax], rbx
    $LN4@XS_main_te:
    ; File c:\test\_inline\build\_junk\_junk.c
    ; 200  : }
        mov     rbx, QWORD PTR [rsp+48]
        mov     rsi, QWORD PTR [rsp+56]
        add     rsp, 32                 ; 00000020H
        pop     rdi
        ret     0
    XS_main_test ENDP
    _TEXT   ENDS

And the real eye-opener comes when you start looking at the code behind those call Perl_xxx; instructions littered all over the place. (Why is it necessary to call Perl_get_context() 9 times for EVERY CALL to such a simple function?)
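
One plausible reading of the listing: on a threaded perl built without PERL_NO_GET_CONTEXT, each macro that touches interpreter state (PL_stack_sp, PL_markstack_ptr and so on) looks up the current interpreter again, so the lookup is repeated once per access rather than once per call. The following is a minimal stand-alone C model of that pattern, not Perl's real headers: interp_t, get_context(), wrapper_refetch() and wrapper_passed() are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an interpreter struct; not Perl's real layout. */
typedef struct {
    int stack_sp;
    int markstack_ptr;
    int stack_base;
} interp_t;

static interp_t the_interp = { 0, 0, 0 };
static int context_fetches = 0;   /* counts calls to get_context() */

/* Stand-in for a per-thread lookup in the spirit of Perl_get_context(). */
static interp_t *get_context(void) {
    ++context_fetches;
    return &the_interp;
}

/* Style 1: every access to interpreter state fetches the context again. */
static void wrapper_refetch(void) {
    int sp   = get_context()->stack_sp;
    int mark = get_context()->markstack_ptr;
    int base = get_context()->stack_base;
    get_context()->stack_sp = base + (sp - mark);
}

/* Style 2: the context is handed in once, as an explicit argument. */
static void wrapper_passed(interp_t *ctx) {
    assert(ctx != NULL);
    int sp   = ctx->stack_sp;
    int mark = ctx->markstack_ptr;
    int base = ctx->stack_base;
    ctx->stack_sp = base + (sp - mark);
}

/* Number of context fetches for one wrapper call in the refetch style. */
int fetches_refetch(void) {
    context_fetches = 0;
    wrapper_refetch();
    return context_fetches;
}

/* Number of context fetches for one wrapper call in the passed style. */
int fetches_passed(void) {
    context_fetches = 0;
    wrapper_passed(get_context());
    return context_fetches;
}
```

In this model fetches_refetch() reports four lookups for the same work that fetches_passed() does with one; the real wrapper above simply has more such accesses.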

If you assume that your original empty C stub is actually causing code to be generated and run -- and I don't; I think your call to the empty function is being optimised away -- then it would be instructive to see the difference in the code that is being called.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

Re^6: Inline::C on Windows: how to improve performance of compiled code?
by syphilis (Archbishop) on Jun 16, 2018 at 14:55 UTC
    Why is it necessary to call Perl_get_context() 9 times for EVERY CALL to such a simple function?

    I think that is unnecessary, and I would expect that those Perl_get_context() calls could be removed by declaring
    PRE_HEAD => '#define PERL_NO_GET_CONTEXT 1',
    in your script's "Config" section.

    If I don't define PERL_NO_GET_CONTEXT, then for me your script outputs:
        Took 0.160126seconds
        1000000
    With PERL_NO_GET_CONTEXT defined it runs twice as quickly:
        Took 0.072088seconds
        1000000
    (I've ignored the CCFLAGS output that is also produced.)
    AIUI, the problem with defining PERL_NO_GET_CONTEXT in Inline::C scripts is that it causes breakage if any of the Inline::C functions call Perl API functions.
    But none of the functions in your script call Perl API functions, so it's ok to define PERL_NO_GET_CONTEXT.

    Could it be that the hint that vr is looking for is simply to "define PERL_NO_GET_CONTEXT" ?

    Cheers,
    Rob
      (I've ignored the CCFLAGS output that is also produced.)

      Just debug.

      AIUI, the problem with defining PERL_NO_GET_CONTEXT in Inline::C scripts is that it causes breakage if any of the Inline::C functions call Perl API functions. But none of the functions in your script call Perl API functions, so it's ok to define PERL_NO_GET_CONTEXT.

      Indeed. T'is unfortunate that almost every function that does anything useful needs to call at least one perl API.

      It is the case that many, if not all-but-one, of the Perl_get_context() calls get optimised away, but getting your hands on the post-optimised assembly is only possible by using a debugger, and it means relating any bug back to the pre-optimised C is a nightmare.

      Could it be that the hint that vr is looking for is simply to "define PERL_NO_GET_CONTEXT" ?

      Possibly; but getting his hands on the assembler output would be the surest way of finding out. That has to be possible with gcc/mingw, right?

      I still think that the chances are that gcc is optimising his c-stub and perl callable wrapper away completely.
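
On the gcc/mingw side, gcc's -S option writes the generated assembly to a .s file instead of an object file, and -save-temps keeps that file alongside a normal build. Below is a small stand-alone C file that could be fed to gcc -O2 -S to see whether an empty stub survives optimisation. The names empty_stub and call_stub_n are hypothetical, and passing something like CCFLAGS => $Config{ccflags} . ' -save-temps' through Inline::C is an untested assumption.

```c
#include <assert.h>

/* Hypothetical empty stub, standing in for a no-op Inline::C function. */
static void empty_stub(int x) {
    (void)x;  /* nothing to do */
}

/* Hypothetical caller: invokes the stub n times and returns the call count.
 * At -O2 gcc typically inlines a static stub like this one, leaving no call
 * instruction behind; compile with `gcc -O2 -S file.c` and read file.s to
 * check what your compiler actually does. */
int call_stub_n(int n) {
    int calls = 0;
    assert(n >= 0);
    for (int j = 0; j < n; ++j) {
        empty_stub(j);
        ++calls;
    }
    return calls;
}
```

This only shows what can happen to the stub. An XS wrapper is normally reached through a function pointer registered when the module loads, so the wrapper itself is unlikely to vanish even if the stub inside it is inlined.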


        T'is unfortunate that almost every function that does anything useful needs to call at least one perl API

        Indeed ... though it's also often the case that many functions that call the Perl API don't really need to.
        Here's a simplistic example:
            use strict;
            use warnings;

            use Inline C => Config =>
                PRE_HEAD => '#define PERL_NO_GET_CONTEXT 1',
                ;

            use Inline C => <<'EOC';

            /*
            SV * foo(int x) {
                return newSViv(x);
            }
            */

            int foo(int x) {
                return x;
            }

            EOC

            my $x = foo(-1234);
            print $x;
        Both renditions of foo() do essentially the same thing.
        But the rendition that has been commented out won't work when PERL_NO_GET_CONTEXT is defined, whereas the other rendition will.

        So there are possibilities even with the current Inline::C, depending upon how much time and energy you're prepared to devote in order to avoid the Perl API.
        But mostly, it's not worth the effort.
        (Of course, ideally you wouldn't even have to concern yourself with such matters when using Inline::C - and Ingy has indicated (in the link I provided earlier) that this might all be fixed in Inline::C following the Perl Conference (https://perlconference.us/tpc-2018-slc/) that begins in the next day or so.)
        In the meantime, if you want to define PERL_NO_GET_CONTEXT, then I think you're generally going to have to create an XS module.

        Cheers,
        Rob
      Could it be that the hint that vr is looking for is simply to "define PERL_NO_GET_CONTEXT" ?

      "Simply"!? No, it isn't simple :-) Not for me. And yes, with this define, a no-op stub performs equally fast on both Linux and threaded Win32, and the time for the test in the OP is 3.7 sec, while it was ~5 and ~11, respectively. (So, BrowserUk, it looks like this stub wasn't optimized away.) Thank you for the link and explanation; now at least I have some idea of what's going on. Hash::Util has this magic incantation as the first line of its XS, while Array::RefElem doesn't have it anywhere, which explains their different speeds, too. My real C code calls SvPV and others; with this define it stops working, as explained in the link you provided. I'll have to solve this, but these are details to work out.

        (So, BrowserUk, it looks like this stub wasn't optimized away.)

        Hm. If you look above, you'll see that the 'call' from the XS wrapper to void test( SV *sv ) { ++i; } gets inlined to just 1 instruction:

            ; 31   : test(sv);

                inc     DWORD PTR i

        However, defining PERL_NO_GET_CONTEXT doesn't change a thing in the generated assembler. Of course, that is pre-optimisation code, so your timings may be a better indicator.

        That said, I think you would be better off looking at ways to try and move some or all of your loop into C, rather than trying to optimise the calls from Perl to C.

        What I mean is, if you are calling from Perl -> C 10e8 times, then your Perl code must consist of one or more loops. Whilst there are obviously some savings to be had by minimising the perl -> C -> perl transitions, there is (probably) a much larger saving to be had by moving the loop into C and avoiding all, or a large number, of those transitions.
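
The shape of that change can be sketched in plain C. Here boundary_crossings is a hypothetical counter standing in for each Perl -> C -> Perl transition, test_once() mirrors the earlier test() function, and test_batch() is an invented entry point that runs the whole loop on the C side. It is only a model of the call pattern, not Inline::C code.

```c
#include <assert.h>

static int i = 0;                    /* same role as the global in test() */
static long boundary_crossings = 0;  /* hypothetical: models Perl -> C calls */

/* One increment per crossing, like calling test() from a Perl loop. */
void test_once(void) {
    ++boundary_crossings;
    ++i;
}

/* One crossing for the whole loop: the iteration happens in C. */
int test_batch(int n) {
    assert(n >= 0);
    ++boundary_crossings;
    for (int j = 0; j < n; ++j)
        ++i;
    return i;
}

/* Helper for experiments: reset both counters. */
void reset_counters(void) {
    i = 0;
    boundary_crossings = 0;
}

/* Number of modelled crossings since the last reset. */
long crossings(void) {
    return boundary_crossings;
}
```

Under Inline::C the batched version would presumably be exposed the same way as test() and called once with $N, instead of inside for 1 .. $N, so the wrapper overhead shown earlier is paid once rather than $N times.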

        As an extreme example, the de Bruijn sequence generator I recently ported from Python to Perl takes 1587 seconds to generate the de Bruijn sequence for 8-char substrings from a 10-char alphabet; but when ported to C, that drops to 0.58 seconds (a 99.96% reduction!):

            C:\test>DeBruijnX -N=8 -ALPHA=0123456789
            Took: 1586.944328 secs
            100000000
            Took: 0.579065 secs
            100000000

        And a very large part of that massive saving is avoiding the perl function call overhead of the 16 million recursive function calls involved:

            #! perl -slw
            use strict;
            # use Config; print $Config{ ccflags };
            use Inline C => Config => BUILD_NOISY => 1; #, CCFLAGS => $Config{ ccflags } . "/link /FAs";
            use Inline C => <<'END_C', NAME => '_deBruijn', CLEAN_AFTER_BUILD => 0;

            /* As placed here this follows Inline::C's auto-included perl.h, so it
               probably comes too late to take effect; PRE_HEAD is the Config
               option intended for it. */
            #define PERL_NO_GET_CONTEXT 1

            int n, iseq;
            STRLEN k;
            char *seq, *a;

            void dbc( int t, int p ) {
                int i;
                if( t > n ) {
                    if( n % p == 0 )
                        for( i = 1; i <= p; ++i )
                            seq[ iseq++ ] = a[ i ];
                }
                else {
                    a[ t ] = a[ t - p ];
                    dbc( t+1, p );
                    for( i = a[ (t - p) ] + 1; i < k; ++i ) {
                        a[ t ] = i;
                        dbc( t+1, t );
                    }
                }
            }

            SV *deBruijnC( SV *svAlphabet, SV *len ) {
                int i;
                char *alphabet = SvPV( svAlphabet, k );
                n = (int)SvIV( len );
                iseq = 0;
                Newxz( seq, (int)pow( (double)k, (double)n), char );
                Newxz( a, k * n, char );
                dbc( 1, 1 );
                for( i = 0; i < iseq ; ++i ) {
                    seq[ i ] = alphabet[ seq[ i ] ];
                }
                return newSVpv( seq, iseq );
            }
            END_C

        Defining PERL_NO_GET_CONTEXT doesn't stop it from running, but it doesn't improve performance one iota.


