http://qs1969.pair.com?node_id=1216775


in reply to Re^3: Inline::C on Windows: how to improve performance of compiled code?
in thread Inline::C on Windows: how to improve performance of compiled code?

ikegami, yes, both Perls are threaded 5.26 (it's in OP).

syphilis, thanks a lot for your research. I still hope C code can be compiled into a "faster version", i.e. with speed on par with unthreaded or Linux Perls. Core XS modules don't run 2 times slower on Windows.

Here's a little experiment. The hv_store (Hash::Util) is documented as

hv_store() is from Array::RefElem, Copyright 2000 Gisle Aas

(BTW, as an aside: some funny and interesting things can be done with aliases stored as aggregate elements. It seems references overshadowed this feature, which has been there for free for decades; no new experimental "refaliasing" is required for it (not surprising, since these are "aggregate elements", not regular variables). But you know all that.)
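To make the aside concrete, here is a minimal sketch of such an alias, using only the core Hash::Util module. The variable names are made up for illustration; the behaviour shown (the hash slot and the variable being one and the same scalar) is what the Hash::Util documentation describes for hv_store.

```perl
use strict;
use warnings;
use Hash::Util qw(hv_store);

my %h;
my $x = 1;

# hv_store puts $x itself (not a copy) into the hash slot 'foo'.
hv_store(%h, 'foo', $x) or die "Failed to alias!";

$x = 42;                            # change the variable ...
my $seen_via_hash = $h{foo};        # ... and the hash element shows it

$h{foo} = 7;                        # assign through the hash element ...
my $seen_via_var = $x;              # ... and the variable shows it

# Both names refer to the same underlying scalar.
my $same_scalar = ( \$h{foo} == \$x ) ? 1 : 0;

print "$seen_via_hash $seen_via_var $same_scalar\n";   # prints: 42 7 1
```

A plain `$h{foo} = $x` would copy the value instead, so later changes to `$x` would not show up in the hash.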

So, a couple of weeks ago I accidentally found that Hash::Util::hv_store is almost exactly 2 times faster than Array::RefElem::hv_store. I thought it strange but didn't investigate it then. Now:

>perl -MTime::HiRes=time -MHash::Util=hv_store -wE"%h=();$t=time;hv_store(%h,'foo',42) for 1..1e8;say time-$t"
10.8255708217621

>perl -MTime::HiRes=time -MArray::RefElem=hv_store -wE"%h=();$t=time;hv_store(%h,'foo',42) for 1..1e8;say time-$t"
20.3003470897675

$ perl -MTime::HiRes=time -MHash::Util=hv_store -wE'%h=();$t=time;hv_store(%h,"foo",42) for 1..1e8;say time-$t'
10.5545630455017

$ perl -MTime::HiRes=time -MArray::RefElem=hv_store -wE'%h=();$t=time;hv_store(%h,"foo",42) for 1..1e8;say time-$t'
12.1946179866791

The question is whether it's possible to compile anything (Array::RefElem or inline C) to be fast on threaded Win32 Perl. The fact that e.g. Hash::Util exists suggests the answer is "yes". But how? :)

Re^5: Inline::C on Windows: how to improve performance of compiled code?
by BrowserUk (Patriarch) on Jun 16, 2018 at 11:32 UTC

    I backed away from this when I saw you weren't using an MS compiler, as I've no experience of gcc/mingw, but it seems to me that this will remain a mystery until you start inspecting the generated code. With MS CL, adding /link /FAs to the compiler options causes it to output a .asm file.

    When I run the following:

    #! perl -slw
    use strict;
    use Config;

    print $Config{ ccflags };

    use Inline C => Config => BUILD_NOISY => 1, CCFLAGS => $Config{ ccflags } . "/link /FAs";
    use Inline C => <<'END_C', NAME => '_junk', CLEAN_AFTER_BUILD =>0;

    int i = 0;

    void test( SV *sv ) {
        ++i;
        return;
    }

    int check( SV *sv ) {
        return i;
    }

    END_C

    use Time::HiRes qw[ time ];

    our $N //= 1e6;

    my $start = time;
    my $i = 0;
    $i = test( 1 ) for 1 .. $N;
    printf "Took %fseconds\n", time() - $start;
    print check( 1 )

    The assembly code produced for test() is pretty much exactly what you'd expect:

    PUBLIC  test
    ; Function compile flags: /Ogtpy
    _TEXT   SEGMENT
    sv$ = 8
    test    PROC

    ; 10   :     ++i;

        inc     DWORD PTR i

    ; 11   :     return;
    ; 12   : }

        ret     0
    test    ENDP
    _TEXT   ENDS

    But then you have to look at the Perl callable wrapper function to see all the overhead that Perl-callability adds:

    _TEXT   SEGMENT
    my_perl$ = 48
    cv$ = 56
    XS_main_test PROC

    ; 174  : {

        mov     QWORD PTR [rsp+8], rbx
        mov     QWORD PTR [rsp+16], rsi
        push    rdi
        sub     rsp, 32                 ; 00000020H
        mov     rdi, rdx

    ; 175  :     dVAR; dXSARGS;

        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Istack_sp_ptr
        mov     rbx, QWORD PTR [rax]
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Imarkstack_ptr_ptr
        mov     rcx, QWORD PTR [rax]
        add     rcx, -4
        movsxd  rsi, DWORD PTR [rcx+4]
        mov     QWORD PTR [rax], rcx
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Istack_base_ptr
        mov     rax, QWORD PTR [rax]
        lea     rdx, QWORD PTR [rax+rsi*8]
        inc     esi
        sub     rbx, rdx
        sar     rbx, 3

    ; 176  :     if (items != 1)

        cmp     ebx, 1
        je      SHORT $LN8@XS_main_te

    ; 177  :        croak_xs_usage(cv,  "sv");

        call    Perl_get_context
        lea     r8, OFFSET FLAT:??_C@_02CPGMCOJE@sv?$AA@
        mov     rdx, rdi
        mov     rcx, rax
        call    Perl_croak_xs_usage
    $LN8@XS_main_te:

    ; 178  :     PERL_UNUSED_VAR(ax); /* -Wall */
    ; 179  :     SP -= items;
    ; 180  :     {
    ; 181  :     SV * sv = ST(0)
    ; 182  : ;

        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Istack_base_ptr

    ; File c:\test\_inline\build\_junk\_junk.xs
    ; 30   :     temp = PL_markstack_ptr++;

        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Imarkstack_ptr_ptr

    ; 31   :     test(sv);

        inc     DWORD PTR i
        mov     rbx, QWORD PTR [rax]
        lea     rcx, QWORD PTR [rbx+4]
        mov     QWORD PTR [rax], rcx

    ; 32   :     if (PL_markstack_ptr != temp) {

        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Imarkstack_ptr_ptr
        cmp     QWORD PTR [rax], rbx
        je      SHORT $LN4@XS_main_te

    ; 33   :           /* truly void, because dXSARGS not invoked */
    ; 34   :       PL_markstack_ptr = temp;

        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Imarkstack_ptr_ptr
        mov     QWORD PTR [rax], rbx

    ; 35   :       XSRETURN_EMPTY; /* return empty stack */

        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Istack_base_ptr
        movsxd  rcx, esi
        mov     rax, QWORD PTR [rax]
        lea     rbx, QWORD PTR [rax+rcx*8-8]
        call    Perl_get_context
        mov     rcx, rax
        call    Perl_Istack_sp_ptr
        mov     QWORD PTR [rax], rbx
    $LN4@XS_main_te:

    ; File c:\test\_inline\build\_junk\_junk.c
    ; 200  : }

        mov     rbx, QWORD PTR [rsp+48]
        mov     rsi, QWORD PTR [rsp+56]
        add     rsp, 32                 ; 00000020H
        pop     rdi
        ret     0
    XS_main_test ENDP
    _TEXT   ENDS

    And the real eye-opener comes when you start looking at the code behind those call Perl_xxx; littered all over the place. (Why is it necessary to call Perl_get_context() 9 times for EVERY CALL to such a simple function?)

    If you assume that your original empty C stub is actually causing code to be generated and run -- and I don't; I think your call to the empty function is being optimised away -- then it would be instructive to see the difference in the code that is being called.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
      Why is it necessary to call Perl_get_context() 9 times for EVERY CALL to such a simple function?

      I think that is unnecessary and I would expect that those Perl_get_context() calls could be removed by declaring
      PRE_HEAD => '#define PERL_NO_GET_CONTEXT 1',
      in your script's "Config" section.

      If I don't define PERL_NO_GET_CONTEXT, then for me your script outputs:
      Took 0.160126seconds
      1000000
      With PERL_NO_GET_CONTEXT defined it runs twice as quickly:
      Took 0.072088seconds
      1000000
      (I've ignored the CCFLAGS output that is also produced.)
      AIUI, the problem with defining PERL_NO_GET_CONTEXT in Inline::C scripts is that it causes breakage if any of the Inline::C functions call Perl API functions.
      But none of the functions in your script call Perl API functions, so it's ok to define PERL_NO_GET_CONTEXT.
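For functions that do call the Perl API, one commonly used way around that breakage is to put dTHX at the top of each such function, so the interpreter context is fetched once per call and is then in scope for every API macro. The following is only a sketch under assumptions: it needs Inline::C and a working C compiler, and byte_len is a hypothetical function invented for illustration, not code from this thread.

```perl
use strict;
use warnings;

# PRE_HEAD text is emitted before perl.h is included, so the define takes effect.
use Inline C => Config =>
    PRE_HEAD => '#define PERL_NO_GET_CONTEXT 1';

use Inline C => <<'END_C';
/* byte_len is a hypothetical example function, not from the thread. */
int byte_len(SV *sv) {
    dTHX;                  /* fetch the interpreter context once, up front */
    STRLEN len;
    (void)SvPV(sv, len);   /* API macro now finds the context in scope */
    return (int)len;
}
END_C

print byte_len("hello"), "\n";
```

The design trade-off is that each such function still pays for one context lookup, instead of one lookup per API call. On an unthreaded perl, dTHX expands to nothing useful and costs nothing.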

      Could it be that the hint that vr is looking for is simply to "define PERL_NO_GET_CONTEXT" ?

      Cheers,
      Rob
        (I've ignored the CCFLAGS output that is also produced.)

        Just debug.

        AIUI, the problem with defining PERL_NO_GET_CONTEXT in Inline::C scripts is that it causes breakage if any of the Inline::C functions call Perl API functions. But none of the functions in your script call Perl API functions, so it's ok to define PERL_NO_GET_CONTEXT.

        Indeed. 'Tis unfortunate that almost every function that does anything useful needs to call at least one Perl API function.

        It is the case that many, if not all-but-one, of the Perl_get_context() calls get optimised away, but getting your hands on the post-optimised assembly is only possible by using a debugger, and it means relating any bug back to the pre-optimised C is a nightmare.

        Could it be that the hint that vr is looking for is simply to "define PERL_NO_GET_CONTEXT" ?

        Possibly; but getting his hands on the assembler output would be the surest way of finding out. That has to be possible with gcc/mingw, right?

        I still think that the chances are that gcc is optimising his c-stub and perl callable wrapper away completely.


        Could it be that the hint that vr is looking for is simply to "define PERL_NO_GET_CONTEXT" ?

        "Simply"!? No, it isn't simple :-) Not for me. And yes, with this define a no-op stub performs equally fast on both Linux and threaded Win32, and the time for the test in the OP is 3.7 sec, where it was ~5 and ~11, respectively. (So, BrowserUk, it looks like this stub wasn't optimized away.) Thank you for the link and explanation; now at least I have some idea what's going on. Hash::Util has this magic incantation as the first line of its XS, while Array::RefElem doesn't have it anywhere, which explains their different speeds, too. My real C code calls SvPV and others, and with this define it stops working, as explained in the link you provided. I'll have to solve this, but these are details to work out.