
Re^3: Inline::C on Windows: how to improve performance of compiled code?

by syphilis (Archbishop)
on Jun 16, 2018 at 05:43 UTC ( #1216771=note )


in reply to Re^2: Inline::C on Windows: how to improve performance of compiled code?
in thread Inline::C on Windows: how to improve performance of compiled code?

One can successfully force install Inline::C, but it's unusable

While that's so for the latest version of Inline::C, it's quite simple to install older versions of Inline::C on an unthreaded Windows perl (as they don't carry the Win32::IPC baggage that comes with recent versions).
So, I installed Inline-0.55, though I perhaps didn't need to go that far back.
Here are the results using vr's original one-liners:

On threaded perl-5.26.0 with current Inline::C version 0.78:
>perl -MTime::HiRes=time -MInline=C,"void foo(){}" -wE"$t=time;foo()for 1..1e8;say time-$t"
11.0136189460754
>perl -MTime::HiRes=time -wE"$t=time;sub foo(){}foo()for 1..1e8;say time-$t"
5.39761018753052


On threaded perl-5.26.0 with Inline::C version 0.55:
>perl -MTime::HiRes=time -MInline=C,"void foo(){}" -wE"$t=time;foo()for 1..1e8;say time-$t"
10.4052181243896
>perl -MTime::HiRes=time -wE"$t=time;sub foo(){}foo()for 1..1e8;say time-$t"
5.58481001853943


On unthreaded perl-5.26.0 with Inline::C version 0.55:
>perl -MTime::HiRes=time -MInline=C,"void foo(){}" -wE"$t=time;foo()for 1..1e8;say time-$t"
4.92960906028748
>perl -MTime::HiRes=time -wE"$t=time;sub foo(){}foo()for 1..1e8;say time-$t"
7.65961289405823

It therefore appears that reverting to an older version of Inline::C makes very little difference, whereas using Inline::C on an unthreaded Windows perl-5.26.0 markedly improves performance when calling Inline::C subs from perl.
Unfortunately, it also seems that calling perl subs on an unthreaded Windows perl-5.26.0 takes roughly 40% longer (7.66 seconds, as compared to the 5.40 to 5.58 seconds it takes on the threaded perl).
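
The ratios behind those two observations can be recomputed from the timings reported above. This is only a small arithmetic sketch: the hash keys and variable names are made up for illustration, and the comparison uses the two Inline::C 0.55 runs so that only the threading differs.

```perl
use strict;
use warnings;

# Timings in seconds, copied from the runs above:
# [ Inline::C sub, pure-Perl sub ]
my %runs = (
    'threaded, Inline::C 0.78'   => [ 11.0136189460754, 5.39761018753052 ],
    'threaded, Inline::C 0.55'   => [ 10.4052181243896, 5.58481001853943 ],
    'unthreaded, Inline::C 0.55' => [ 4.92960906028748, 7.65961289405823 ],
);

my $thr   = $runs{'threaded, Inline::C 0.55'};
my $unthr = $runs{'unthreaded, Inline::C 0.55'};

# A ratio below 1 means the unthreaded perl was faster.
my $c_ratio    = $unthr->[0] / $thr->[0];
my $perl_ratio = $unthr->[1] / $thr->[1];

printf "Inline::C sub, unthreaded/threaded: %.2f\n", $c_ratio;
printf "Perl sub,      unthreaded/threaded: %.2f\n", $perl_ratio;
```

The same division against the Inline::C 0.78 row gives the other end of the range for the pure-Perl sub.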

Of course, things might be quite different on the soon-to-be-released perl-5.28.0.
And things might also be quite different on 32-bit builds of perl.

Cheers,
Rob

Re^4: Inline::C on Windows: how to improve performance of compiled code?
by vr (Curate) on Jun 16, 2018 at 10:34 UTC

    ikegami, yes, both Perls are threaded 5.26 (it's in OP).

    syphilis, thanks a lot for your research. I still hope C code can be compiled into a "faster version", i.e. with speed on par with unthreaded or Linux Perls. Core XS modules don't run 2 times slower on Windows.

    Here's a little experiment. The hv_store (Hash::Util) is documented as

    hv_store() is from Array::RefElem, Copyright 2000 Gisle Aas

    (As an aside, some funny and interesting things can be done with aliases stored as aggregate elements. It seems references overshadowed this feature, which has been there for free for decades; no new experimental "refaliasing" is required for it, which is not surprising since these are "aggregate elements", not regular variables. But you know all that.)
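
A minimal sketch of what an alias stored as a hash element looks like, using the core Hash::Util::hv_store. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Hash::Util qw(hv_store);

my %h;
my $x = 1;

# Store an alias to $x (not a copy of its value) under the key 'foo'.
hv_store(%h, 'foo', $x);

$x = 2;                # change the variable ...
print "$h{foo}\n";     # ... and the hash element shows the new value

$h{foo} = 3;           # assign through the hash element ...
print "$x\n";          # ... and the variable shows the new value
```

Both names refer to the same scalar, so no reference or dereferencing is involved on either side.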

    So, a couple weeks ago I accidentally found Hash::Util::hv_store is almost exactly 2 times faster than Array::RefElem::hv_store. I thought it strange but didn't investigate it then. Now:

    >perl -MTime::HiRes=time -MHash::Util=hv_store -wE"%h=();$t=time;hv_store(%h,'foo',42) for 1..1e8;say time-$t"
    10.8255708217621
    >perl -MTime::HiRes=time -MArray::RefElem=hv_store -wE"%h=();$t=time;hv_store(%h,'foo',42) for 1..1e8;say time-$t"
    20.3003470897675

    $ perl -MTime::HiRes=time -MHash::Util=hv_store -wE'%h=();$t=time;hv_store(%h,"foo",42) for 1..1e8;say time-$t'
    10.5545630455017
    $ perl -MTime::HiRes=time -MArray::RefElem=hv_store -wE'%h=();$t=time;hv_store(%h,"foo",42) for 1..1e8;say time-$t'
    12.1946179866791

    The question is whether it's possible to compile anything (Array::RefElem or inline C) to be fast on threaded Win32 Perl. The fact that e.g. Hash::Util exists suggests the answer is "yes". But how? :)

      I backed away from this when I saw you weren't using an MS compiler, as I've no experience of gcc/mingw, but it seems to me that this will remain a mystery until you start inspecting the generated code. With MS CL, adding /link /FAs to the compiler options causes it to output a .asm file.

      When I run the following:

      #! perl -slw
      use strict;
      use Config;

      print $Config{ ccflags };

      use Inline C => Config => BUILD_NOISY => 1, CCFLAGS => $Config{ ccflags } . "/link /FAs";
      use Inline C => <<'END_C', NAME => '_junk', CLEAN_AFTER_BUILD =>0;

      int i = 0;

      void test( SV *sv ) {
          ++i;
          return;
      }

      int check( SV *sv ) {
          return i;
      }

      END_C

      use Time::HiRes qw[ time ];

      our $N //= 1e6;

      my $start = time;
      my $i = 0;
      $i = test( 1 ) for 1 .. $N;

      printf "Took %fseconds\n", time() - $start;
      print check( 1 )

      The assembly code produced for test() is pretty much exactly what you'd expect:

      PUBLIC test
      ; Function compile flags: /Ogtpy
      _TEXT SEGMENT
      sv$ = 8
      test PROC
      ; 10 : ++i;
          inc DWORD PTR i
      ; 11 : return;
      ; 12 : }
          ret 0
      test ENDP
      _TEXT ENDS

      But then you have to look at the Perl callable wrapper function to see all the overhead that Perl-callability adds:

      _TEXT SEGMENT
      my_perl$ = 48
      cv$ = 56
      XS_main_test PROC
      ; 174 : {
          mov QWORD PTR [rsp+8], rbx
          mov QWORD PTR [rsp+16], rsi
          push rdi
          sub rsp, 32 ; 00000020H
          mov rdi, rdx
      ; 175 : dVAR; dXSARGS;
          call Perl_get_context
          mov rcx, rax
          call Perl_Istack_sp_ptr
          mov rbx, QWORD PTR [rax]
          call Perl_get_context
          mov rcx, rax
          call Perl_Imarkstack_ptr_ptr
          mov rcx, QWORD PTR [rax]
          add rcx, -4
          movsxd rsi, DWORD PTR [rcx+4]
          mov QWORD PTR [rax], rcx
          call Perl_get_context
          mov rcx, rax
          call Perl_Istack_base_ptr
          mov rax, QWORD PTR [rax]
          lea rdx, QWORD PTR [rax+rsi*8]
          inc esi
          sub rbx, rdx
          sar rbx, 3
      ; 176 : if (items != 1)
          cmp ebx, 1
          je SHORT $LN8@XS_main_te
      ; 177 : croak_xs_usage(cv, "sv");
          call Perl_get_context
          lea r8, OFFSET FLAT:??_C@_02CPGMCOJE@sv?$AA@
          mov rdx, rdi
          mov rcx, rax
          call Perl_croak_xs_usage
      $LN8@XS_main_te:
      ; 178 : PERL_UNUSED_VAR(ax); /* -Wall */
      ; 179 : SP -= items;
      ; 180 : {
      ; 181 : SV * sv = ST(0)
      ; 182 : ;
          call Perl_get_context
          mov rcx, rax
          call Perl_Istack_base_ptr
      ; File c:\test\_inline\build\_junk\_junk.xs
      ; 30 : temp = PL_markstack_ptr++;
          call Perl_get_context
          mov rcx, rax
          call Perl_Imarkstack_ptr_ptr
      ; 31 : test(sv);
          inc DWORD PTR i
          mov rbx, QWORD PTR [rax]
          lea rcx, QWORD PTR [rbx+4]
          mov QWORD PTR [rax], rcx
      ; 32 : if (PL_markstack_ptr != temp) {
          call Perl_get_context
          mov rcx, rax
          call Perl_Imarkstack_ptr_ptr
          cmp QWORD PTR [rax], rbx
          je SHORT $LN4@XS_main_te
      ; 33 : /* truly void, because dXSARGS not invoked */
      ; 34 : PL_markstack_ptr = temp;
          call Perl_get_context
          mov rcx, rax
          call Perl_Imarkstack_ptr_ptr
          mov QWORD PTR [rax], rbx
      ; 35 : XSRETURN_EMPTY; /* return empty stack */
          call Perl_get_context
          mov rcx, rax
          call Perl_Istack_base_ptr
          movsxd rcx, esi
          mov rax, QWORD PTR [rax]
          lea rbx, QWORD PTR [rax+rcx*8-8]
          call Perl_get_context
          mov rcx, rax
          call Perl_Istack_sp_ptr
          mov QWORD PTR [rax], rbx
      $LN4@XS_main_te:
      ; File c:\test\_inline\build\_junk\_junk.c
      ; 200 : }
          mov rbx, QWORD PTR [rsp+48]
          mov rsi, QWORD PTR [rsp+56]
          add rsp, 32 ; 00000020H
          pop rdi
          ret 0
      XS_main_test ENDP
      _TEXT ENDS

      And the real eye-opener comes when you start looking at the code behind those call Perl_xxx; littered all over the place. (Why is it necessary to call Perl_get_context() 9 times for EVERY CALL to such a simple function?)

      If you assume that your original empty C stub is actually causing code to be generated and run -- and I don't; I think your call to the empty function is being optimised away -- then it would be instructive to see the difference in the code that is being called.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
      In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
        Why is it necessary to call Perl_get_context() 9 times for EVERY CALL to such a simple function?

        I think that is unnecessary and I would expect that those Perl_get_context() calls could be removed by declaring
        PRE_HEAD => '#define PERL_NO_GET_CONTEXT 1',
        in your script's "Config" section.
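
A minimal sketch of where that declaration goes, reduced from the test script above. It assumes Inline::C is installed and relies on PRE_HEAD text being emitted ahead of the Perl headers. The loop count of 1000 is arbitrary.

```perl
use strict;
use warnings;

# PRE_HEAD places the define before perl.h and XSUB.h are included,
# which is where PERL_NO_GET_CONTEXT has to be seen.
use Inline C => Config =>
    PRE_HEAD => '#define PERL_NO_GET_CONTEXT 1';

use Inline C => <<'END_C';
int i = 0;

/* Neither function calls into the Perl API. */
void test(SV *sv) { ++i; }
int check(SV *sv) { return i; }
END_C

test(1) for 1 .. 1000;
print check(1), "\n";
```

The Config line has to come before the `use Inline C` line that supplies the C code, since it only affects builds that follow it.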

        If I don't define PERL_NO_GET_CONTEXT, then for me your script outputs:
        Took 0.160126seconds
        1000000
        With PERL_NO_GET_CONTEXT defined it runs twice as quickly:
        Took 0.072088seconds
        1000000
        (I've ignored the CCFLAGS output that is also produced.)
        AIUI, the problem with defining PERL_NO_GET_CONTEXT in Inline::C scripts is that it causes breakage if any of the Inline::C functions call Perl API functions.
        But none of the functions in your script call Perl API functions, so it's ok to define PERL_NO_GET_CONTEXT.
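
For a function that does call the Perl API, the approach described in perlguts is to fetch the context explicitly with dTHX at the top of the function. The following is a minimal sketch under that assumption, with Inline::C installed; add_one() is a hypothetical example function, not something from the scripts above.

```perl
use strict;
use warnings;

use Inline C => Config =>
    PRE_HEAD => '#define PERL_NO_GET_CONTEXT 1';

use Inline C => <<'END_C';
/* add_one() is a hypothetical example function. */
int add_one(SV *sv) {
    dTHX;                 /* fetch the interpreter context once, up front */
    return SvIV(sv) + 1;  /* SvIV may call into the Perl API */
}
END_C

print add_one(41), "\n";
```

This still costs one context lookup per call on a threaded perl, but only one, rather than one for every API macro the function uses. On an unthreaded perl, dTHX expands to nothing.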

        Could it be that the hint that vr is looking for is simply to "define PERL_NO_GET_CONTEXT" ?

        Cheers,
        Rob
