Re^10: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)

Replies are listed 'Best First'.
Re^11: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by marioroy (Prior) on May 21, 2022 at 01:16 UTC
First, I disabled SMT (hyperthreading) to ensure two threads do not run on a physical core. Next, I increased chunk size from 40,000 to 200,000 to better understand the time gap between PDL 2.021 and PDL 2.079. `PDL 2.021: perl -d:NYTProf demo_win.pl 1e7 # 5.238 secs. PDL 2.079: perl -d:NYTProf demo_win.pl 1e7 # 9.511 secs.` [download] There are many subroutine calls using PDL 2.079, not present using PDL 2.021. This is not the reason for the slowness, but simply noting due to the high number of calls. Well, I reverted the File::Which change locally to be sure not the reason. `Calls P F ExTime InTime Subroutine 61776 1 1 186ms 186ms File::Which::CORE::regcomp (opcode) 18 2 1 119ms 394ms File::Which::which 5616 1 1 66.6ms 66.6ms File::Which::CORE:ftdir (opcode) 61776 1 1 15.9ms 15.9ms File::Which::CORE:match (opcode) 5616 1 1 800盜 800盜 File::Which::CORE:fteexec (opcode) 5148 1 1 668盜 668盜 File::Which::CORE:ftis (opcode)` [download] Testing was done using Strawberry Perl v5.32.1.1 - PDL edition. I extracted the bundle twice to C:\perl-5.32.0.1-PDL and C:\perl-5.32.0.1-recent (updated PDL from 2.021 to 2.079 - that is obtaining PDL 2.079 and run perl Makefile.PL followed by gmake install). I'm hoping that someone on the PDL team can take similar steps to determine the issue. I ran with 8 workers in demo_win.pl. The slowness is also present on Linux. PDL 2.021 (2.190 secs) vs PDL 2.079 (2.913 secs). Modules `File::Map 0.67 MCE 1.878` [download] Update:The following is a test script factoring out MCE, File::Map, and PDL::IO::FastRaw. use strict; use warnings; use feature 'say'; use PDL; use Time::HiRes 'time'; { no warnings 'once'; $PDL::BIGPDL = 1; eval q{ PDL::set_autopthread_targ(1) }; } use constant MAX => shift \|\| 500000; use constant MAXLEN => MAX * 1; my $t = time; my $lengths = ones( short, 3 + MAXLEN ); $lengths-> inplace-> setvaltobad( 1 ); $lengths-> set( 1, 1 ); $lengths-> set( 2, 2 ); $lengths-> set( 4, 3 ); my ($from, $to) = (0, MAX); my $seqs_c = $from + sequence( longlong, $to - $from + 1 ); $seqs_c-> setbadat( 0 ); $seqs_c-> setbadat( 1 ); $seqs_c-> badvalue( 2 ); my $lengths_c = $lengths-> slice([ $from, $to ]); my $current = zeroes( short, nelem( $seqs_c )); while ( any $seqs_c-> isgood ) { my ( $seqs_c_odd, $current_odd_masked ) = where( $seqs_c, $current, $seqs_c & 1 ); $current_odd_masked ++; $current ++; ( $seqs_c_odd = 3 ) ++; $seqs_c >>= 1; my ( $seqs_cap, $lengths_cap, $current_cap ) = where( $seqs_c, $lengths_c, $current, $seqs_c <= MAXLEN ); my $lut = $lengths-> index( $seqs_cap ); # "_f" is for "finished" my ( $seqs_f, $lengths_f, $lut_f, $current_f ) = where( $seqs_cap, $lengths_cap, $lut, $current_cap, $lut-> isgood ); $lengths_f .= $lut_f + $current_f; $seqs_f .= 2; # i.e. BAD } say {STDERR} time - $t; [download] PDL 2.021 is noticeably faster than PDL 2.079. `$PDL::BIGPDL = 1; PDL 2.021: perl test.pl 2e6 # 4.590 secs. PDL 2.079: perl test.pl 2e6 # 7.252 secs. # $PDL::BIGPDL = 1; # line commented out PDL 2.021: perl test.pl 2e6 # 4.490 secs. PDL 2.079: perl test.pl 2e6 # 5.252 secs.` [download]	[reply] [d/l] [select]
Re^12: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by etj (Priest) on May 22, 2022 at 00:22 UTC
Thanks, this is amazing work! How much work would it be to try each of the operations within the script some suitable number of times, on each PDL version (or at least `isgood`), in order to see if any of them stand out as the one that's got slower? (or conceivably it's across the board). Immediate surmise: there have been updates to the badvalue detection functionality to allow NaN to be used as a badvalue in 2.040 and fixes to that in 2.064. Ergo, if you do have capacity to performance-test at least `isgood` and see that's what's slowed down, that will be at least possible to fix. One approach there would be to break out the "badvalue is NaN" branch out of the current broadcast loop into its own loop to avoid constantly checking a value that doesn't change within nearly all operations (the badvalue) to see if it's still (or still not) a NaN.	[reply] [d/l] [select]
Re^13: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by marioroy (Prior) on May 22, 2022 at 00:39 UTC
Greetings, etj. My initial post was simply heads up with regards to slowness detected. I left the test code for you; or if you like, for the PDL team. If you haven't already, you'd find interesting things using Devel::NYTProf. A lot of subroutines are noticeably slower compared to PDL 2.021. The test script provides a starting point.	[reply]
Re^14: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by etj (Priest) on May 22, 2022 at 00:47 UTC
Re^15: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by etj (Priest) on May 31, 2022 at 22:47 UTC
Re^15: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by marioroy (Prior) on May 22, 2022 at 00:50 UTC