in reply to Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
in thread Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)

Thanks for great challenge, first of all (you were our taskmaster, right?) I didn't mention your solution, because it was kind of slow for comparisons. But now I see where it and challenge itself have originated. PDL is very TIMTOWTDI-ish, and range is ultimately important tool, to extract/address rectangular areas from ndarrays of any dimensions. But, eh-hm, don't you see that the POD you linked sings praises to wonders of broadcasting, which (i.e. broadcasting) you simply discarded? Broadcasting really only happens in this fragment:

...-> sumover-> sumover

which you replaced with

...-> clump(2)-> sumover

(Frankly, it's obvious I think that huge speed difference of Game-of-Life implementations in the linked tutorial is due to the way looping was generally performed rather than this "broadcasting" only, -- but perhaps that's how tutorials work.)

Consider:

sub sms_WxH_PDL_range ( $m, $w, $h ) { my ( $W, $H ) = $m-> dims; $m-> range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ]) -> reorder( 2, 3, 0, 1 ) -> clump( 2 ) -> sumover } sub sms_WxH_PDL_range_b ( $m, $w, $h ) { my ( $W, $H ) = $m-> dims; $m-> range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ]) -> reorder( 2, 3, 0, 1 ) -> sumover -> sumover } __END__ Time (s) vs. N (NxN submatrix, PDL: Double D [300,300] matrix) +-----------------------------------------------------------+ |+ + + + + + | 1.6 |-+ A +-| | | | | 1.4 |-+ +-| | | 1.2 |-+ +-| | A | | | 1 |-+ +-| | | | B | 0.8 |-+ +-| | A | | | 0.6 |-+ B +-| | | 0.4 |-+ A +-| | B | | A | 0.2 |-+ A B +-| | A B B | | A B B D D | 0 |-+ D D D D D D D D D C C +-| |+ + + + + + | +-----------------------------------------------------------+ 0 5 10 15 20 25 sms_WxH_PDL_range A sms_WxH_PDL_range_b B sms_WxH_PDL_lags C sms_WxH_PDL_naive D +----+-------+-------+-------+-------+ | N | A | B | C | D | +----+-------+-------+-------+-------+ | 2 | 0.015 | 0.008 | 0.000 | 0.000 | | 3 | 0.021 | 0.018 | 0.000 | 0.000 | | 4 | 0.044 | 0.021 | 0.000 | 0.000 | | 5 | 0.073 | 0.047 | 0.000 | 0.003 | | 6 | 0.101 | 0.060 | 0.000 | 0.000 | | 8 | 0.193 | 0.104 | 0.000 | 0.005 | | 10 | 0.294 | 0.138 | 0.000 | 0.005 | | 12 | 0.435 | 0.232 | 0.000 | 0.010 | | 16 | 0.711 | 0.344 | 0.000 | 0.015 | | 20 | 1.115 | 0.549 | 0.000 | 0.026 | | 25 | 1.573 | 0.828 | 0.000 | 0.047 | +----+-------+-------+-------+-------+

(I took liberty to use couple of numbers as args to ndcoords instead of matrix/slice, which only serves as source of these 2 numbers). Note, the matrix is now smaller than in previous tests. Both A and B versions are very much slower than the so far slowest "naive" variant. Though ndcoords builds a relatively large ndarray to feed to range, I think range is simply not written with speed/performance as its goal.

It's actually tempting to try to improve Game of Life PDL implementation from the tutorial:

use strict; use warnings; use experimental qw/ say postderef signatures /; use Time::HiRes 'time'; use PDL; use PDL::NiceSlice; use Test::PDL 'eq_pdl'; use constant STEPS => 100; my $x = zeroes( 200, 200 ); # Put in a simple glider. $x(1:3,1:3) .= pdl ( [1,1,1], [0,0,1], [0,1,0] ); my $backup = $x-> copy; printf "Game of Life!\nMatrix: %s, %d generations\n", $x-> info, STEPS; # Tutorial my $t = time; my $ct = 0; for ( 1 .. STEPS ) { my $t_ = time; # Calculate the number of neighbours per cell. my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1); $n = $n->sumover->sumover - $x; $ct += time - $t_; # Calculate the next generation. $x = ((($n == 2) + ($n == 3))* $x) + (($n==3) * !$x); } printf "Tutorial: %0.3f s (core time: %0.3f)\n", time - $t, $ct; # "Lags" my $m = $backup-> copy; $t = time; $ct = 0; for ( 1 .. STEPS ) { my $t_ = time; # Calculate the number of neighbours per cell. my $n = sms_GoL_lags( $m ) - $m; $ct += time - $t_; # Calculate the next generation. $m = ((($n == 2) + ($n == 3))* $m) + (($n == 3) * !$m); } printf "\"lags\": %0.3f s (core time: %0.3f)\n", time - $t, $ct; die unless eq_pdl( $x, $m ); sub _do_dimension_GoL ( $m ) { $m-> slice( -1 )-> glue( 0, $m, $m-> slice( 0 )) -> lags( 0, 1, ( $m-> dims )[0] ) -> sumover -> slice( '', '-1:0' ) -> xchg( 0, 1 ) } sub sms_GoL_lags ( $m ) { _do_dimension_GoL _do_dimension_GoL $m } __END__ Game of Life! Matrix: PDL: Double D [200,200], 100 generations Tutorial: 1.016 s (core time: 0.835) "lags": 0.283 s (core time: 0.108)

Sorry about crude profiling/tests; and improvement is somewhat far from what I expected. Even with "core time" singled out -- because next gen calculation is not very efficient (e.g. $n == 3 array is built twice), but that's another story -- which is "only" 8x better. Maybe all this glueing/appending to maintain constant matrix size and "wrap around" at edges takes its toll.

Replies are listed 'Best First'.
Re^3: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by jo37 (Curate) on Dec 31, 2023 at 17:25 UTC

    Hi Steve (assuming it's you),

    I haven't spent any effort on optimizing the solution. Thank you for running the tests! It is interesting to see that ->clump(2)->sumover doesn't perform as good as ->sumover->sumover. Actually I didn't look up the example in PDL::Broadcasting but wrote my solution from memory. Looks like I'm getting older :-)

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re^3: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)
by Anonymous Monk on Jan 02, 2024 at 12:24 UTC

    Here is last (hopefully) instalment in this thread; it only concerns faster, compared to the PDL CPAN tutorial, "Game of Life" implementation. But maybe some features used are too advanced for a tutorial. Improvement comes from:

    • To achieve wrap around, the ugly glueing in parent node is replaced with dicing along each axis. Very convenient function.
    • Horrible expression to compute next generation (requires 8 operations on whole arrays) is replaced with shorter version (just 4).
    use strict; use warnings; use feature 'say'; use Time::HiRes 'time'; use PDL; use PDL::NiceSlice; use Test::PDL 'eq_pdl'; use constant { WIDTH => 20, HEIGHT => 20, STEPS => 1000, }; my $x = zeroes long, WIDTH, HEIGHT; # Put in a simple glider. $x(1:3,1:3) .= pdl ( [1,1,1], [0,0,1], [0,1,0] ); my $backup = $x-> copy; printf "Game of Life!\nMatrix: %s, %d generations\n", $x-> info, STEPS; # Tutorial my $t = time; for ( 1 .. STEPS ) { my $t_ = time; # Calculate the number of neighbours per cell. my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1); $n = $n->sumover->sumover - $x; # Calculate the next generation. $x = ((($n == 2) + ($n == 3))* $x) + (($n == 3) * !$x); } printf "Tutorial: %0.3f s\n", time - $t; # Faster my $m = $backup-> copy; $t = time; my $wrap_w = pdl [ reverse WIDTH - 1, ( 0 .. WIDTH - 1 ), 0 ]; my $wrap_h = pdl [ reverse HEIGHT - 1, ( 0 .. HEIGHT - 1 ), 0 ]; for ( 1 .. STEPS ) { my $n = $m -> dice_axis( 0, $wrap_w ) -> lags( 0, 1, WIDTH ) -> sumover -> dice_axis( 1, $wrap_h ) -> lags( 1, 1, HEIGHT ) -> xchg( 0, 1 ) -> sumover; $n -= $m; $m = ( $n == 3 ) | $m & ( $n == 2 ) } printf "Faster: %0.3f s\n", time - $t; die unless eq_pdl( $x, $m ); __END__ Game of Life! Matrix: PDL: Long D [20,20], 1000 generations Tutorial: 0.341 s Faster: 0.111 s Matrix: PDL: Long D [200,200], 100 generations Tutorial: 0.845 s Faster: 0.086 s Matrix: PDL: Long D [1000,1000], 20 generations Tutorial: 4.422 s Faster: 0.443 s Matrix: PDL: Long D [2000,2000], 10 generations Tutorial: 8.878 s Faster: 0.872 s

      Oops, sorry, one more: subtraction is not required of course, and, more importantly, next gen calculation expression can be split in 2 lines -- same 4 ops but presumably done in-place i.e. faster:

      $m &= $n == 4; $m |= $n == 3; __END__ Game of Life! Matrix: PDL: Long D [2000,2000], 10 generations Faster+: 0.677 s