Re: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2)

Replies are listed 'Best First'.
Re^2: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2) by Anonymous Monk on Dec 31, 2023 at 16:07 UTC
Thanks for great challenge, first of all (you were our taskmaster, right?) I didn't mention your solution, because it was kind of slow for comparisons. But now I see where it and challenge itself have originated. PDL is very TIMTOWTDI-ish, and `range` is ultimately important tool, to extract/address rectangular areas from ndarrays of any dimensions. But, eh-hm, don't you see that the POD you linked sings praises to wonders of broadcasting, which (i.e. broadcasting) you simply discarded? Broadcasting really only happens in this fragment: `...-> sumover-> sumover` [download] which you replaced with `...-> clump(2)-> sumover` [download] (Frankly, it's obvious I think that huge speed difference of Game-of-Life implementations in the linked tutorial is due to the way looping was generally performed rather than this "broadcasting" only, -- but perhaps that's how tutorials work.) Consider: sub sms_WxH_PDL_range ( $m, $w, $h ) { my ( $W, $H ) = $m-> dims; $m-> range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ]) -> reorder( 2, 3, 0, 1 ) -> clump( 2 ) -> sumover } sub sms_WxH_PDL_range_b ( $m, $w, $h ) { my ( $W, $H ) = $m-> dims; $m-> range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ]) -> reorder( 2, 3, 0, 1 ) -> sumover -> sumover } __END__ Time (s) vs. N (NxN submatrix, PDL: Double D [300,300] matrix) +-----------------------------------------------------------+ \|+ + + + + + \| 1.6 \|-+ A +-\| \| \| \| \| 1.4 \|-+ +-\| \| \| 1.2 \|-+ +-\| \| A \| \| \| 1 \|-+ +-\| \| \| \| B \| 0.8 \|-+ +-\| \| A \| \| \| 0.6 \|-+ B +-\| \| \| 0.4 \|-+ A +-\| \| B \| \| A \| 0.2 \|-+ A B +-\| \| A B B \| \| A B B D D \| 0 \|-+ D D D D D D D D D C C +-\| \|+ + + + + + \| +-----------------------------------------------------------+ 0 5 10 15 20 25 sms_WxH_PDL_range A sms_WxH_PDL_range_b B sms_WxH_PDL_lags C sms_WxH_PDL_naive D +----+-------+-------+-------+-------+ \| N \| A \| B \| C \| D \| +----+-------+-------+-------+-------+ \| 2 \| 0.015 \| 0.008 \| 0.000 \| 0.000 \| \| 3 \| 0.021 \| 0.018 \| 0.000 \| 0.000 \| \| 4 \| 0.044 \| 0.021 \| 0.000 \| 0.000 \| \| 5 \| 0.073 \| 0.047 \| 0.000 \| 0.003 \| \| 6 \| 0.101 \| 0.060 \| 0.000 \| 0.000 \| \| 8 \| 0.193 \| 0.104 \| 0.000 \| 0.005 \| \| 10 \| 0.294 \| 0.138 \| 0.000 \| 0.005 \| \| 12 \| 0.435 \| 0.232 \| 0.000 \| 0.010 \| \| 16 \| 0.711 \| 0.344 \| 0.000 \| 0.015 \| \| 20 \| 1.115 \| 0.549 \| 0.000 \| 0.026 \| \| 25 \| 1.573 \| 0.828 \| 0.000 \| 0.047 \| +----+-------+-------+-------+-------+ [download] (I took liberty to use couple of numbers as args to `ndcoords` instead of matrix/slice, which only serves as source of these 2 numbers). Note, the matrix is now smaller than in previous tests. Both A and B versions are very much slower than the so far slowest "naive" variant. Though `ndcoords` builds a relatively large ndarray to feed to `range`, I think `range` is simply not written with speed/performance as its goal. It's actually tempting to try to improve Game of Life PDL implementation from the tutorial: use strict; use warnings; use experimental qw/ say postderef signatures /; use Time::HiRes 'time'; use PDL; use PDL::NiceSlice; use Test::PDL 'eq_pdl'; use constant STEPS => 100; my $x = zeroes( 200, 200 ); # Put in a simple glider. $x(1:3,1:3) .= pdl ( [1,1,1], [0,0,1], [0,1,0] ); my $backup = $x-> copy; printf "Game of Life!\nMatrix: %s, %d generations\n", $x-> info, STEPS; # Tutorial my $t = time; my $ct = 0; for ( 1 .. STEPS ) { my $t_ = time; # Calculate the number of neighbours per cell. my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1); $n = $n->sumover->sumover - $x; $ct += time - $t_; # Calculate the next generation. $x = ((($n == 2) + ($n == 3))* $x) + (($n==3) * !$x); } printf "Tutorial: %0.3f s (core time: %0.3f)\n", time - $t, $ct; # "Lags" my $m = $backup-> copy; $t = time; $ct = 0; for ( 1 .. STEPS ) { my $t_ = time; # Calculate the number of neighbours per cell. my $n = sms_GoL_lags( $m ) - $m; $ct += time - $t_; # Calculate the next generation. $m = ((($n == 2) + ($n == 3))* $m) + (($n == 3) * !$m); } printf "\"lags\": %0.3f s (core time: %0.3f)\n", time - $t, $ct; die unless eq_pdl( $x, $m ); sub _do_dimension_GoL ( $m ) { $m-> slice( -1 )-> glue( 0, $m, $m-> slice( 0 )) -> lags( 0, 1, ( $m-> dims )[0] ) -> sumover -> slice( '', '-1:0' ) -> xchg( 0, 1 ) } sub sms_GoL_lags ( $m ) { _do_dimension_GoL _do_dimension_GoL $m } __END__ Game of Life! Matrix: PDL: Double D [200,200], 100 generations Tutorial: 1.016 s (core time: 0.835) "lags": 0.283 s (core time: 0.108) [download] Sorry about crude profiling/tests; and improvement is somewhat far from what I expected. Even with "core time" singled out -- because next gen calculation is not very efficient (e.g. `$n == 3` array is built twice), but that's another story -- which is "only" 8x better. Maybe all this glueing/appending to maintain constant matrix size and "wrap around" at edges takes its toll.	[reply] [d/l] [select]
Re^3: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2) by jo37 (Curate) on Dec 31, 2023 at 17:25 UTC
Hi Steve (assuming it's you), I haven't spent any effort on optimizing the solution. Thank you for running the tests! It is interesting to see that `->clump(2)->sumover` doesn't perform as good as `->sumover->sumover`. Actually I didn't look up the example in PDL::Broadcasting but wrote my solution from memory. Looks like I'm getting older :-) Greetings, -jo `$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`	[reply] [d/l] [select]
Re^3: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2) by Anonymous Monk on Jan 02, 2024 at 12:24 UTC
Here is last (hopefully) instalment in this thread; it only concerns faster, compared to the PDL CPAN tutorial, "Game of Life" implementation. But maybe some features used are too advanced for a tutorial. Improvement comes from: To achieve wrap around, the ugly glueing in parent node is replaced with dicing along each axis. Very convenient function. Horrible expression to compute next generation (requires 8 operations on whole arrays) is replaced with shorter version (just 4). use strict; use warnings; use feature 'say'; use Time::HiRes 'time'; use PDL; use PDL::NiceSlice; use Test::PDL 'eq_pdl'; use constant { WIDTH => 20, HEIGHT => 20, STEPS => 1000, }; my $x = zeroes long, WIDTH, HEIGHT; # Put in a simple glider. $x(1:3,1:3) .= pdl ( [1,1,1], [0,0,1], [0,1,0] ); my $backup = $x-> copy; printf "Game of Life!\nMatrix: %s, %d generations\n", $x-> info, STEPS; # Tutorial my $t = time; for ( 1 .. STEPS ) { my $t_ = time; # Calculate the number of neighbours per cell. my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1); $n = $n->sumover->sumover - $x; # Calculate the next generation. $x = ((($n == 2) + ($n == 3))* $x) + (($n == 3) * !$x); } printf "Tutorial: %0.3f s\n", time - $t; # Faster my $m = $backup-> copy; $t = time; my $wrap_w = pdl [ reverse WIDTH - 1, ( 0 .. WIDTH - 1 ), 0 ]; my $wrap_h = pdl [ reverse HEIGHT - 1, ( 0 .. HEIGHT - 1 ), 0 ]; for ( 1 .. STEPS ) { my $n = $m -> dice_axis( 0, $wrap_w ) -> lags( 0, 1, WIDTH ) -> sumover -> dice_axis( 1, $wrap_h ) -> lags( 1, 1, HEIGHT ) -> xchg( 0, 1 ) -> sumover; $n -= $m; $m = ( $n == 3 ) \| $m & ( $n == 2 ) } printf "Faster: %0.3f s\n", time - $t; die unless eq_pdl( $x, $m ); __END__ Game of Life! Matrix: PDL: Long D [20,20], 1000 generations Tutorial: 0.341 s Faster: 0.111 s Matrix: PDL: Long D [200,200], 100 generations Tutorial: 0.845 s Faster: 0.086 s Matrix: PDL: Long D [1000,1000], 20 generations Tutorial: 4.422 s Faster: 0.443 s Matrix: PDL: Long D [2000,2000], 10 generations Tutorial: 8.878 s Faster: 0.872 s [download]	[reply] [d/l]
Re^4: Fast sliding submatrix sums with PDL (inspired by PWC 248 task 2) by Anonymous Monk on Jan 03, 2024 at 12:02 UTC
Oops, sorry, one more: subtraction is not required of course, and, more importantly, next gen calculation expression can be split in 2 lines -- same 4 ops but presumably done in-place i.e. faster: `$m &= $n == 4; $m \|= $n == 3; __END__ Game of Life! Matrix: PDL: Long D [2000,2000], 10 generations Faster+: 0.677 s` [download]	[reply] [d/l]

Thanks for great challenge, first of all (you were our taskmaster, right?) I didn't mention your solution, because it was kind of slow for comparisons. But now I see where it and challenge itself have originated. PDL is very TIMTOWTDI-ish, and range is ultimately important tool, to extract/address rectangular areas from ndarrays of any dimensions. But, eh-hm, don't you see that the POD you linked sings praises to wonders of broadcasting, which (i.e. broadcasting) you simply discarded? Broadcasting really only happens in this fragment:

...-> sumover-> sumover
[download]

which you replaced with

...-> clump(2)-> sumover
[download]

(Frankly, it's obvious I think that huge speed difference of Game-of-Life implementations in the linked tutorial is due to the way looping was generally performed rather than this "broadcasting" only, -- but perhaps that's how tutorials work.)

Consider:

sub sms_WxH_PDL_range ( $m, $w, $h ) {
    my ( $W, $H ) = $m-> dims;
    
    $m-> range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ])
      -> reorder( 2, 3, 0, 1 )
      -> clump( 2 )
      -> sumover
}

sub sms_WxH_PDL_range_b ( $m, $w, $h ) {
    my ( $W, $H ) = $m-> dims;
    
    $m-> range( ndcoords( $W - $w + 1, $H - $h + 1 ), [ $w, $h ])
      -> reorder( 2, 3, 0, 1 )
      -> sumover
      -> sumover
}

__END__

     Time (s) vs. N (NxN submatrix, PDL: Double D [300,300] matrix)
      +-----------------------------------------------------------+
      |+          +         +          +          +          +    |
  1.6 |-+                                                    A  +-|
      |                                                           |
      |                                                           |
  1.4 |-+                                                       +-|
      |                                                           |
  1.2 |-+                                                       +-|
      |                                           A               |
      |                                                           |
    1 |-+                                                       +-|
      |                                                           |
      |                                                      B    |
  0.8 |-+                                                       +-|
      |                                  A                        |
      |                                                           |
  0.6 |-+                                         B             +-|
      |                                                           |
  0.4 |-+                        A                              +-|
      |                                  B                        |
      |                     A                                     |
  0.2 |-+               A        B                              +-|
      |             A   B   B                                     |
      |        A  B B                             D          D    |
    0 |-+  D D D  D D   D   D    D       D        C          C  +-|
      |+          +         +          +          +          +    |
      +-----------------------------------------------------------+
       0          5         10         15         20         25
                        sms_WxH_PDL_range    A
                      sms_WxH_PDL_range_b    B
                         sms_WxH_PDL_lags    C
                        sms_WxH_PDL_naive    D
  +----+-------+-------+-------+-------+
  | N  | A     | B     | C     | D     |
  +----+-------+-------+-------+-------+
  | 2  | 0.015 | 0.008 | 0.000 | 0.000 |
  | 3  | 0.021 | 0.018 | 0.000 | 0.000 |
  | 4  | 0.044 | 0.021 | 0.000 | 0.000 |
  | 5  | 0.073 | 0.047 | 0.000 | 0.003 |
  | 6  | 0.101 | 0.060 | 0.000 | 0.000 |
  | 8  | 0.193 | 0.104 | 0.000 | 0.005 |
  | 10 | 0.294 | 0.138 | 0.000 | 0.005 |
  | 12 | 0.435 | 0.232 | 0.000 | 0.010 |
  | 16 | 0.711 | 0.344 | 0.000 | 0.015 |
  | 20 | 1.115 | 0.549 | 0.000 | 0.026 |
  | 25 | 1.573 | 0.828 | 0.000 | 0.047 |
  +----+-------+-------+-------+-------+
[download]

(I took liberty to use couple of numbers as args to ndcoords instead of matrix/slice, which only serves as source of these 2 numbers). Note, the matrix is now smaller than in previous tests. Both A and B versions are very much slower than the so far slowest "naive" variant. Though ndcoords builds a relatively large ndarray to feed to range, I think range is simply not written with speed/performance as its goal.

It's actually tempting to try to improve Game of Life PDL implementation from the tutorial:

use strict;
use warnings;
use experimental qw/ say postderef signatures /;
use Time::HiRes 'time';

use PDL;
use PDL::NiceSlice;
use Test::PDL 'eq_pdl';

use constant STEPS => 100;
my $x = zeroes( 200, 200 );
 
# Put in a simple glider.
$x(1:3,1:3) .= pdl ( [1,1,1],
                     [0,0,1],
                     [0,1,0] );
my $backup = $x-> copy;

printf "Game of Life!\nMatrix: %s, %d generations\n", 
    $x-> info, STEPS;

# Tutorial
my $t = time;
my $ct = 0;
for ( 1 .. STEPS ) {
    my $t_ = time;
    # Calculate the number of neighbours per cell.
    my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1);
    $n = $n->sumover->sumover - $x;
    $ct += time - $t_;

    # Calculate the next generation.
    $x = ((($n == 2) + ($n == 3))* $x) + (($n==3) * !$x);
}
printf "Tutorial: %0.3f s (core time: %0.3f)\n",
    time - $t, $ct;

# "Lags"
my $m = $backup-> copy;
$t = time;
$ct = 0;
for ( 1 .. STEPS ) {
    my $t_ = time;
    # Calculate the number of neighbours per cell.
    my $n = sms_GoL_lags( $m ) - $m;
    $ct += time - $t_;

    # Calculate the next generation.
    $m = ((($n == 2) + ($n == 3))* $m) + (($n == 3) * !$m);
}
printf "\"lags\":   %0.3f s (core time: %0.3f)\n",
    time - $t, $ct;
    
die unless eq_pdl( $x, $m );

sub _do_dimension_GoL ( $m ) {
    $m-> slice( -1 )-> glue( 0, $m, $m-> slice( 0 ))
      -> lags( 0, 1, ( $m-> dims )[0] )
      -> sumover
      -> slice( '', '-1:0' )
      -> xchg( 0, 1 )
}

sub sms_GoL_lags ( $m ) {
    _do_dimension_GoL
    _do_dimension_GoL $m 
}

__END__

Game of Life!
Matrix: PDL: Double D [200,200], 100 generations
Tutorial: 1.016 s (core time: 0.835)
"lags":   0.283 s (core time: 0.108)
[download]

Sorry about crude profiling/tests; and improvement is somewhat far from what I expected. Even with "core time" singled out -- because next gen calculation is not very efficient (e.g. $n == 3 array is built twice), but that's another story -- which is "only" 8x better. Maybe all this glueing/appending to maintain constant matrix size and "wrap around" at edges takes its toll.

[reply]
[d/l]
[select]

Hi Steve (assuming it's you),

I haven't spent any effort on optimizing the solution. Thank you for running the tests! It is interesting to see that ->clump(2)->sumover doesn't perform as good as ->sumover->sumover. Actually I didn't look up the example in PDL::Broadcasting but wrote my solution from memory. Looks like I'm getting older :-)

Greetings,
-jo

$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

[reply]
[d/l]
[select]

Here is last (hopefully) instalment in this thread; it only concerns faster, compared to the PDL CPAN tutorial, "Game of Life" implementation. But maybe some features used are too advanced for a tutorial. Improvement comes from:

To achieve wrap around, the ugly glueing in parent node is replaced with dicing along each axis. Very convenient function.
Horrible expression to compute next generation (requires 8 operations on whole arrays) is replaced with shorter version (just 4).

use strict;
use warnings;
use feature 'say';
use Time::HiRes 'time';

use PDL;
use PDL::NiceSlice;
use Test::PDL 'eq_pdl';

use constant {
    WIDTH  => 20,
    HEIGHT => 20,
    STEPS  => 1000,
};
my $x = zeroes long, WIDTH, HEIGHT;

# Put in a simple glider.
$x(1:3,1:3) .= pdl ( [1,1,1],
                     [0,0,1],
                     [0,1,0] );
my $backup = $x-> copy;

printf "Game of Life!\nMatrix: %s, %d generations\n", 
    $x-> info, STEPS;

# Tutorial
my $t = time;
for ( 1 .. STEPS ) {
    my $t_ = time;
    # Calculate the number of neighbours per cell.
    my $n = $x->range(ndcoords($x)-1,3,"periodic")->reorder(2,3,0,1);
    $n = $n->sumover->sumover - $x;

    # Calculate the next generation.
    $x = ((($n == 2) + ($n == 3))* $x) + (($n == 3) * !$x);
}
printf "Tutorial: %0.3f s\n", time - $t;

# Faster
my $m = $backup-> copy;
$t = time;
my $wrap_w = pdl [ reverse WIDTH - 1, ( 0 .. WIDTH - 1 ), 0 ];
my $wrap_h = pdl [ reverse HEIGHT - 1, ( 0 .. HEIGHT - 1 ), 0 ];
for ( 1 .. STEPS ) {
    my $n = $m
        -> dice_axis( 0, $wrap_w )
        -> lags( 0, 1, WIDTH )
        -> sumover
        -> dice_axis( 1, $wrap_h )
        -> lags( 1, 1, HEIGHT )
        -> xchg( 0, 1 )
        -> sumover;
    $n -= $m;
    $m = ( $n == 3 ) | $m & ( $n == 2 )
}
printf "Faster:   %0.3f s\n", time - $t;

die unless eq_pdl( $x, $m );

__END__

Game of Life!
Matrix: PDL: Long D [20,20], 1000 generations
Tutorial: 0.341 s
Faster:   0.111 s

Matrix: PDL: Long D [200,200], 100 generations
Tutorial: 0.845 s
Faster:   0.086 s

Matrix: PDL: Long D [1000,1000], 20 generations
Tutorial: 4.422 s
Faster:   0.443 s

Matrix: PDL: Long D [2000,2000], 10 generations
Tutorial: 8.878 s
Faster:   0.872 s
[download]

[reply]
[d/l]

Oops, sorry, one more: subtraction is not required of course, and, more importantly, next gen calculation expression can be split in 2 lines -- same 4 ops but presumably done in-place i.e. faster:

    $m &= $n == 4;
    $m |= $n == 3;

__END__

Game of Life!
Matrix: PDL: Long D [2000,2000], 10 generations
Faster+:   0.677 s
[download]

[reply]
[d/l]