in reply to An efficient, scalable matrix transformation algorithm

My approach would be to turn your 2D input (a "rank 2 matrix") into a PDL ndarray. Then, since loop fusion is not yet a reality in PDL (see https://github.com/PDLPorters/pdl/issues/349), I'd use Inline::Pdlpp to write a custom function that processes one row at a time and computes everything you need from that row in a single pass; see https://github.com/Fourmilab/floating_point_benchmarks/pull/1/files for inspiration.
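To make that concrete, here is a minimal sketch of what I mean, not a tuned implementation. The function name row_stats and the choice of statistics (mean and maximum) are my own placeholders, since I don't know exactly which transformations you need. It assumes a double-typed input with non-empty rows, and a working C compiler for Inline.

```perl
use strict;
use warnings;
use PDL;
use Inline Pdlpp => <<'EOPP';
# row_stats is a hypothetical name. One pass over each row yields two results,
# which is the effect loop fusion would otherwise give you.
pp_def('row_stats',
    Pars => 'a(n); [o]mean(); [o]maxv();',
    Code => q{
        double sum = 0;
        $GENERIC() mx = $a(n=>0);   /* assumes each row has at least one element */
        loop(n) %{
            sum += $a();
            if ($a() > mx) mx = $a();
        %}
        $mean() = sum / $SIZE(n);
        $maxv() = mx;
    },
);
EOPP

# dims are (3,2): three columns, two rows
my $m = pdl([[1, 2, 3], [4, 5, 6]]);

# PDL broadcasts row_stats over the second dimension, so each row is handled once
my ($mean, $max) = $m->row_stats;
print "mean per row: $mean\n";
print "max per row:  $max\n";
```

Because the signature only names the first dimension (n), the same function works unchanged on higher-dimensional input. Note that with an integer-typed input the mean output would be truncated, so convert to double first if that matters.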

An alternative approach would be to use MCE: give each worker (one per "core") one kind of processing to do with a PDL builtin such as avgover, then combine the per-worker results as the final step.
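A rough sketch of the MCE idea follows. The three reductions (avgover, sumover, maximum) and the %reduce dispatch table are just illustrative choices of mine. Results are passed back as plain Perl arrays via unpdl, on the assumption that this is the simplest thing for MCE to serialize between processes, and the workers are assumed to see the parent's $m (true when MCE forks).

```perl
use strict;
use warnings;
use PDL;
use MCE::Map chunk_size => 1, max_workers => 3;

# dims are (3,2): three columns, two rows
my $m = pdl([[1, 2, 3], [4, 5, 6]]);

# Hypothetical dispatch table: one kind of processing per worker
my %reduce = (
    avg => sub { $_[0]->avgover },
    sum => sub { $_[0]->sumover },
    max => sub { $_[0]->maximum },
);

# Each worker runs one reduction and returns [ name, plain Perl array ]
my @pairs = mce_map {
    my $name = $_;
    [ $name, $reduce{$name}->($m)->unpdl ];
} sort keys %reduce;

MCE::Map->finish;

# Final step: combine the per-worker results back into ndarrays
my %result = map { $_->[0] => pdl($_->[1]) } @pairs;
print "$_: $result{$_}\n" for sort keys %result;
```

For a small matrix like this the process overhead will dwarf the work, so this only pays off when each reduction is expensive. For very large data you would probably also want to avoid copying results through unpdl.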