in reply to Re^4: An efficient, scalable matrix transformation algorithm
in thread An efficient, scalable matrix transformation algorithm

It was the following passage that prompted my suggestion (again from the Wiki):

On a computer, one can often avoid explicitly transposing a matrix in memory by simply accessing the same data in a different order...

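Because your reduction functions are simple (unlike, say, a Fourier transform), I thought you might get away with this: each one only needs to visit the elements of a column once, so it can read them in place instead of working on a transposed copy. You could also consider changing your data structure.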
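
To illustrate what I mean by accessing the same data in a different order, here is a minimal sketch in Python. It is not your code: the sample matrix and the helper name reduce_columns are made up for the example, and I am assuming the data sits in memory as a list of equal-length rows. The reduction walks each column by swapping the index order, so no transposed copy is ever built.

```python
# Hypothetical sample data: a 2x3 matrix stored as a list of rows.
matrix = [
    [1, 2, 3],
    [4, 5, 6],
]


def reduce_columns(rows, func):
    """Apply func to every column of rows without transposing.

    Assumes all rows have the same length. The name and signature
    are illustrative only, not taken from the original thread.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    # Outer loop over columns, inner loop over rows: the data is read
    # in "transposed" order, but it stays where it is in memory.
    return [
        func(rows[r][c] for r in range(n_rows))
        for c in range(n_cols)
    ]


column_sums = reduce_columns(matrix, sum)
column_maxes = reduce_columns(matrix, max)
print(column_sums)   # [5, 7, 9]
print(column_maxes)  # [4, 5, 6]
```

This only works as written for reductions that can consume the elements one at a time (sum, max, min and so on). Whether it is actually faster than transposing first depends on the matrix size and memory layout, so it would need to be measured on your data.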