An efficient, scalable matrix transformation algorithm

Luftkissenboot has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, Monks. I hope you all had a very pleasant Kwanzaa. I am seeking your advice on the most (or at least, a very) efficient method to perform the following transformation:

Given a matrix of rank 2, perform individual consolidation functions on each column, resulting in one "row" of data, with each field having been transformed by its corresponding function.

F1 F2 F3 F4 F5 F6 :

a1 b1 c1 d1 e1 g1    ->   A1 B1 C1 D1 E1 G1
a2 b2 c2 d2 e2 g2
a3 b3 c3 d3 e3 g3
a4 b4 c4 d4 e4 g4
a5 b5 c5 d5 e5 g5
a6 b6 c6 d6 e6 g6
a7 b7 c7 d7 e7 g7
a8 b8 c8 d8 e8 g8

Parameters: The source data always has the same number of columns, and the same reduction functions are always applied to the columns. The data is supplied in "chunks", at which point the consolidation functions need to be applied to that "chunk" of data and output one "record." Examples of consolidation functions are min(), max(), avg(), sum(), etc...

(This is not homework, it's just a really potentially computationally complex problem, that has nearly an infinite number of Bad™ solutions, and I'd like to do the Right Thing™ and perhaps learn and share something new in the process).

Update: Code sample in this reply: Re^2: An efficient, scalable matrix transformation algorithm

My current, already-implemented and workable solution follows roughly the following steps:

I have an arrayref map that maps each element to a coderef for the appropriate function for that column.
Transpose the matrix (a slow process in and of itself).
Supply the individual rows (former columns) to the appropriate function and get the return value for that function. (for speed, where possible, I use List::Util, since it's written in XS).
Collect the individual return values and rebuild them into the resulting "record", returning this.

It works "fair-to-middling", but the matrix transposition step itself adds a significant overhead (I used a decent algorithm, though admittedly, it's in PurePerl, as I couldn't find an XS implementation).

The issue is that this is an O(n*m) solution, and as it should be well-covered/considered territory, as I cannot be the first person to ever have to perform this type of transformation in the history of computing/mathematics/stream/signal processing, etc., I can't help thinking that there must be a more efficient solution.

NOTE: Technically, the transposition step, while adding significant overhead, may actually result in an aggregate time reduction, as some of the reduction functions do not operate over the full set of supplied data, and may return immediately or after partial processing. So, it ends up being not quite as bad as a full O(n*m). If it were not transposed, this would not be the case.

I thus humbly throw myself upon the mercy of the monks to discuss possible alternative solutions to this problem. TBH, it's the algorithm that I am mostly interested in, as a brute-force solution will always do the speedup, but within the algorithm lies the beauty.

------

FYI, here is a list of some other things I have considered, tried, or looked into:

PDL, the Perl Data Language. Specifically, the functionality provided by the PDL::Reduce module.
- PDL is speedy! However, the PDL::Reduce functions appear to work by reducing the rank of the matrix by the number of dimensions specified, by applying the same reduce() function across ALL columns. I need them able to be individually specified.
Offloading it to the GPU with some sort of a GPGPU implementation. GPUs offer massively scalable stream-processing function across multiple data streams simultaneously. If I were to treat the columns as "streams"
- Still appears to require the matrix transposition step to feed the "streams" into the GPU.
- Shares the same drawback as the PDL reduce() solution - without lots of legwork, the same vector function appears to be applied across all columns ("streams").
RRD, the Round-robin Database Tool, which is sort of geared for this type of data reduction with its Consolidation Function interface.
- It doesn't have enough consolidation functions, and after digging into the source code looking into adding new ones, it is FAR from a trivial task.
Setting up individual processes that perform the operations as optimally as possible and return their results to the calling application.
- Still requires matrix transposition step.
- IPC overhead may obviate any speed gains.

(You can see from these that that I am more an engineer than a mathematician, but I am not stupid and willing to learn.)

Comment on An efficient, scalable matrix transformation algorithm Download Code

Replies are listed 'Best First'.
Re: An efficient, scalable matrix transformation algorithm by etj (Priest) on Jun 03, 2022 at 22:38 UTC
My approach to this would be to make an input ndarray out of your 2D input ("rank 2 matrix"). Then, given that loop-fusion is not yet a reality in PDL (see https://github.com/PDLPorters/pdl/issues/349), I'd use Inline::Pdlpp to process one row at a time with a custom function; see https://github.com/Fourmilab/floating_point_benchmarks/pull/1/files for inspiration on this. An alternative approach would be to use MCE, and in one "core" each, use a PDL builtin like `avgover` etc to do one of the various kinds of processing. The results could be melded together as the final step.	[reply] [d/l]
Re: An efficient, scalable matrix transformation algorithm by dHarry (Abbot) on Jan 28, 2009 at 10:30 UTC
I’m not sure if I understand what you are trying to do, i.e. it makes no sense to me. I don’t understand your reduction functions, see Reducing a matrix for how to reduce a matrix. On the subject of transposing matrices books have been written. There are many algorithms around, also parallel algorithms exist. But why do you have to transpose the matrix? What problem are you trying to solve, how do you end up with the matrix in the first place? Update Wiki is an excellent starter, it sort of reanimated my knowledge on the subject.	[reply]
Re: An efficient, scalable matrix transformation algorithm by gone2015 (Deacon) on Jan 28, 2009 at 10:16 UTC
From your description what you need to do is: A1 = F1(a1, a2, a3, ...., an) B1 = F2(b1, b2, b3, ...., bn) C1 = F3(c1, c2, c3, ...., cn) on the face of it, that doesn't appear to be O(n^2). You transpose the original matrix so that (a1, a2, ...) etc. are Perl arrays, presumably for the convenience of F1 etc. -- which seems reasonable, though the work to pull a column into a row seems about the same as processing the column directly (unless elements are accessed more than once by the respective Fn. I assume the issue here is scale. But I regret this humble engineer is struggling to understand where to look for improvement. (Assuming a multi-core processor I can see that processing columns in parallel makes sense -- especially if the original matrix were in shared memory !) You don't say where the matrix comes from... I wonder if it's possible to transpose it as it's read in ? So... can you post some code ? Preferrably reduced to show the key logic... and give some idea of the true scale ? Of course, the general advice is to profile what you have, to identify where the time is really being spent.	[reply]
Re^2: An efficient, scalable matrix transformation algorithm by Luftkissenboot (Novice) on Jan 28, 2009 at 11:58 UTC
Ah, sorry, you're right. It would be O(n*m) (where m is the number of reduction functions, and n is the number of records). Indeed, the issue is scale. The original data comes in by sampling various parameters of numerous biological electronic sampling devices over time, and so each record keeps a timestamp. The "a" column would, in this case, be a timestamp, and the others would be, say, averages, maxes, minimums for that time window. This needs to be done for a huge number of devices in parallel, or at least in a linear fashion that won't introduce delays, as this has to happen inline with the sampling before it gets archived. The reason for the transformation is for the purpose of re-scaling the granularity of the data. A good example would be: time+microseconds AVGTEMP MAXTEMP MINTEMP STARTTEMP ENDTEMP In order to rescale this type of data to a larger time granularity (say, 5 second chunks, 1 minute chunks, etc.), you need to perform the different functions on each column to make them reflect the proper averages, maximums, minimums, etc. for the new time window. (I forgot to mention that I also considered RRD for this, but it doesn't have enough consolidation functions included, and adding new ones is far from trivial. Will update the original node). Ok, that's all verbose. Here's a code sample, as simple as I can make it and still have it work: Read more... (2 kB)	[reply] [d/l]
Re^3: An efficient, scalable matrix transformation algorithm by dHarry (Abbot) on Jan 28, 2009 at 13:42 UTC
I am still not convinced you have to transpose the matrix but maybe I am having an off-day, i.e. to blind to see it;-) When I read to your post the first thing that jumps into mind is the use of a database., especially since your "reduce functions" are simple. However, assuming you insist on transposing, you might be able to speed things up. The case N != M is more difficult then N = M. The approach you take right now is the straightforward approach but this can give poor performance (as you have probably noticed). In general: depending on circumstances, things like how much storage you have available and more importantly how M relates to N (\|N-M\| small or gcd(N,M) is not small) you can improve by using a more refined algorithm. (This I have taken from the Wiki, already suggested in my earlier response.) You can find pointers there to articles on the subject of in-place matrix transposition. I don’t know of any Perl implementations but the pointers on the Wiki also contain source code so you should be able to rewrite that into Perl if needed. HTH, dHarry	[reply]
Re^4: An efficient, scalable matrix transformation algorithm by Luftkissenboot (Novice) on Jan 28, 2009 at 14:05 UTC
Re^5: An efficient, scalable matrix transformation algorithm by dHarry (Abbot) on Jan 28, 2009 at 14:46 UTC
Re^3: An efficient, scalable matrix transformation algorithm by gone2015 (Deacon) on Jan 28, 2009 at 14:20 UTC
I used Devel::Profile and found that 84.74% of the running time is in the `main::transpose` function. I don't know how the data in `@records` actually arrives there, but I would consider capturing it in an array of columns rather than an array of rows -- eliminating the transpose operation completely. I tried replacing the `downsample()` function and the functions themselves, to operate directly on the columns. (The code for that is included below.) In order to get some timings I expanded the sample `@records` by the simple expedient of `@records = (@records) x 100_000`. On my machine, operating directly on the columns downsampled the records in 1.1s, where your code took 12.8s. Mind you, I doubt whether you `downsample` this many samples at once, so these timings will overstate the impact of improving the inside of the `downsample` function. Finally, depending on how the data is input, I would consider combining the downsampling with the input -- avoiding construction of the `@records` array altogether... Read more... See the code (1463 Bytes)	[reply] [d/l] [select]
Re^4: An efficient, scalable matrix transformation algorithm by gone2015 (Deacon) on Jan 28, 2009 at 18:05 UTC
Re^5: An efficient, scalable matrix transformation algorithm by Luftkissenboot (Novice) on Jan 29, 2009 at 07:37 UTC
Re^4: An efficient, scalable matrix transformation algorithm by Luftkissenboot (Novice) on Jan 28, 2009 at 15:21 UTC
Re: An efficient, scalable matrix transformation algorithm by roboticus (Chancellor) on Jan 28, 2009 at 18:34 UTC
Luftkissenboot: If you can make all your consolidation functions able to work incrementally, you could process your data one row at a time. Even if you can't make the function work incrementally, you might be able to transform it such that you can use partial results and combine them at the end (last line of example before __DATA__ statement). Something like this: #!/usr/bin/perl -w # # Sploink.pl # use strict; use warnings; sub make_summer { my $sum = 0; return sub { $sum += $_[0]; } } sub make_minner { my $min = 9.99e99; return sub { $min = $_[0] if $_[0] < $min; return $min; } } sub make_avger { my ($cnt, $sum) = (0, 0); return sub { $sum += $_[0]; ++$cnt; return $sum/$cnt; } } sub make_counter { my $cnt = 0; return sub { ++$cnt } } my @funcs = ( make_summer(), make_minner(), make_avger(), make_avger(), make_counter(), ); my @results; while (<DATA>) { chomp; my @flds = split /\s+/; $results[$_] = $funcs[$_]($flds[$_]) for (0 .. $#funcs); } print "results: ", join(", ", @results), "\n"; print "avg of column 1 is: ", $results[0] / $results[4], "\n"; __DATA__ 10 20 30 40 12 14 16 18 9 10 11 12 13 14 15 16 [download] Which gives the following: `roboticus@swill: /Work/Perl/PerlMonks $ ./sploink_739466.pl results: 44, 10, 18, 21.5, 4 avg of column 1 is: 11` [download] ...roboticus UPDATE: Added program output	[reply] [d/l] [select]
Re^2: An efficient, scalable matrix transformation algorithm by Luftkissenboot (Novice) on Jan 29, 2009 at 07:30 UTC
Actually, that is very feasible. Now that I've dug further into it from previous conversations also, it would lend itself to it, and it would be a major memory saver, to boot. So many good suggestions. Thank you!	[reply]
Re: An efficient, scalable matrix transformation algorithm by Bloodnok (Vicar) on Jan 28, 2009 at 09:33 UTC
Hmmm, thinx: If there are ... an infinite number of Bad solutions ..., is this an attempt to eliminate a number of them in the hope that, to paraphrase Sherlock Holmes, ...whatever remains, however improbable, must point to the answer ? :D A user level that continues to overstate my experience :-))	[reply]
Re^2: An efficient, scalable matrix transformation algorithm by Luftkissenboot (Novice) on Jan 28, 2009 at 09:57 UTC
Well, that would certainly be effective. But I was really hoping for someone more well-versed than I in linear algebra or parallelism to share a flash of inspiration. It's either that, or go with the Seven-Percent Solution, which would probably make it difficult to analyse suggested solutions properly. ;)	[reply]