in reply to Re: An efficient, scalable matrix transformation algorithm
in thread An efficient, scalable matrix transformation algorithm

Ah, sorry, you're right. It would be O(n*m) (where m is the number of reduction functions, and n is the number of records).

Indeed, the issue is scale. The original data comes from sampling various parameters of numerous biological electronic sampling devices over time, so each record carries a timestamp. The "a" column would, in this case, be a timestamp, and the others would be, say, the average, maximum, and minimum for that time window. This needs to be done for a huge number of devices in parallel, or at least in a linear fashion that won't introduce delays, as it has to happen inline with the sampling, before the data gets archived.

The reason for the transformation is to re-scale the granularity of the data.

A good example would be:

time+microseconds    AVGTEMP   MAXTEMP   MINTEMP   STARTTEMP   ENDTEMP

In order to rescale this type of data to a coarser time granularity (say, 5-second chunks, 1-minute chunks, etc.), you need to apply a different function to each column so that each one reflects the proper average, maximum, minimum, etc. for the new, larger time window.

(I forgot to mention that I also considered RRD for this, but it doesn't have enough consolidation functions included, and adding new ones is far from trivial. Will update the original node).

OK, that's all rather verbose. Here's a code sample, as simple as I can make it and still have it work:

#!/usr/bin/perl
use strict;
use warnings;

use List::Util qw( min max sum );
use Data::Dumper;

my @funcs = (
    undef,    # first_timestamp (column 0 is handled separately)
    \&avg,
    \&max,
    \&min,
    \&first,
    \&last,
);

my @records = (
    # Timestamp.usecs      AVGTEMP MAXTEMP MINTEMP STARTTEMP ENDTEMP
    [ 1234567891.123456,   36.5,   36.9,   36.2,   36.4,     36.6 ],
    [ 1234567891.654321,   36.6,   40.1,   36.2,   36.6,     36.8 ],
    [ 1234567893.123456,   36.3,   38.8,   35.1,   36.8,     37.3 ],
    [ 1234567893.654321,   36.2,   36.9,   36.2,   37.3,     37.1 ],
    [ 1234567894.123456,   36.8,   37.3,   36.2,   37.1,     37.4 ],
);

# Main
print "     Timestamp       AVG   MAX   MIN  START  END\n";
print( '[ ', join( ', ', @{ $_ } ), " ]\n" ) for ( @records );
print "\n";

my @output = downsample( 5, \@records );

print "     Timestamp       AVG   MAX   MIN  START  END\n";
print '[ ', join( ', ', @output ), " ]\n";

# Subs
sub downsample {
    my( $resolution, $data ) = @_;

    my $newdata = transpose( $data );

    my $first_timestamp
        = first_timestamp_for_resolution( $resolution, $newdata->[0][0] );

    my @output;
    push( @output, $first_timestamp );
    push( @output, $funcs[$_]->( @{ $newdata->[$_] } ) ) for ( 1 .. $#funcs );

    return( @output );
}

sub first_timestamp_for_resolution {
    my( $resolution, $value ) = @_;    # $resolution in seconds
    return( int( $value - ( $value % $resolution ) ) );
}

# from Math::Matrix
sub transpose {
    my( $matrix ) = shift;
    my( $m, @result );

    for my $col ( @{ $matrix->[0] } ) {
        push( @result, [] );
    }
    for my $row ( @{ $matrix } ) {
        $m = 0;
        for my $col ( @{ $row } ) {
            push( @{ $result[$m++] }, $col );
        }
    }
    return( \@result );
}

# These are only ever called via the coderefs in @funcs, so 'last' does not
# collide with the loop-control builtin of the same name.
sub first { return( $_[0]  ); }
sub last  { return( $_[-1] ); }
sub avg   { return( sum( @_ ) / scalar( @_ ) ); }

Re^3: An efficient, scalable matrix transformation algorithm
by dHarry (Abbot) on Jan 28, 2009 at 13:42 UTC

    I am still not convinced you have to transpose the matrix, but maybe I am having an off-day, i.e. too blind to see it ;-) When I read your post, the first thing that jumps to mind is using a database, especially since your "reduce functions" are simple.

    However, assuming you insist on transposing, you might be able to speed things up. The case N != M is more difficult than N = M. The approach you take right now is the straightforward one, but it can give poor performance (as you have probably noticed).

    In general: depending on circumstances (things like how much storage you have available and, more importantly, how M relates to N, e.g. whether |N-M| is small or gcd(N,M) is not small), you can do better with a more refined algorithm. (This I have taken from the Wiki, as already suggested in my earlier response.) You can find pointers there to articles on the subject of in-place matrix transposition. I don't know of any Perl implementations, but the pointers on the Wiki also lead to source code, so you should be able to rewrite that into Perl if needed.
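    To give the flavor of those algorithms, here is a rough sketch of the cycle-following idea for a flat, row-major M x N array. Treat it as a starting point, not a tested implementation; the %done hash is simple bookkeeping that the published in-place algorithms work hard to avoid:

    # Sketch: in-place transpose of an M x N matrix stored row-major in a
    # flat array. The element at index $i belongs at ($i * $m) mod ($m*$n - 1);
    # indices 0 and $m*$n - 1 stay put. Each permutation cycle is followed,
    # carrying one displaced value along in $tmp.
    sub transpose_in_place {
        my ( $flat, $m, $n ) = @_;    # $m rows, $n columns
        my $size = $m * $n;
        my %done;
        for my $start ( 1 .. $size - 2 ) {
            next if $done{$start};
            my $i   = $start;
            my $tmp = $flat->[$start];
            do {
                my $j = ( $i * $m ) % ( $size - 1 );   # where $flat->[$i] belongs
                ( $tmp, $flat->[$j] ) = ( $flat->[$j], $tmp );
                $done{$j} = 1;
                $i = $j;
            } while ( $i != $start );
        }
        return $flat;    # now an N x M matrix, still row-major
    }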

    HTH, dHarry

      Well, this source is the degenerate case where the objects holding the records are turned into simple arrayrefs, to illustrate the algorithm. I'm willing and able to refactor, though, of course.

      The transposition happens because the reductions need to be applied to columns, and the functions sum(), avg(), etc. operate on lists. The only alternative to the full transposition I can see would be a column splice -> list/array, which essentially results in the same thing.

      i.e. I don't see a way to avoid the transposition, one way or another ... ? Maybe it's me that's missing something?

        It was the following text that triggered me (again from the Wiki):

        On a computer, one can often avoid explicitly transposing a matrix in memory by simply accessing the same data in a different order...

        Because your reduction functions are simple (not like some Fourier transform), I thought you might get away with this. You could also consider changing your data structure.
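        As a sketch of the idea, reusing your @funcs and first_timestamp_for_resolution(), something like this applies each reduction straight down a column of the row-oriented data, so no transposed copy is ever built:

        sub downsample_by_column {
            my ( $resolution, $data, $funcs ) = @_;
            my @output
                = ( first_timestamp_for_resolution( $resolution, $data->[0][0] ) );
            # pull column $col out of each row on the fly, no transpose()
            for my $col ( 1 .. $#{ $funcs } ) {
                push @output, $funcs->[$col]->( map { $_->[$col] } @{ $data } );
            }
            return @output;
        }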

Re^3: An efficient, scalable matrix transformation algorithm
by gone2015 (Deacon) on Jan 28, 2009 at 14:20 UTC

    I used Devel::Profile and found that 84.74% of the running time is in the main::transpose function.

    I don't know how the data in @records actually arrives there, but I would consider capturing it in an array of columns rather than an array of rows -- eliminating the transpose operation completely.
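    As a sketch, with a hypothetical add_sample() standing in for however the records actually arrive:

    # Sketch: accumulate incoming samples as an array of columns, so the
    # reductions can be applied to $columns[$i] directly and transpose()
    # disappears entirely.
    my @columns;    # $columns[0] = timestamps, $columns[1] = AVGTEMPs, ...
    sub add_sample {
        my ( @fields ) = @_;
        push @{ $columns[$_] }, $fields[$_] for 0 .. $#fields;
    }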

    I tried replacing the downsample() function and the functions themselves, to operate directly on the columns. (The code for that is included below.) In order to get some timings I expanded the sample @records by the simple expedient of @records = (@records) x 100_000. On my machine, operating directly on the columns downsampled the records in 1.1s, where your code took 12.8s.

    Mind you, I doubt whether you downsample this many samples at once, so these timings will overstate the impact of improving the inside of the downsample function.

    Finally, depending on how the data is input, I would consider combining the downsampling with the input -- avoiding construction of the @records array altogether...
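    A sketch of that idea, with a hypothetical read_record() standing in for the real input source:

    # Sketch: downsample while reading, so @records is never built. Running
    # aggregates are kept per window, and a row is emitted each time the
    # timestamp crosses a window boundary.
    my $res = 5;    # window size in seconds
    my ( $win, $n, $sum, $max, $min, $start, $end );
    while ( my $rec = read_record() ) {
        my ( $ts, $avg, $mx, $mn, $st, $en ) = @{ $rec };
        my $w = int( $ts - ( $ts % $res ) );
        if ( defined $win and $w != $win ) {    # window closed: emit it
            print join( ', ', $win, $sum / $n, $max, $min, $start, $end ), "\n";
            $n = 0;
        }
        unless ( $n ) {                         # first sample of a new window
            ( $win, $sum, $max, $min, $start ) = ( $w, 0, $mx, $mn, $st );
        }
        $n++;
        $sum += $avg;
        $max  = $mx if $mx > $max;
        $min  = $mn if $mn < $min;
        $end  = $en;
    }
    # flush the final, partially filled window
    print join( ', ', $win, $sum / $n, $max, $min, $start, $end ), "\n" if $n;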

      FWIW, this transpose runs ~3.5 times faster than the loop you had (on my machine, anyway). The moral here is that with Perl, the trick is to replace a for or while loop with map or grep where possible!

      sub transpose_gmch {
          my ($matrix) = shift;
          my @result   = ();
          my $lc = $#{ $matrix->[0] };    # last column index
          for my $col ( 0 .. $lc ) {
              push @result, [ map $_->[$col], @$matrix ];
          }
          return( \@result );
      }
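      (If you want to reproduce the comparison yourself, the core Benchmark module makes it easy; something along these lines, with both subs and @records in scope:)

      use Benchmark qw( cmpthese );
      # run each implementation for ~3 CPU-seconds and compare rates
      cmpthese( -3, {
          loop => sub { transpose( \@records ) },
          map  => sub { transpose_gmch( \@records ) },
      } );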

        Ah, too much good stuff!

        It seems that the transposition step is unnecessary for this algorithm anyway, but that is a much better implementation regardless, and I'll incorporate it for later use. Quite the performance jump that map() provides. Mmm, lispy goodness.

        It might also be worth passing it along via RT to the Math::Matrix maintainer as a performance patch, since that's where I took the original function from.

        Thank you again, kind and knowledgeable good sir monk.

      Wonderful!

      Indeed, this is what dHarry was getting at, too. Clearly the transpose() step is extraneous, but I guess I was too blinded by the way my object structure is currently set up to see a way to address the individual columns directly. (But that's my own issue, and it's clear now.)

      Thank you both!