knirirr has asked for the wisdom of the Perl Monks concerning the following question:

I am currently a bit stuck on the following problem. I have generated a load of csv tables looking rather like this:
units,1,2,3,4,5,6
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,1,1,0,0
4,0,4,0,0,0,0
5,0,3,0,0,0,0
6,0,0,0,0,0,0
7,0,0,0,0,0,0
8,0,0,0,0,0,0
9,1,0,0,0,0,0
The x-axis (1-6) is "type", and the y-axis (1-9) is a count, the value in the table being the number of things of that type and length (e.g. there is one thing of length 9 and type 1). I need to obtain the mean and standard deviation for position in the table, over all the tables (e.g. what is the mean number/SD of things of type 1 and length 9 over many tables?). Something involving matrices seems like the way to go, such as described here. However, I've no idea how to perform the calculations and return a table containing the mean and and SD values once I've got the data into a matrix (and that's hard enough ;). Can anyone offer a suggestion?
  • Comment on Mean and standard deviation amongst multiple matrices

Replies are listed 'Best First'.
Re: Mean and standard deviation amongst multiple matrices
by conrad (Beadle) on Dec 06, 2004 at 14:48 UTC

    How about this? I'm not going into details, but basically if you represent your matrices as two-dimensional arrays (here I'm actually using array references), it's pretty straightforward to do the job.

    I've coded standard deviation from memory so you might do well to check up the formula elsewhere - my memory is that you divide by (n - 1) rather than n (and here, "@matrices" evaluated in a scalar context will be n - the number of tables you're performing statistics over).

    I'm loading the tables from the __DATA__ segment; should be obvious how to modify it to use stdin, a collection of files, or whatever. Have assumed that all of your tables will say 1..9 and 1..6 in their row/column labels.

    There's loads of stuff to say on numerical accuracy when calculating means, but if you're using Perl to do the job I assume you don't care too much - the accuracy problems arise on pretty odd data sets anyway and aren't likely to be a problem.

    To understand all of the [ foo ] and $x->[][] stuff, I recommend that you read the Perl documentation on references.

    Hope this helps.

    #!/usr/bin/perl -w use strict; # Convert your data format to a 2-d array sub to_matrix($) { [ map [ split /,/ ], split /\n/, shift ] } # Convert a 2-d array to your data format sub from_matrix($) { join("\n", map join(",", @$_), @{$_[0]}) . "\n"; } # Read in matrices $/ = "\n\n"; my @matrices = map to_matrix($_), <DATA>; # Compute mean/std.dev my $mean = [ [ "mean" ] ]; my $standard_deviation = [ [ "standard deviation" ] ]; $mean->[0][$_] = $standard_deviation->[0][$_] = $_ for 1 .. 6; $mean->[$_][0] = $standard_deviation->[$_][0] = $_ for 1 .. 9; for my $row (1 .. 9) { for my $column (1 .. 6) { my $sum = 0; my $sum_of_squares = 0; for(@matrices) { my $datum = $_->[$row][$column]; $sum += $datum; $sum_of_squares += $datum * $datum; } $mean->[$row][$column] = $sum / @matrices; $standard_deviation->[$row][$column] = sqrt(($sum_of_squares - $sum * $sum / @matrices) / (@matrices - 1)); } } print from_matrix($mean), "\n", from_matrix($standard_deviation); __DATA__ units,1,2,3,4,5,6 1,5,0,0,0,0,0 2,0,0,0,0,0,0 3,0,0,1,1,0,0 4,0,4,0,0,0,0 5,0,3,0,0,0,0 6,0,0,0,0,0,0 7,0,0,0,0,0,0 8,0,0,0,0,0,0 9,1,0,0,0,0,0 blah,1,2,3,4,5,6 1,6,0,0,0,0,0 2,0,0,0,0,0,0 3,0,0,1,1,0,0 4,0,4,0,0,0,0 5,0,3,0,0,0,0 6,0,0,0,0,0,0 7,0,0,0,0,0,0 8,0,0,0,0,0,0 9,1,0,0,0,0,3 foo,1,2,3,4,5,6 1,7,0,0,0,0,0 2,0,0,0,0,0,0 3,0,0,1,1,0,0 4,0,4,0,0,0,0 5,0,3,0,0,0,0 6,0,0,0,0,0,0 7,0,0,0,0,0,0 8,0,0,0,0,0,0 9,1,0,0,0,0,2
      That works very nicely, thanks!
Re: Mean and standard deviation amongst multiple matrices
by zejames (Hermit) on Dec 06, 2004 at 11:51 UTC

    Although I'm not a statistic expert, I think you should have a look a PDL module, that provide very useful Perl interface to vector and matrix operations.

    HTH


    --
    zejames
      There is also (now) PDL::Stats by the mighty Maggie X.
Re: Mean and standard deviation amongst multiple matrices
by zentara (Cardinal) on Dec 06, 2004 at 12:57 UTC
    Finding statistics is not a simple task, because if it's not done correctly, it is erroneous.

    Anyways, I went to cpan, and searched for "statistics', and Statistics-Basic looks like a possibility for you. Matrices can be looked at like a collection of vectors, so you may be able to use this as an approach.

    Anyways, your question is a big one, and I wouldn't expect a simple solution where you just drop in your matrices, and have correct answers pop out.


    I'm not really a human, but I play one on earth. flash japh
Re: Mean and standard deviation amongst multiple matrices
by punch_card_don (Curate) on Dec 06, 2004 at 16:53 UTC
    Quick and dirty code illustrating the principle, undoubtedly fraught with syntax errors and incorrectly offset list indexes;
    @tables = ('csv_1', 'csv_2', ..., 'csv_n'); # or a directorylist to get all csv files in your directory $num_types = 6; $num_lengths = 9; for $table (0 .. $#tables) { #parse file into 2-d list open(FILE, $tables[$table]); while($line = <FILE>) { @line_parts = split(/delimiter/, $line); push @this_table, [ @line_parts ]; } close(FILE); #keep a running total of counts for each position for $length (1 .. $num_lengths) { for $type (1 .. $num_types) { $totals[$length][$type] = $totals[$length][$type] + $th +is_table[$length-1][$type-1]; $totals_squared[$length][$type] = $totals_squared[$leng +th][$type] + ($this_table[$length-1][$type-1]^2); } } } #you now have everything you need to calcluate mena and sd for every p +osition for $i (0 .. $#totals) { for $j (0 .. $#{$totals[$i]}) { #check syntax! $mean = $totals[$i][$j]/$#tables; $sd = sqrt( ($totals_squared[$i][$j] -($totals[$i][$j]^s/$#t +ables))/ ($#tables-1) ) #note that this considers the data as a 'sample'. #If it is to be considered a 'population', then replace "$#t +ables-1" with $#tables } }
    For info on the raw score method of calculating sd, see
    http://www.mnstate.edu/wasson/ed602calcsdraws.htm
    or
    http://www.med.umkc.edu/tlwbiostats/variability.html
      I see I type too slowly.....
Re: Mean and standard deviation amongst multiple matrices
by johnnywang (Priest) on Dec 06, 2004 at 18:42 UTC
    Just want to point out that since you're only computing the mean/SD for the same position (e.g., length 9, type 1), you don't need to put them in a matrix, a simple array will be good enough.
      Are you suggesting something like loading the first line of every table into an array, and performing a calculation upon that, then the next line, &c.? I might have a go at that.
      Conrad's method seems to be working well enough for now. To determine accuracy I printed out a few select cells thus:
      # @names is a list of file names, and I've created a matrix # named after each file... my $x = 3; my $y = 3; my @one; print "one <- data.frame(a=c("; foreach my $name (@names) { push (@one, @{$name}[$x]->[$y]); } print join(",", @one), "))\n";
      This provides a data frame one can stick into R and run mean(one) or sd(one), which for the examples I tested showed that conrad's results were accurate enough for my purposes. I must also compare punch_card_don's when I get an opportunity.