korsmo has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am working with a large .csv file with abundance data. It is set up as follows:

Month,Zone,Replicate,SpeciesA,SpeciesB,SpeciesC,...
Sept,1,1,5,10,15
Sept,1,2,0,5,10
Sept,1,3,5,0,5
Sept,2,1,5,5,5
Sept,2,2,10,15,10
Sept,2,3,0,0,5

I need to find the mean number (and std dev) of each individual species across three replicates, for each zone, for each month. In all, there are 20 zones that were sampled monthly over the course of a year. I'm new to Perl. I would appreciate any tips that point me in the right direction to be able to generate descriptive statistics for each Month-Zone-Species combination. Thanks.

Replies are listed 'Best First'.
Re: Descriptive Stats from .csv file
by kevbot (Vicar) on Jan 30, 2014 at 06:10 UTC

    I have found the Data::Table to be a nice module for these types of tasks. It borrows some ideas from R that are helpful for handling tabular data (it has an implementation of the melt and cast functions inspired by the R reshape module). You can accomplish your task with the following code (assuming that your data is saved in a file named data.csv). The author of Data::Table has also made additional documentation/information available here: https://sites.google.com/site/easydatabase/

    Data in data.csv:

    Month,Zone,Replicate,SpeciesA,SpeciesB,SpeciesC Sept,1,1,5,10,15 Sept,1,2,0,5,10 Sept,1,3,5,0,5 Sept,2,1,5,5,5 Sept,2,2,10,15,10 Sept,2,3,0,0,5
    #!/usr/bin/env perl use strict; use warnings; use Data::Table; use Statistics::Lite qw(mean stddev); my $dt = Data::Table::fromCSV("data.csv"); print "Original Data Table\n"; print "===================\n"; print $dt->tsv; print "\n\n"; my $melt = $dt->melt(['Month', 'Zone', 'Replicate']); print "Melt Table\n"; print "==========\n"; print $melt->tsv; print "\n\n"; my $cast_mean = $melt->cast( ['Month', 'Zone'], 'variable', Data::Table::STRING, 'value', \&mean ); print "Cast (mean)\n"; print "===========\n"; print $cast_mean->tsv; print "\n\n"; my $cast_stddev = $melt->cast( ['Month', 'Zone'], 'variable', Data::Table::STRING, 'value', \&stddev ); print "Cast (stddev)\n"; print "=============\n"; print $cast_stddev->tsv; exit;
    This code will give the following output:
      Hi,
      That seemed to work well for obtaining the values I need - thanks! I have loaded the module Excel::Writer::XLSX to try to bring those values into a .xlsx file. Do you know if I need to put the cast_means and cast_stddev values into an array before I can write the .xlsx file?
      Thanks again,
      BK

        You can use the Data::Table::Excel module to put the contents of a Data::Table object into an Excel file. From the documentation,

        This perl package provide utility methods to convert between an Excel file and Data::Table objects. It then enables you to take advantage of the Data::Table methods to further manipulate the data and/or export it into other formats such as CSV/TSV/HTML, etc.

        For example, this code will create an xlsx file that contains both the mean and stddev tables:

        #!/usr/bin/env perl use strict; use warnings; use Data::Table; use Data::Table::Excel qw(tables2xlsx); use Statistics::Lite qw(mean stddev); my $dt = Data::Table::fromCSV("data.csv"); my $melt = $dt->melt(['Month', 'Zone', 'Replicate']); my $cast_mean = $melt->cast( ['Month', 'Zone'], 'variable', Data::Table::STRING, 'value', \&mean ); my $cast_stddev = $melt->cast( ['Month', 'Zone'], 'variable', Data::Table::STRING, 'value', \&stddev ); tables2xlsx("descriptive_stats.xlsx", [ $cast_mean, $cast_stddev ]); exit;
Re: Descriptive Stats from .csv file
by Anonymous Monk on Jan 30, 2014 at 03:26 UTC
Re: Descriptive Stats from .csv file
by talexb (Chancellor) on Jan 30, 2014 at 14:39 UTC

    I would load the data into a database (sqlite is a great choice) and have the database do the work.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Or work through DBD::CSV to use those CSV tables directly.

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
        Have you tried it?

      If you have postgresql on hand, then it's as easy as :

      create table data( month text, zone integer, replicate integer, speciesa integer, speciesb integer, speciesc integer); copy data from stdin with delimiter ','; sept,1,1,5,10,15 Sept,1,2,0,5,10 Sept,1,3,5,0,5 Sept,2,1,5,5,5 Sept,2,2,10,15,10 Sept,2,3,0,0,5 \. select month, zone, replicate, avg(speciesa), stddev(speciesa) from da +ta group by month,zone, replicate; month | zone | replicate | avg | stddev -------+------+-----------+------------------------+-------- Sept | 1 | 2 | 0.00000000000000000000 | Sept | 2 | 3 | 0.00000000000000000000 | Sept | 1 | 3 | 5.0000000000000000 | Sept | 2 | 2 | 10.0000000000000000 | sept | 1 | 1 | 5.0000000000000000 | Sept | 2 | 1 | 5.0000000000000000 | (6 lignes)


      libremen.com : legal case, contract and insurance claim management software