dekul has asked for the wisdom of the Perl Monks concerning the following question:

I am parsing a CSV file and I would like to turn the columns into individual arrays contained in a hash, which I can access by name (the names come from the header line of the input file).

Example input file (MyCSV.csv):

Name,Cost,Inventory
pickles,2.99,12
vegemite,4.00,8
nuclear sub,22.50,4

What I have done thus far:

use List::Util ();    # needed for the List::Util::sum calls below

open IN, "MyCSV.csv" or die "Cannot open input file: $!\n";
my ($line, @colNames, $size, %columns, $j);
$j = 0;
chomp($line = <IN>);    # Read the column header
@colNames = split(',', $line);
$size = scalar @colNames;
while ($line = <IN>) {
    chomp($line);
    # Cycle through the column names, pushing each field onto its column
    push @{$columns{$colNames[$j++ % $size]}}, $_ for split(',', $line);
}

# Now I can play with whole columns of data by name
print "Total Inventory = ", List::Util::sum(@{$columns{'Inventory'}}), "\n";
print "Total Cost = ",      List::Util::sum(@{$columns{'Cost'}}),      "\n";

A couple of quick notes:

  1. I know I should be using Text::CSV or Text::CSV_XS, but this is just for testing
  2. I am fairly new to Perl, but I really like it, and I think there is a better, more Perl-tastic way of doing this
  3. I tried preallocating the arrays in the hash (using $#{ $columns{$colName} } = 100, where $colName came from iterating over @colNames), but I ran into problems where push started adding at position 101, leaving the preallocated slots empty (a short demonstration follows this list)
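
For the record, the behavior in note 3 is push's documented behavior: presetting $#{ $columns{$colName} } really does create that many slots, and push always appends after the last element, preallocated or not. A minimal standalone demonstration (hypothetical data):

use strict;
use warnings;

my %columns;
$#{ $columns{Cost} } = 99;                # preallocate indices 0..99 (all undef)
push @{ $columns{Cost} }, 2.99;           # push appends AFTER the last slot
print scalar @{ $columns{Cost} }, "\n";   # prints 101 -- 2.99 landed at index 100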

I really like this approach because I don't have to create $size separate arrays; I just need to know what the columns are called (which is fairly standardized for my data files). I thought there might be a way of transposing the rows/columns after reading the file completely into memory; then I could just store Text::CSV(_XS)->fields() as my columns of data, but I didn't see such a facility in Text::CSV(_XS).
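
The closest I can sketch to that transpose idea on top of Text::CSV_XS is something like this (untested, and it transposes while reading rather than afterwards):

use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 });
open my $fh, "<", "MyCSV.csv" or die "Cannot open input file: $!\n";
my @colNames = @{ $csv->getline($fh) };    # header row

my %columns;
while (my $row = $csv->getline($fh)) {
    # distribute each field of the row into its named column array
    push @{ $columns{ $colNames[$_] } }, $row->[$_] for 0 .. $#$row;
}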

The runtime on this isn't horrible, but it is still linear in the number of columns (though not so linear in the number of rows). I have done time trials on 16 to 128 columns with 10,000 rows and the runtime is between 0.2 and 0.9 seconds, respectively, but I feel it could be better. Any thoughts are greatly appreciated.
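
For anyone who wants to rerun the trials, a minimal hypothetical harness using the core Benchmark module (not necessarily what I used, but it times the same loop):

use strict;
use warnings;
use Benchmark qw(timethis);

open my $fh, '<', 'MyCSV.csv' or die "Cannot open input file: $!\n";
chomp(my $header = <$fh>);
my @colNames = split ',', $header;
chomp(my @lines = <$fh>);    # slurp the data rows once, outside the timing

timethis(100, sub {
    my %columns;
    for my $line (@lines) {
        my @fields = split ',', $line;
        push @{ $columns{ $colNames[$_] } }, $fields[$_] for 0 .. $#fields;
    }
});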

Thanks.
- Tim

Re: CSV Columns: Friend or foe?
by MidLifeXis (Monsignor) on Jul 27, 2010 at 13:43 UTC

    I would suggest a different data structure. You are using a hash of arrays (HoA); consider an array of hashes (AoH) instead, one hash per row. A parallel set of arrays can often be handled better as an AoH, using map to pull out the requested column.

    You can assign the rows using a hash slice or a for loop. You show that you know how to use the loop, so here is a hash slice:

    my (@records, %record);
    @record{@colNames} = split(',', $line);   # hash slice: pair column names with fields
    push @records, { %record };               # push a copy so each row gets its own hash

    To extract data from this structure:

    sub total_inventory {
        my $data = shift;    # expects a reference to the array of row hashes
        List::Util::sum(map { $_->{'Inventory'} } @$data);
    }

    The data is kept together (shuffling one column will not break the relationships with the rest of the columns).

    If you use DBI with one of the CSV modules, you may even be able to use the functionality of SQL's sum() directly. When (ok, if) your data moves into a database, you can then migrate with (hopefully) less effort.
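
    A hypothetical sketch of that DBI route (untested; assumes DBD::CSV is installed and MyCSV.csv from the original post is in the current directory):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
        f_dir      => ".",
        f_ext      => ".csv/r",    # map table "MyCSV" to file "MyCSV.csv"
        RaiseError => 1,
    });
    my ($inv, $cost) = $dbh->selectrow_array(
        "SELECT SUM(Inventory), SUM(Cost) FROM MyCSV");
    print "Total Inventory = $inv\nTotal Cost = $cost\n";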

    --MidLifeXis

Re: CSV Columns: Friend or foe?
by Anonymous Monk on Jul 27, 2010 at 02:14 UTC
    You could do it with an array instead of a hash. You'd have to benchmark it to see whether that makes a noticeable difference. Untested code follows.

    chomp(my $line = <IN>);
    my @head = split(',', $line);
    my @cols;
    my $row = 0;
    while ($line = <IN>) {
        chomp($line);
        my @row = split ',', $line;
        $cols[$_][$row] = $row[$_] for 0 .. $#row;
        $row++;
    }

    And if I were really prematurely optimizing this, I'd use @row - 1 instead of $#row!
      Hi Sir, in this code, what does IN stand for?
Re: CSV Columns: Friend or foe?
by Gangabass (Vicar) on Jul 27, 2010 at 04:58 UTC

    What about this way (I'm sure it's slower):

    foreach my $header (@headers) {
        push @{ $columns{$header} }, shift @row;
    }
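
    Filled out with the surrounding read loop for context (hypothetical, untested):

    my %columns;
    chomp(my $header_line = <IN>);
    my @headers = split ',', $header_line;
    while (my $line = <IN>) {
        chomp $line;
        my @row = split ',', $line;
        foreach my $header (@headers) {
            # shift consumes @row left to right, in header order
            push @{ $columns{$header} }, shift @row;
        }
    }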
Re: CSV Columns: Friend or foe?
by suhailck (Friar) on Jul 27, 2010 at 03:42 UTC
    perl -MData::Dumper -e 'chomp(@header=split ",",<>);while(<>){chomp;@{$AoH[$row++]}{@header}=split ",";}print Dumper(\@AoH)' filename

    ~suhail
Re: CSV Columns: Friend or foe?
by doug (Pilgrim) on Jul 28, 2010 at 02:23 UTC

    I realize that this is a learning experience, but you've left out a lot of the complexity of CSV. Namely, a comma inside a double-quoted field is not a separator. Basically

     a,"b,c",d

    only has 3 values. Using split() is only safe once you've looked for quotes. And since this is a chance for you to learn, I'll leave out any hints.
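
    To see the failure mode, here is what a naive split does to that line (quick untested check):

    my @fields = split /,/, 'a,"b,c",d';
    print scalar @fields, "\n";    # prints 4, not 3: a / "b / c" / d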

    Good luck.

    - doug

Re: CSV Columns: Friend or foe?
by Tux (Canon) on Dec 01, 2010 at 13:38 UTC

    Never start down the road that leads to failure: parsing CSV files with split.

    Text::CSV_XS and Text::CSV already do what you are after in an easy and reliable way:

    $ cat test.pl
    #!/pro/bin/perl

    use strict;
    use warnings;
    use autodie;

    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ binary => 1 });
    open my $fh, "<", "test.csv";
    $csv->column_names($csv->getline($fh));
    while (my $h = $csv->getline_hr($fh)) {
        print "$h->{Name} => $h->{Cost}, $h->{Inventory}\n";
    }

    $ cat test.csv
    Name,Cost,Inventory
    pickles,2.99,12
    vegemite,4.00,8
    nuclear sub,22.50,4

    $ perl test.pl
    pickles => 2.99, 12
    vegemite => 4.00, 8
    nuclear sub => 22.50, 4
    $

    Enjoy, Have FUN! H.Merijn

      And with a very small change, your sum's too:

      my %sum;
      $csv->column_names($csv->getline($fh));
      while (my $h = $csv->getline_hr($fh)) {
          print "$h->{Name} => $h->{Cost}, $h->{Inventory}\n";
          no warnings "numeric";    # Name is not numeric; skip the warning in the sums
          $sum{$_} += $h->{$_} for $csv->column_names;
      }
      print <<EOS;
      Total Inventory: $sum{Inventory}
      Total Cost: $sum{Cost}
      EOS

      =>

      pickles => 2.99, 12
      vegemite => 4.00, 8
      nuclear sub => 22.50, 4
      Total Inventory: 24
      Total Cost: 29.49

      Enjoy, Have FUN! H.Merijn
Re: CSV Columns: Friend or foe?
by JavaFan (Canon) on Dec 01, 2010 at 13:58 UTC
    The runtime on this isn't horrible, but it is still linear in the number of columns (though not so linear in the number of rows).
    Considering you are looping over the rows, how do you expect your run time to not be at least linear in the number of rows?

    Also, considering you are reading in the entire line, you cannot really avoid being linear in the number of columns. In fact, reading in your entire file makes the program at least linear in the size of the data - and since each row and each column is represented by at least one character in the data, your running time is at least linear in the product of the number of rows and the number of columns.

    Not that this is bad. All I want to point out is that being linear in the number of columns is actually very typical for any program reading tabular data.
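
    Concretely: the 128-column, 10,000-row trial from the original post is 1,280,000 fields; finishing in 0.9 seconds works out to roughly 1.4 million fields per second, and the 0.2-to-0.9-second spread scales with the total amount of data rather than with the column count alone.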