Utilitarian has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have ~3K CSV files whose records I need to aggregate (sum some columns, take the max of others). The following code fails for some, probably blindingly obvious, reason that I have failed to spot. @ARGV is a list of CSV files in the current directory, all of which contain the same number of records.

What I thought I wrote:

Open all files in @ARGV, assigning a filehandle to each in the @FH array.

While there are unread lines in the first file: read the corresponding line from each of the other files, add the count columns into @totals, keep the running maxima for the response-time and percentage columns, then write the combined record to totals.csv.

#! /usr/bin/perl
use strict;
use warnings;

# GLOBALS
my $DEBUG = 1; # 0 no debug messages, 1 light logging, >1 lots of logging
my (@FH, $TOTALS);

for my $index (0 .. $#ARGV) {
    open($FH[$index], "<", $ARGV[$index]) || die "can't open $ARGV[$index] $!\n";
    print "assigned ", $index-1, " to $ARGV[$index]\n" if $DEBUG;
}
open($TOTALS, ">", "totals.csv");

while (<{$FH[0]}>) {
    chomp;
    print "Current record is |$_|" if $DEBUG;
    my @totals = split; # space separated
    for my $FH (@FH[1 .. $#FH]) {
        my $record = readline($FH) || warn "cannot read file $!\n";
        chomp($record);
        print "$record\n" if ($DEBUG > 1);
        my @record = split(/ +/, $record);
        $totals[1] += $record[1];   # total read volume
        $totals[2] += $record[2];   # total write volume
        $totals[3] = $record[3] if $totals[3] < $record[3]; # max response time
        $totals[4] = $record[4] if $totals[4] < $record[4]; # max % waiting
        $totals[5] = $record[5] if $totals[5] < $record[5]; # max % blocked
        $totals[6]  += $record[6];  # total errors
        $totals[7]  += $record[7];  # total errors
        $totals[8]  += $record[8];  # total errors
        $totals[9]  += $record[9];  # total errors
        $totals[10] += $record[10]; # total errors
        $totals[11] += $record[11]; # total errors
        print "totalling record $record[0] for $#FH files\n" if $DEBUG;
    }
    print $TOTALS "@totals\n";
}
The while loop doesn't run over the whole file; it just runs once.

The current-record value printed by the debug check is

Current record is |GLOB(0x9d93e48)|

Re: aggregating columns in several csv files
by Anonymous Monk on Jun 05, 2009 at 07:58 UTC
    You need readline, not glob :)
    C:\>perl -MO=Deparse,-p -e"while (<{$FH[0]}>){print}"
    use File::Glob ();
    while (defined(($_ = glob((('{' . $FH[0]) . '}'))))) {
        print($_);
    }
    -e syntax OK

    C:\>perl -MO=Deparse,-p -e"while(readline $FH[0]){print}"
    while (defined(($_ = readline($FH[0])))) {
        print($_);
    }
    -e syntax OK
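    In other words, a minimal sketch of the fix (assuming @FH holds the open handles as in the original post): the diamond only acts as readline on a bare filehandle or a plain scalar, so an array element needs either an explicit readline() or a scalar copy.

        # Explicit readline on the array element:
        while (defined(my $line = readline($FH[0]))) {
            chomp $line;
            print "Current record is |$line|\n";
        }

        # Or copy the handle into a plain scalar so the diamond reads, not globs:
        my $first = $FH[0];
        while (my $line = <$first>) {
            chomp $line;
            print "Current record is |$line|\n";
        }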
      Many thanks for the advice; I was doing my nut over this one, as I couldn't see where the error was arising.
Re: aggregating columns in several csv files
by Corion (Patriarch) on Jun 05, 2009 at 07:50 UTC

    I don't see at first glance why your program would only loop once, but it seems to me that $_ is being set not to a line from the file but to a glob, though I don't see why. In any case, you'll encounter other problems.

    Most operating systems have a limit on the number of files a program can have open at the same time. This limit is usually around 250 or around 1020, so you'll be hitting that limit with your 3000 files.

    I would import all the CSV files into a database, for example SQLite is very convenient for that. If you can't even install SQLite anywhere, you can potentially even get along using a BTree database like DB_File, as your key is only a single column.

    If you have the data in an SQL database, calculating the totals etc. becomes trivial, as SQL has the sum() and max() aggregates.
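    For illustration, a hypothetical sketch of that last step (the table and column names here are made up, not from the post; it needs DBI and DBD::SQLite, which are not core modules):

        use strict;
        use warnings;
        use DBI;

        # Hypothetical: assumes the rows were imported into a table "stats"
        # with one column per field of the CSV records.
        my $dbh = DBI->connect("dbi:SQLite:dbname=stats.db", "", "",
                               { RaiseError => 1 });
        my ($reads, $writes, $max_resp) = $dbh->selectrow_array(
            "SELECT sum(read_vol), sum(write_vol), max(resp_time) FROM stats"
        );
        print "total reads=$reads total writes=$writes max response=$max_resp\n";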

      Hi Corion, thanks for the quick reply, however
      MaxFDLimit ~= 65K
      ulimit -n 4096
      So I'm not hitting the OS FD limit; I was hitting my shell's limit until I adjusted ulimit -n.

      If it were in a database this would be easier; however, I'm restricted in the tools to hand for this (I can't take data off the server). It's also worth noting that I can only use core modules, as I can't install anything on the server :(

        Anonymous Monk has found why the diamond operator does not work.

        DB_File or any other BTree storage is likely already installed with your Perl, like DBM_File or SDBM_File, and they can be used to conveniently rearrange the data and address it by a common key.
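        For instance, a minimal sketch (the file name "totals.db" and the key format are made up): tying a hash to a BTree file keeps the keys sorted on disk, so a row can be stored and fetched directly by its key.

            use strict;
            use warnings;
            use DB_File;
            use Fcntl;

            # Tie a hash to a BTree file; keys stay sorted on disk.
            tie my %by_key, 'DB_File', 'totals.db', O_CREAT | O_RDWR, 0666, $DB_BTREE
                or die "cannot tie totals.db: $!";
            $by_key{'2009-06-05'} = join ' ', 1, 2, 3;   # store a row under its key
            untie %by_key;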

        Have you considered concatenating all your CSV files into a single file? If you then sort that single file by the columns you want to use as keys, all your calculations become much simpler as you only have one set of variables that you purge whenever your key changes.
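        A rough sketch of that sort-then-scan pattern (the column roles are assumptions taken from the original post: key in column 0, sums in columns 1-2 and 6-11, maxima in columns 3-5, non-negative values throughout):

            #!/usr/bin/perl
            use strict;
            use warnings;

            # Run as:  sort -k1,1 all.csv | perl aggregate.pl > totals.csv
            # so that all lines sharing a key arrive consecutively.
            my $key;
            my @totals = (0) x 12;

            sub purge {                       # emit the finished group and reset
                print "@totals\n" if defined $key;
                @totals = (0) x 12;
            }

            while (my $line = <STDIN>) {
                chomp $line;
                my @rec = split /\s+/, $line;
                purge() if defined $key && $rec[0] ne $key;   # key just changed
                $key = $totals[0] = $rec[0];
                $totals[$_] += $rec[$_] for 1, 2, 6 .. 11;    # running sums
                for my $i (3 .. 5) {                          # running maxima
                    $totals[$i] = $rec[$i] if $totals[$i] < $rec[$i];
                }
            }
            purge();                          # flush the last group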

        I will mention "Yes, even you can use CPAN", but only for future reference, as DBD::SQLite needs a C compiler, so you can't simply paste it into your script.

Re: aggregating columns in several csv files
by NiJo (Friar) on Jun 06, 2009 at 10:40 UTC
    Your problem seems to be solved by now, but some issues come to mind:

    1) The length of command lines (@ARGV) is most likely limited to about 512 chars and would not contain 3k of file names. I'd check if totals.csv matches the output of the calling program.

    2) Concatenating all files before aggregating makes your problem go away.

    3) Buffering @ARGV into totals.csv is not required.

    4) Move opening and closing files into the outer loop, so that you have just one file open at a time. This is the standard approach (see the sketch after this list).

    5) Depending on the size of the files, the available RAM and your performance requirements, it might make sense to 'slurp' the input files into memory in one go (or one by one). Your aggregation then works on a 2D array.
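    A minimal sketch combining points 1), 4) and 5) (the column roles are assumptions taken from the original post, and glob('*.csv') is a stand-in for passing 3000 names on the command line):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # One file open at a time; @totals is a 2D array with one row per
        # record position. Assumes every file has the same number of records,
        # non-negative values, and the key of the first file in column 0.
        my @files = @ARGV ? @ARGV : glob('*.csv');
        my @totals;

        for my $file (@files) {
            open my $fh, '<', $file or die "can't open $file: $!\n";
            my $row = 0;
            while (my $line = <$fh>) {
                chomp $line;
                my @rec = split /\s+/, $line;
                my $t = $totals[$row] ||= [ $rec[0], (0) x 11 ];
                $t->[$_] += $rec[$_] for 1, 2, 6 .. 11;   # sums
                for my $i (3 .. 5) {                      # maxima
                    $t->[$i] = $rec[$i] if $t->[$i] < $rec[$i];
                }
                $row++;
            }
            close $fh;    # only one file is ever open at a time
        }

        open my $out, '>', 'totals.csv' or die "can't open totals.csv: $!\n";
        print $out "@$_\n" for @totals;
        close $out;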