Utilitarian has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have ~3K CSV files whose records I need to aggregate (sum some columns, take the max of others). The following code fails for some, probably blindingly obvious, reason that I have failed to spot. @ARGV is a list of CSV files in the current directory, all of which contain the same number of records.

What I thought I wrote:

Open all files in @ARGV, assigning a filehandle to each in the @FH array.

While there are unread lines in the first file: read the corresponding line from each of the other files, add the count columns into @totals, keep the running maxima for the response-time and percentage columns, then write the combined record to totals.csv.

#! /usr/bin/perl
use strict;
use warnings;

# GLOBALS
my $DEBUG = 1; # 0 no debug messages, 1 light logging, >1 lots of logging
my (@FH, $TOTALS);

for my $index (0 .. $#ARGV) {
    open($FH[$index], "<", $ARGV[$index]) || die "can't open $ARGV[$index] $!\n";
    print "assigned ", $index-1, " to $ARGV[$index]\n" if $DEBUG;
}
open($TOTALS, ">", "totals.csv");

while (<{$FH[0]}>) {
    chomp;
    print "Current record is |$_|" if $DEBUG;
    my @totals = split; # space separated
    for my $FH (@FH[1 .. $#FH]) {
        my $record = readline($FH) || warn "cannot read file $!\n";
        chomp($record);
        print "$record\n" if ($DEBUG > 1);
        my @record = split(/ +/, $record);
        $totals[1] += $record[1];   # total read volume
        $totals[2] += $record[2];   # total write volume
        $totals[3] = $record[3] if $totals[3] < $record[3]; # max response time
        $totals[4] = $record[4] if $totals[4] < $record[4]; # max % waiting
        $totals[5] = $record[5] if $totals[5] < $record[5]; # max % blocked
        $totals[6]  += $record[6];  # total errors
        $totals[7]  += $record[7];  # total errors
        $totals[8]  += $record[8];  # total errors
        $totals[9]  += $record[9];  # total errors
        $totals[10] += $record[10]; # total errors
        $totals[11] += $record[11]; # total errors
        print "totalling record $record[0] for $#FH files\n" if $DEBUG;
    }
    print $TOTALS "@totals\n";
}
The while loop doesn't run over the whole file; it just runs once.

The current-record value printed by the debug check is

Current record is |GLOB(0x9d93e48)|

Re: aggregating columns in several csv files
by Anonymous Monk on Jun 05, 2009 at 07:58 UTC
    You need readline, not glob :)
    C:\>perl -MO=Deparse,-p -e"while (<{$FH[0]}>){print}"
    use File::Glob ();
    while (defined(($_ = glob((('{' . $FH[0]) . '}'))))) {
        print($_);
    }
    -e syntax OK

    C:\>perl -MO=Deparse,-p -e"while(readline $FH[0]){print}"
    while (defined(($_ = readline($FH[0])))) {
        print($_);
    }
    -e syntax OK
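    In other words, a minimal sketch of the fix (assuming @FH holds the open handles as in the original post): the diamond only acts as readline on a bare filehandle or a plain scalar, so an array element needs either an explicit readline() or a scalar copy.

        # Explicit readline on the array element:
        while (defined(my $line = readline($FH[0]))) {
            chomp $line;
            print "Current record is |$line|\n";
        }

        # Or copy the handle into a plain scalar so the diamond reads, not globs:
        my $first = $FH[0];
        while (my $line = <$first>) {
            chomp $line;
            print "Current record is |$line|\n";
        }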
      Many thanks for the advice; I was doing my nut over this one, as I couldn't see where the error was arising.
Re: aggregating columns in several csv files
by Corion (Patriarch) on Jun 05, 2009 at 07:50 UTC

    I don't see at first glance why your program would only loop once, but it seems to me that $_ is being set not to a line from the file but to a glob, though I don't see why. In any case, you'll encounter other problems.

    Most operating systems have a limit on the number of files a program can have open at the same time. This limit is usually around 250 or around 1020, so you'll be hitting that limit with your 3000 files.

    I would import all the CSV files into a database, for example SQLite is very convenient for that. If you can't even install SQLite anywhere, you can potentially even get along using a BTree database like DB_File, as your key is only a single column.

    If you have the data in an SQL database, calculating the totals etc. becomes trivial, as SQL has the sum() and max() aggregates.
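    For illustration, a hypothetical sketch of that last step (the table and column names here are made up, not from the post; it needs DBI and DBD::SQLite, which are not core modules):

        use strict;
        use warnings;
        use DBI;

        # Hypothetical: assumes the rows were imported into a table "stats"
        # with one column per field of the CSV records.
        my $dbh = DBI->connect("dbi:SQLite:dbname=stats.db", "", "",
                               { RaiseError => 1 });
        my ($reads, $writes, $max_resp) = $dbh->selectrow_array(
            "SELECT sum(read_vol), sum(write_vol), max(resp_time) FROM stats"
        );
        print "total reads=$reads total writes=$writes max response=$max_resp\n";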

      Hi Corion, thanks for the quick reply, however
      MaxFDLimit ~= 65K
      ulimit -n 4096
      So I'm not hitting the OS FD limit; I was hitting my shell's limit until I adjusted ulimit -n.

      If it were in a database this would be easier; however, I'm restricted in the tools to hand for this (I can't take data off the server). It's also worth noting that I can only use core modules, as I can't install anything on the server :(

        Anonymous Monk has found why the diamond operator does not work.

        DB_File or any other BTree storage is likely already installed with your Perl, like DBM_File or SDBM_File, and they can be used to conveniently rearrange the data and address it by a common key.
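        For instance, a minimal sketch (the file name "totals.db" and the key format are made up): tying a hash to a BTree file keeps the keys sorted on disk, so a row can be stored and fetched directly by its key.

            use strict;
            use warnings;
            use DB_File;
            use Fcntl;

            # Tie a hash to a BTree file; keys stay sorted on disk.
            tie my %by_key, 'DB_File', 'totals.db', O_CREAT | O_RDWR, 0666, $DB_BTREE
                or die "cannot tie totals.db: $!";
            $by_key{'2009-06-05'} = join ' ', 1, 2, 3;   # store a row under its key
            untie %by_key;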

        Have you considered concatenating all your CSV files into a single file? If you then sort that single file by the columns you want to use as keys, all your calculations become much simpler as you only have one set of variables that you purge whenever your key changes.
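        A rough sketch of that sort-then-scan pattern (the column roles are assumptions taken from the original post: key in column 0, sums in columns 1-2 and 6-11, maxima in columns 3-5, non-negative values throughout):

            #!/usr/bin/perl
            use strict;
            use warnings;

            # Run as:  sort -k1,1 all.csv | perl aggregate.pl > totals.csv
            # so that all lines sharing a key arrive consecutively.
            my $key;
            my @totals = (0) x 12;

            sub purge {                       # emit the finished group and reset
                print "@totals\n" if defined $key;
                @totals = (0) x 12;
            }

            while (my $line = <STDIN>) {
                chomp $line;
                my @rec = split /\s+/, $line;
                purge() if defined $key && $rec[0] ne $key;   # key just changed
                $key = $totals[0] = $rec[0];
                $totals[$_] += $rec[$_] for 1, 2, 6 .. 11;    # running sums
                for my $i (3 .. 5) {                          # running maxima
                    $totals[$i] = $rec[$i] if $totals[$i] < $rec[$i];
                }
            }
            purge();                          # flush the last group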

        I will mention "Yes, even you can use CPAN", but only for future reference, as DBD::SQLite needs a C compiler, so you can't simply paste it into your script.

Re: aggregating columns in several csv files
by NiJo (Friar) on Jun 06, 2009 at 10:40 UTC
    Your problem seems to be solved by now, but some issues come to mind:

    1) The length of command lines (@ARGV) is most likely limited to about 512 chars and would not contain 3k of file names. I'd check if totals.csv matches the output of the calling program.

    2) Concatenating all files before aggregating makes your problem go away.

    3) Buffering @ARGV into totals.csv is not required.

    4) Move opening and closing files into the outer loop, so that you have just one file open at a time. This is the standard approach (see the sketch after this list).

    5) Depending on the size of the files, the available RAM and your performance requirements, it might make sense to 'slurp' the input files into memory in one go (or one by one). Your aggregation then works on a 2D array.
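    A minimal sketch combining points 1), 4) and 5) (the column roles are assumptions taken from the original post, and glob('*.csv') is a stand-in for passing 3000 names on the command line):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # One file open at a time; @totals is a 2D array with one row per
        # record position. Assumes every file has the same number of records,
        # non-negative values, and the key of the first file in column 0.
        my @files = @ARGV ? @ARGV : glob('*.csv');
        my @totals;

        for my $file (@files) {
            open my $fh, '<', $file or die "can't open $file: $!\n";
            my $row = 0;
            while (my $line = <$fh>) {
                chomp $line;
                my @rec = split /\s+/, $line;
                my $t = $totals[$row] ||= [ $rec[0], (0) x 11 ];
                $t->[$_] += $rec[$_] for 1, 2, 6 .. 11;   # sums
                for my $i (3 .. 5) {                      # maxima
                    $t->[$i] = $rec[$i] if $t->[$i] < $rec[$i];
                }
                $row++;
            }
            close $fh;    # only one file is ever open at a time
        }

        open my $out, '>', 'totals.csv' or die "can't open totals.csv: $!\n";
        print $out "@$_\n" for @totals;
        close $out;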