sk has asked for the wisdom of the Perl Monks concerning the following question:
I was watching someone trying to find the sum of a particular column in a file. He was loading it into a data application (meant to work with large files) just to compute this sum. Setting up the app takes a while, since you have to read the entire file, assign variable names, and so on. So I wrote this one-liner, which worked great (and saved a lot of time):
cut -d, -f7 in.csv | perl -nle '$sum += $_; print "Sum = $sum" if eof;'
I could avoid the "cut", but I didn't see a huge advantage (if someone can give me a good reason to avoid cut, that would be nice!)...
Please note that the file is pretty large (around 5 million rows and a few hundred columns)... Since it worked out well, that person asked me how to modify the code to work for 5 columns. I immediately split each row on /,/ into an array and looped through the list to add to the column sums as each new row came in. Little did I realize at the time that this would have horrible performance... After letting it run for a few minutes, I realized that looping millions of times was not such a good idea (a bad idea, rather?)... Maybe just declaring 5 scalar variables would have been better...
So my question is how would Monks handle such a problem?
Thanks all for your time!
-SK
Replies are listed 'Best First'.
Re: Efficient way to sum columns in a file
  by dave0 (Friar) on Apr 13, 2005 at 05:08 UTC
    by sk (Curate) on Apr 13, 2005 at 05:44 UTC

Re: Efficient way to sum columns in a file
  by tlm (Prior) on Apr 13, 2005 at 11:55 UTC
    by Anonymous Monk on Apr 13, 2005 at 13:48 UTC
    by tlm (Prior) on Apr 13, 2005 at 14:04 UTC

Re: Efficient way to sum columns in a file
  by Random_Walk (Prior) on Apr 13, 2005 at 12:58 UTC
    by sk (Curate) on Apr 13, 2005 at 18:12 UTC
    by Roy Johnson (Monsignor) on Apr 13, 2005 at 20:40 UTC

Re: Efficient way to sum columns in a file
  by eibwen (Friar) on Apr 13, 2005 at 06:51 UTC