in reply to Efficient way to sum columns in a file
I generated 500,000 lines of random CSV with this script:
#!/usr/bin/perl
use strict;
use warnings;

# create source numbers if they don't exist
my $many   = 500000;
my $source = 'numbers.csv';

open CSV, '>', $source or die "can't write to $source: $!\n";

for (1..$many) {
    my @numbers;
    push @numbers, (int rand 1000) for (1..5);
    print CSV join ",", @numbers;
    print CSV $/;
}
Then I tried a few one-liners to sum the columns. I ran each twice and post the second timing, to allow for caching.
nph>time cat numbers.csv | perl -nle'@d=split /,/;$a[$_]+=$d[$_] for (0..4);END{print join "\t", @a}'
249959157   249671314   249649377   250057435   249420634

real    0m17.10s
user    0m15.46s
sys     0m0.08s

nph>time perl -nle'my @d=split /,/;$a[$_]+=$d[$_] for (0..4);END{print join "\t", @a}' numbers.csv
249959157   249671314   249649377   250057435   249420634

real    0m13.71s
user    0m12.77s
sys     0m0.04s

nph>time perl -nle'my($a,$b,$c,$d,$e)=split /,/;$ta+=$a, $tb+=$b, $tc+=$c, $td+=$d, $te+=$e;END{print join "\t", $ta,$tb,$tc,$td,$te}' numbers.csv
249959157   249671314   249649377   250057435   249420634

real    0m6.45s
user    0m5.91s
sys     0m0.07s
The last one (named scalars instead of an array indexed in a loop) was consistently faster than the second across several repeated runs.
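If you want to isolate the per-line cost from file I/O, one way is to compare the two loop bodies over in-memory lines with the core Benchmark module. The sketch below is my own illustration, not part of the timings above; the 10,000-line sample size and the subroutine names are arbitrary choices.

# Minimal sketch (assumptions noted above): compare array indexing in a
# loop against named scalars, using Benchmark::cmpthese on in-memory data.
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build a small in-memory sample: 5 comma-separated integers per line.
my @lines = map { join ',', map { int rand 1000 } 1 .. 5 } 1 .. 10_000;

cmpthese(-3, {
    array_index => sub {
        my @a = (0) x 5;
        for (@lines) {
            my @d = split /,/;
            $a[$_] += $d[$_] for 0 .. 4;
        }
    },
    named_scalars => sub {
        my ($ta, $tb, $tc, $td, $te) = (0) x 5;
        for (@lines) {
            my ($a, $b, $c, $d, $e) = split /,/;
            $ta += $a; $tb += $b; $tc += $c; $td += $d; $te += $e;
        }
    },
});

cmpthese prints a rate table for the two styles, which should show the same ordering as the wall-clock timings above without the read-from-disk overhead.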
Cheers,
R.
Replies are listed 'Best First'.

Re^2: Efficient way to sum columns in a file
by sk (Curate) on Apr 13, 2005 at 18:12 UTC

Re^2: Efficient way to sum columns in a file
by Roy Johnson (Monsignor) on Apr 13, 2005 at 20:40 UTC