egunnar has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks. I was at a party last weekend where someone told me Perl was extremely slow and that it is wiser to use the shell. Anyway, the guy at the party told me about a program he and a coworker (who he claimed knew Perl well) wrote which ran in 3 seconds in awk and C and 3 minutes in Perl. The program parsed a text file of 1 million lines, grabbed the hundredth field (fields were delimited with commas), and returned the sum of that field over the whole file. As you can imagine, I was very skeptical of this claim. Not having that much experience in Perl, I mentioned it to one of our senior programmers at work and he set out to prove the claim wrong. My coworker's (he is a better Perl programmer than me) Perl and C code is below. The best we were able to do is get Perl to run at about 1/20 the speed of C. We are running on Red Hat Linux with Perl v5.8.4 and gcc 3.2.3. We used the Linux time command to benchmark our results. As you can see (by the comments), we tried a couple of different ideas. Are we missing anything that would make our program faster? Thanks for your help. Erik

#!/usr/bin/perl -w
#use integer;
my $file = 'test_data.dat';
open (IN, $file) or die "Can't open file - $file - $!";
my $data;
print "Reading: [$file]\n";
my $val = 0;
my @arr;
while (<IN>) {
    /(?:\d+,){99}(\d+),/;
    $val += $1;
    #print "$1\n";
    #@arr = split(',', $_);
    #$val = $arr[99];
    #print "$val\n\n";
    print STDERR "Working on: [$.]:[$val]\r" unless ($. % 100_000);
}
print STDERR "Final: [$val]\n";
exit;


#include "stdio.h" int main(){ char buf[100000], *ptr, *ptr2; FILE *fptr; int count = 0, i; long total = 0l; fptr = fopen("test_data.dat", "r"); if (!fptr){ printf ("Can't open test_data.dat\n"); exit(1); } while (fgets(buf,100000,fptr)){ /* printf("Read: [%s]\n", buf); */ ptr = buf; for (i = 0; i < 99; i++){ ptr++; ptr = index(ptr,','); /* printf("Read: [%d]:[%s]\n", i, ptr); */ } ptr++; ptr2 = index(ptr,','); *ptr2 = '\0'; total += atol(ptr); /* printf("Read: [%s][%ld]\n", ptr, total); */ count++; } printf ("Read %d records, total: [%ld]", count, total); }

Replies are listed 'Best First'.
Re: fast way to split and sum an input file
by GrandFather (Saint) on Aug 29, 2005 at 21:19 UTC

    There are a number of ways of measuring speed. At the end of the day "time to correct solution" is much more important than "time to execute the program". In this case the (cleaned up) code that you supplied ran in 10 seconds on my system. However the Perl code is about 5 times shorter than the C code so the time to write the Perl code will have been less and the time to debug it will have been much less!

    Which do you like better:

    • five minutes of coding + 10 seconds of run time
    • fifteen minutes of coding + 1 second of run time

    Perl is Huffman encoded by design.

      If this is a program that will be executed repeatedly, I will most likely go with "fifteen minutes + 1 second".

      If performance is important in this case, everybody should be happy to let the guy go with C, as that's the right thing to do.

      On the other hand, my personal experience tells me that Perl is, in general, a fast language. Don't be afraid to say so, even after this incident.

        For me, performance has to be really important, such as a program that will be running hundreds of times per day. Or for a server, where a small increase in performance means fewer total systems to buy and maintain.

        The difference between 1 second and 1 minute in the daily grind isn't much at all. Especially considering how long I spend just waiting for some applications to start up in the first place.

        If the process "multiplier" is only 1 (me, once a day), 1 minute is just long enough for a trip to the soda machine, so I won't mind.

        Depending on the C compiler, updates can take 10 minutes (compile, link, fix stupid bugs, repeat) compared to Perl.

        I once had a class project to write 1/3 of a trading system (along with my 2 teammates, who wrote the other 2/3). They used C, I used Perl. One of them implemented the interface wrong, so our project didn't work. We had 20 minutes to deadline, and he was going to need 2 hours to find the bug and fix it. Instead, I changed my Perl version to handle his mistake, and we turned it in 5 minutes after we discovered the problem (including testing).

        Speed of execution is rarely a requirement.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Re: fast way to split and sum an input file
by ikegami (Patriarch) on Aug 29, 2005 at 21:02 UTC

    [^,] is slightly (8%) faster than \d, and don't forget to comment out print STDERR "Working on: ..." from the Perl version, since you don't have the equivalent in the C code.
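    For what it's worth, here is a minimal sketch of the posted loop with that substitution; it reuses the variables from the original script and drops the progress print so the comparison with the C version is fair:

    # Sketch: the posted loop matching [^,] instead of \d,
    # with the per-100_000-line progress print removed.
    while (<IN>) {
        $val += $1 if /(?:[^,]+,){99}([^,]+),/;
    }
    print STDERR "Final: [$val]\n";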

    It's not surprising that Perl is slower. It's no secret. While Perl's speed is decent, it's not its strong point. One strong point is the speed at which programs are written in Perl. For example, it took no time at all to write the following Perl solution to your problem:

    perl -ne "END { print(qq{Total: $t\n}); } $t += (split ',')[99]" file
      Another strong point is TMTOWTDI, and that perl has lots of cool built-ins (OP, see perldoc perlrun) for these types of tasks:
      perl -F, -lane '$x+=$F[99]; END{print $x}' data.txt
Re: fast way to split and sum an input file
by sk (Curate) on Aug 29, 2005 at 20:58 UTC
Re: fast way to split and sum an input file
by runrig (Abbot) on Aug 30, 2005 at 00:02 UTC
    someone told me perl was extremely slow and it is wiser to use the shell.

    Tell him he's an idiot and it's wiser to use the right tool for the right job (ok, leave that first part out if you want to be diplomatic).

    ...(They) wrote which ran in 3 secs in awk and c and 3 minutes in perl.

    I used just a naive split in perl, and got about 25 seconds (perl) for processing 100,000 lines of 100 fields vs. about 10 seconds in awk. The perl took about 45 seconds if I used the "-a" and "-F" options instead of just using $tot += (split /,/)[99].

    But what does that mean? For something insanely simple like this, I might use awk (and the 'C' just does not look insanely simple, so I don't think I'd do it that way). If I wanted to do something more complicated, like keeping group sums of combinations of other fields, and key it to some other nested data structures, and look up some other data in a database and some other data from the web, then it's wiser to get it done easily in your favorite scripting language.
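    For reference, here is the "naive split" loop mentioned above as a standalone sketch (the file name is a stand-in for whatever test data you have):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Naive split: break every line on commas and add up field 100.
    my $tot = 0;
    open my $in, '<', 'test_data.dat' or die "Can't open test_data.dat: $!";
    while (<$in>) {
        $tot += (split /,/)[99];
    }
    close $in;
    print "Total: $tot\n";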

    FYI, the awk code (I have an old awk that only allows 99 fields):

    #!/usr/bin/awk -f
    BEGIN { FS="," }
    { tot += $99 }
    END { print tot }
      #1 - This is a totally arbitrary BS test.
      #2 - The right tool for the job argument has no effect against people who don't like the language.
      #3 - The original "challenge" was:
      Take 1 million records.
      They will contain numbers and be comma separated and have at least 100 fields.
      Sum the hundredth field.

      The so-called expert got it to run in 3 seconds in C, 3 seconds in awk, and 3 MINUTES in Perl.

      In my historical experience from about 10 years ago, after being a C programmer for 10 years already, most text processing in Perl was quite close to C (timing wise), while being MUCH easier to write and debug.

      So of course, I called BS. And had to test.

      The test file is about 486MB.
      The test system is a Quad Opteron - 2.2Ghz.
      Data is in Linux cache memory during test.

      My C code (just optimized, pulled the 1st index out and replaced it with a pointer loop walking the string counting commas) is now down to 1.766 seconds.

      My simple 2-line awk is 1 minute 7 seconds - I say the original awk claim is a total crock; maybe it got confused somewhere in a game of whisper down the lane.

      My Perl code under 5.8 ran in 54 seconds, and under 5.6.1 it ran in 17 seconds. Good reason to keep 5.6.1 around.

      Of course, after coding in C for quite a while, and moving to Perl 10 years ago, I KNOW this is a crock, but for our very large data warehousing code (with lots of text munging that can run for days) we occasionally need to determine if the code effort is worth the trade off.

      So at this point the difference for this simple test is 10 to 1. I suspect that with a more text-focused task, a bit more complex and a bit harder to optimize in C, the tradeoff would lean to Perl for both speed of execution AND development.
Re: fast way to split and sum an input file
by Joost (Canon) on Aug 29, 2005 at 22:55 UTC
    perl was extremely slow and it is wiser to use the shell.
    I find that hard to believe. Perl is slower than C, but if you're dealing with string manipulation, it's one of the fastest "scripting languages" around. I wouldn't be too surprised if a small awk script could do this task a bit faster than perl - awk was written for exactly this kind of task.

    Anyway, you're pitching straight C against perl in your example - I don't see any awk or shell script - and for your particular code, it's not that surprising that the C code is faster - though I'd guess (blindly) that for this kind of task 1/5th of the speed of a C program is attainable.

    But your C code isn't equivalent to the perl code: AFAIK the C code just reads 10,000 bytes at a time, grabs the 100th field in those 10,000 bytes, and then reads the next 10,000 bytes. If I read it correctly, it'll even fail to get the 100th field on the first line if the very first field is empty (my C is rusty). Update: my C is indeed rusty: your code will work correctly as long as all lines are less than 10K in length and they don't start with an empty field. Your perl code, on the other hand, reads the file line by line and gets the 100th field for that line. As others have stated in this thread, one of the benefits of using perl vs C is that it just takes a lot less time to write a correct program in perl vs C - and computers don't get paid by the hour :-)

Re: fast way to split and sum an input file
by saintmike (Vicar) on Aug 29, 2005 at 20:53 UTC
    Using perl's index function instead of a regex or split should speed it up considerably.

      I didn't think so. I tested to be sure:

      regexp:     509845
      index_loop: 509845

                    Rate index_loop regexp
      index_loop  12.5/s         --   -51%
      regexp      25.6/s       104%     --
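      For reference, a minimal sketch of how such a comparison could be set up with the Benchmark module. The sample record, the sub names, and the index-based extraction below are assumptions for illustration, not the exact code that produced the numbers above:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # Hypothetical sample record: 100 comma-separated integer fields
      # followed by a trailing comma, as in the original test data.
      my $line = join(',', 1 .. 100) . ',';

      my ($sum_re, $sum_ix) = (0, 0);

      cmpthese(-3, {
          regexp => sub {
              $sum_re += $1 if $line =~ /(?:[^,]+,){99}([^,]+),/;
          },
          index_loop => sub {
              my $pos = -1;
              $pos = index($line, ',', $pos + 1) for 1 .. 99;   # skip the first 99 commas
              my $end = index($line, ',', $pos + 1);            # end of field 100
              $sum_ix += substr($line, $pos + 1, $end - $pos - 1);
          },
      });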