in reply to fast way to split and sum an input file

someone told me perl was extremely slow and it is wiser to use the shell.

Tell him he's an idiot and it's wiser to use the right tool for the right job (ok, leave that first part out if you want to be diplomatic).

...(They) wrote (a program) which ran in 3 secs in awk and C and 3 minutes in Perl.

I used just a naive split in Perl and got about 25 seconds for processing 100,000 lines of 100 fields, vs. about 10 seconds in awk. The Perl run took about 45 seconds if I used the "-a" and "-F" options instead of just using $tot += (split /,/)[99].
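
For reference, the two variants were one-liners along these lines (reconstructed, so treat the exact invocations as approximate; data.csv stands in for the test file):

# naive split, taking the 100th field (index 99) directly
perl -ne '$tot += (split /,/)[99]; END { print "$tot\n" }' data.csv

# autosplit into @F via -a, with -F setting the field separator
perl -F, -ane '$tot += $F[99]; END { print "$tot\n" }' data.csv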

But what does that mean? For something insanely simple like this, I might use awk (and the C version just does not look insanely simple, so I don't think I'd do it that way). If I wanted to do something more complicated, like keeping group sums over combinations of other fields, keying them into nested data structures, and looking up other data in a database and on the web, then it's wiser to get it done easily in your favorite scripting language.

FYI, the awk code (I have an old awk that only allows 99 fields):

#!/usr/bin/awk -f
BEGIN { FS = "," }
{ tot += $99 }
END { print tot }

This is my code
by broomberg2 (Initiate) on Aug 30, 2005 at 13:36 UTC
    #1 - This is a totally arbitrary BS test.
    #2 - The "right tool for the job" argument has no effect against people who don't like the language.
    #3 - The original "challenge" was:
    Take 1 million records.
    They will contain numbers and be comma separated and have at least 100 fields.
    Sum the hundredth field.
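
    A throwaway one-liner along these lines (a sketch, not necessarily the exact generator used) will produce a file of that shape:

    # 1 million lines, 100 comma-separated random integers each
    perl -e 'for (1 .. 1_000_000) { print join(",", map { int rand 1000 } 1 .. 100), "\n" }' > data.csv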

    The so-called expert got it to run in 3 seconds in C, 3 seconds in awk, and 3 MINUTES in Perl.

    In my experience from about 10 years ago, after I had already been a C programmer for 10 years, most text processing in Perl ran quite close to C (timing-wise) while being MUCH easier to write and debug.

    So of course, I called BS. And had to test.

    The test file is about 486 MB.
    The test system is a quad Opteron at 2.2 GHz.
    The data sits in the Linux page cache during the test, so disk I/O doesn't skew the timings.

    My C code (just optimized: I pulled the first indexing pass out and replaced it with a pointer loop that walks the string counting commas) is now down to 1.766 seconds.
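
    The same trick translates to Perl: scan for commas with index() and pull out just the 100th field, instead of splitting all 100 fields on every line. Here is a sketch of that technique (not the C code above; it assumes well-formed lines with at least 100 fields):

    #!/usr/bin/perl
    # walk past 99 commas, then grab only the field we need,
    # avoiding the cost of building a 100-element list per line
    my $tot = 0;
    while (<>) {
        chomp;                                    # drop the newline so the last field is clean
        my $pos = 0;
        $pos = 1 + index($_, ',', $pos) for 1 .. 99;
        my $end = index($_, ',', $pos);
        $end = length $_ if $end < 0;             # field 100 is the last field on the line
        $tot += substr($_, $pos, $end - $pos);
    }
    print "$tot\n";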

    My simple 2-line awk takes 1 minute 7 seconds. I say the original awk timing is a total crock; maybe it got garbled in a game of whisper down the lane.

    My Perl code under 5.8 took 54 seconds, and under 5.6.1 it ran in 17 seconds. Good reason to keep 5.6.1 around.

    Of course, after coding in C for quite a while and moving to Perl 10 years ago, I KNOW the 3-minute claim is a crock. But for our very large data warehousing code (with lots of text munging that can run for days), we occasionally need to determine whether the coding effort is worth the trade-off.

    So at this point the difference for this simple test is about 10 to 1 (1.766 seconds in C vs. 17 seconds in Perl 5.6.1). I suspect that with a more text-focused task, a bit more complex and harder to optimize in C, the trade-off would lean toward Perl for both speed of execution AND speed of development.