in reply to Simple line parse question

Thanks for the comments.

My goal is basically to achieve similar timing to the following in awk:

echo "a b c d e f g" | awk '{print $3$4}'

I find it hard to believe that my split&join solution is the fastest Perl has to offer to achieve this. This one little line of code in my script is actually turning out to be quite the performance hotspot, so I thought, why not ask here to see if there's a faster way I'm unaware of. I've already tried the following, but they're all slower than my split&join:

1: ... | perl -ne 'printf("%s%s", (split(" ", $_, 5))[2,3]);'
2: ... | perl -ne 'print /(?:\S+ ){2}(\S+) (\S+)/'
3: ... | perl -ane 'print "$F[2]$F[3]";'
4: I even wrote my own subroutine using index/substr to extract what I need ...
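
For what it's worth, here is a rough Benchmark sketch that lines those candidates up against each other on a single sample line (the sample line, the labels, and the index/substr variant are just illustrations, not my exact code):

#!/usr/bin/perl
# Rough sketch: compares the approaches above on one sample line using the
# core Benchmark module. Labels and the index/substr variant are illustrative.
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = "a b c d e f g\n";

cmpthese(-3, {
    split_join   => sub { my $out = join "", (split " ", $line, 5)[2, 3] },
    split_printf => sub { my $out = sprintf "%s%s", (split " ", $line, 5)[2, 3] },
    regex        => sub { my $out = $line =~ /(?:\S+ ){2}(\S+) (\S+)/ ? "$1$2" : "" },
    index_substr => sub {
        my $p   = index $line, " ", index($line, " ") + 1;   # end of field 2
        my $q   = index $line, " ", $p + 1;                  # end of field 3
        my $r   = index $line, " ", $q + 1;                  # end of field 4
        my $out = substr($line, $p + 1, $q - $p - 1)
                . substr($line, $q + 1, $r - $q - 1);
    },
});

Of course, a toy line like this mostly measures per-call overhead; the real comparison has to be run against the actual input, where line length and I/O dominate.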

I guess I'm hoping someone will introduce me to a new technique. We can't let the awk'ers have this one so easily can we?

Replies are listed 'Best First'.
Re^2: Simple line parse question
by roboticus (Chancellor) on Aug 07, 2010 at 13:03 UTC

    jimmy.pl:

    We can't let the awk'ers have this one so easily can we?

    Keep in mind that awk is a more specialized tool than perl, so it's really not important if awk can do some things faster than perl. It's fine to care about runtime speed, but until a program actually needs to be faster, time spent optimizing it is simply time wasted. If you enjoy working overtime, then have at it. But I find it better to spend that time with family, friends, goofing off, etc.

    Remember: first make it work. Then make it work correctly. Next, check if it meets requirements. If, and only if, it fails to meet speed requirements, make it faster.

    ...roboticus

    Assembly language: Fun and runs fastest! I haven't had to use it since around 1995.

    C/C++: Fun and runs fast! I use it for everything I need to make faster.

    Perl: Fun and fastest to write! Fast enough runtime for 95+% of everything I do.

      I think that roboticus is "on it"!

      From my experience, the coding efficiency of Perl vs C is in the range of 3:1 to 10:1. Recoding a five-page C program into a one-page Perl program that achieves the same functionality would not be a surprising result.

      The Perl program will run at something like 1/3 the speed of the C program or less, but very often this does not matter at all! Perl OO vs., say, C++ is a different matter and carries an additional performance penalty.

      My only slight "nit" with this would be about assembly. In the past decade, the C "super optimizing" compilers have become so good that you have to be a real guru at ASM to beat them. It is possible for very focused tasks, but it is certainly not easy! Some folks can actually wind up writing slower ASM code than the compiler produces.

        Marshall:

        Your 3:1 to 10:1 experience feels about right to me.

        Regarding assembly language: I'll have to agree ... mostly.

        For embedded systems with small processors, you can still beat a compiler once you know enough about the chip and the application. On larger computers with modern CPUs, when starting from scratch, you're dead on--it's hard to beat a good compiler. However, when I dive into assembly language on a modern CPU, I don't start from scratch: I let the compiler generate the first pass for me. Then I dig into the processor manuals, examine the algorithm, and improve the code from there. (And only if I can find a bottleneck I can reasonably expect to improve.)

        When I started programming in assembler (Z-80, 6502, 68000, 8051 days), it was easy to beat a compiler because compilers weren't that good (generally), and the CPU timings were easy to understand. You could read the instruction timings and generally have a good shot at improving the speed on your first attempts. This is still true on most small embedded systems (AVR, PIC, etc.).

        Once caching became popular, things started to get "interesting". You had to understand how your code, the cache and history affected things. With caching, the timings became trickier, as code and data access speeds changed. You could make a decent guess about how to improve the algorithm, but you could easily be surprised when things didn't work the way you expected. You needed more insight into your algorithm to make significant speed improvements. So larger embedded systems using these processors (80386, ARM, etc.) are harder to improve, and the compilers for them tend to be smarter.

        When the Pentium came out with the U and V pipes, things got downright hard. There was enough complexity in the timings, the multiple cache levels and figuring out how to keep both the U and V pipelines busy that improving execution speed involved a lot of guesswork. At that point you'd have to really chew on the problem, and you had to measure things frequently. You could no longer rely on the instruction timings to give you a good guess unless you really had a feel for how everything interacted. Even then it was finicky. Of course, these processors are so fast that dropping down into assembler is much less common.

        Now with speculative execution, branch prediction, etc., I'm not sure I could beat a compiler. And even if I did, it would be the result of many guesses and experiments. I've not had a need to improve the code for any CPU more advanced than a Pentium II, so I don't know how much more difficult it is to optimize code in that environment. I guess I'll have to find some time to play around with my Atom, Athlon XP and Pentium IV computers and see what I can do...

        And I've totally ignored the introduction of multitasking, too. Once that came in, you had no idea how other applications were going to impact yours. Compilers, too, became a lot better. Now, on the (very rare) occasions when I drop into assembly, it's either an embedded system that's easy to understand and where I *truly* need the speed, or it's just (a) pure fun, (b) a challenge for myself, and/or (c) an exercise to keep myself sharp.

        ...roboticus

Re^2: Simple line parse question
by Anonymous Monk on Aug 07, 2010 at 11:46 UTC
    I find it hard to believe that my split&join solution is the fastest perl has to offer to achieve this.

    Believe? Is that Swahili for Benchmark?

Re^2: Simple line parse question
by Marshall (Canon) on Aug 07, 2010 at 12:54 UTC
    Whoa!
    This is very "awk_weird".
    Give us an input file and an expected result.

      You can generate the input yourself. For example:

      xxx@xxx:~/test/perl$ seq 100 1000000 | perl -ne 'print int(rand($_)), "\n"' | xargs -n10 echo > a
      xxx@xxx:~/test/perl$ wc -l a
      99991 a
      xxx@xxx:~/test/perl$ for i in {1..100}; do cat a; done > b
      xxx@xxx:~/test/perl$ wc -l b
      9999100 b
      xxx@xxx:~/test/perl$ cat b | time -p awk '{print $3$4}' > /dev/null
      real 8.78
      user 7.89
      sys 0.38
      xxx@xxx:~/test/perl$ cat b | time -p perl -ne 'print join("", (split(" ", $_, 5))[2,3]),"\n";' > /dev/null
      real 13.78
      user 12.93
      sys 0.32
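
      For reference, a rough pure-Perl sketch of the same generation step might look like this (the file names "a" and "b" and the counts just mirror the session above; it's an illustration, not the exact commands used):

      #!/usr/bin/perl
      # Sketch: mirrors the shell session above (seq | perl | xargs, then 100 x cat)
      # in plain Perl. File names and counts follow that session.
      use strict;
      use warnings;

      open my $a_fh, '>', 'a' or die "open a: $!";
      my @row;
      for my $n (100 .. 1_000_000) {            # same range as seq 100 1000000
          push @row, int rand $n;
          if (@row == 10) {                     # same grouping as xargs -n10
              print {$a_fh} "@row\n";
              @row = ();
          }
      }
      print {$a_fh} "@row\n" if @row;           # trailing short line -> 99991 lines
      close $a_fh;

      open my $b_fh, '>', 'b' or die "open b: $!";
      for (1 .. 100) {                          # like: for i in {1..100}; do cat a; done
          open my $in, '<', 'a' or die "open a: $!";
          print {$b_fh} $_ while <$in>;
          close $in;
      }
      close $b_fh;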

        This file "b" is a pretty huge thing.
        It is so big that I can't run your script to completion without exceeding my disk quota on the Linux machine that I have access to.

        However, a typical line has 10 integers on it. On the machine I tested on, Perl can process 300,000 lines like that in about 1.3 seconds. You are not using the "power of Perl". Perl combines the ease of use of a scripting language with the execution efficiency of a compiled language.

        I love 'C' and I'm pretty good at assembly when I have to do it, BUT for just a few lines of code that can process over 200K lines per second, I don't see the need for either.

        #!/usr/bin/perl -w
        use strict;
        use Benchmark;
        # file: jimmy.pl

        timethese (5, {
            jimmy => q{ jimmy(); },
        } );

        sub jimmy
        {
            open (IN,  '<', "b")         or die;
            open (OUT, '>', "/dev/null") or die;
            my $numlines = 0;
            while (<IN>)
            {
                next if /^\s+$/;       # skip blank lines
                my @words = split;
                next if (@words < 4);  # something strange here
                                       # happens just a very, very few times but
                                       # there is a flaw in "b" file generation
                print OUT @words[2,3], "\n";
                $numlines++;
            }
            print "num lines read = $numlines\n";
        }
        __END__
        [prompt]$ jimmy.pl
        Benchmark: timing 5 iterations of jimmy...
        num lines read = 299970
        num lines read = 299970
        num lines read = 299970
        num lines read = 299970
        num lines read = 299970
        jimmy: 6 wallclock secs ( 6.29 usr + 0.05 sys = 6.34 CPU) @ 0.79/s (n=5)
        Update: To do the simple math (6.34 CPU sec / 5 runs ~ 1.27 sec per 300K lines):
        1.27 sec / 300K ~ x / 1000K
        x ~ 4.2 sec
        That's approx 236,000 lines per second.
        And that appears to me to be very fast.
        At that rate, 12 seconds could process nearly 3 million lines (not bytes).
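
        A tiny sanity-check script for that arithmetic, fed straight from the Benchmark figures above (just an illustration):

        #!/usr/bin/perl
        # Recomputes the rate from the Benchmark output above:
        # 5 iterations of 299,970 lines in 6.34 CPU seconds.
        use strict;
        use warnings;

        my $lines_per_iter = 299_970;
        my $iterations     = 5;
        my $cpu_total      = 6.34;

        my $rate = $lines_per_iter * $iterations / $cpu_total;          # ~236,000 lines/sec
        printf "rate:           %.0f lines/sec\n", $rate;
        printf "1M lines in:    %.1f sec\n", 1_000_000 / $rate;         # ~4.2 sec
        printf "12 sec handles: %.1f million lines\n", $rate * 12 / 1e6;  # ~2.8 million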

        Your benchmark is not realistic.
        xxx@xxx:~/test/perl$ wc -l b
        9999100 b
        "b" is a file with 9,999,100 LINES in it or about 10 million.
        How often do you actually process a single file containing 10 million lines?
        I think that this is very rare!

        If you want to benchmark Perl vs some awk thing, get realistic!