in reply to perl performance vs egrep

How about a two-stage process of elimination (please see the 'Update' first):

while ( <IFILE> ) {
    next unless m/^[CKMPSWY]/;
    print OFILE unless /^(?:CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1/;
}

This sort of "optimization" is highly sensitive to the type of data. However, if lines starting with [CKMPSWY] are sparse, the cheap first test will reject most non-matches before the more expensive alternation ever runs.

Update: Uggh, my logic is backwards in that snippet. The point is that if you can reject non-matches earlier, with fewer cycles, you save a little time.
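Here is a minimal sketch of what the un-backwards version might look like. It assumes the goal is to keep the lines egrep would print, i.e. the ones that match the full pattern. The sub name filter_lines and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns only the lines that match the full pattern.
sub filter_lines {
    my @kept;
    for my $line (@_) {
        # Stage 1: a cheap character-class test throws out most non-matches.
        next unless $line =~ /^[CKMPSWY]/;
        # Stage 2: the full alternation only runs on the survivors.
        push @kept, $line if $line =~ /^(?:CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1/;
    }
    return @kept;
}

# Made-up sample data: only the CPXX1 and KMXX1 lines pass both stages.
my @sample = ("CPXX1 foo\n", "ABXX1 bar\n", "KMXX1 baz\n", "KXXX1 qux\n");
print filter_lines(@sample);
```

Whether stage 1 buys anything depends on the data and on what the regex engine already optimizes, so time it against the single-regex version before keeping it.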

I also wanted to mention the "print late" philosophy. If you're IO bound, you might be better off storing a few dozen lines in a scalar and printing just once every few-dozen iterations instead of on every iteration. This will maximize the effectiveness of the OS's buffering.
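A rough sketch of the "print late" idea, for illustration only. The sub name print_batched and the batch size of 36 are arbitrary choices of mine, and the demo writes to an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: appends lines to a scalar buffer and prints the
# buffer once every $batch_size lines. Returns the number of print calls.
sub print_batched {
    my ($out, $batch_size, @lines) = @_;
    my ($buffer, $pending, $calls) = ('', 0, 0);
    for my $line (@lines) {
        $buffer .= $line;
        next if ++$pending < $batch_size;
        print {$out} $buffer;
        ($buffer, $pending) = ('', 0);
        $calls++;
    }
    # Flush whatever is left over after the loop.
    if (length $buffer) {
        print {$out} $buffer;
        $calls++;
    }
    return $calls;
}

# Demo: 100 lines in batches of 36 means 3 print calls instead of 100.
open my $out, '>', \my $captured or die "Cannot open in-memory handle: $!";
my $calls = print_batched($out, 36, map { "line $_\n" } 1 .. 100);
close $out;
print "print calls: $calls\n";
```

Whether this actually helps depends on the platform and on the buffering perl already does for you, so benchmark it on your own system before relying on it.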


Dave

Re^2: perl performance vs egrep
by demerphq (Chancellor) on Jan 24, 2005 at 13:57 UTC

    If you're IO bound, you might be better off storing a few dozen lines in a scalar and printing just once every few-dozen iterations instead of on every iteration.

    I'm confused by this comment. The only reason I can see such a strategy making a difference is that it will reduce the number of print calls. I'd be surprised if user caching is more effective than the caching that perl itself does.

    ---
    demerphq

      print is just plain expensive. Here's a really quickly hacked together benchmark:

      $ perl -we'open my $zap, ">", "/dev/null";
          use Benchmark "cmpthese";
          cmpthese -10, {
              every => sub { for (1..100) { print $zap $_ } },
              tens  => sub { my $x; for (1..100) { $x .= $_; print $zap substr($x,0,1000,"") unless $x % 10 } }
          }'

        This must be OS dependent or something... (hard to say as you didn't post the output). On my system print with no cache is much faster:

        open my $zap, ">", "NULL:";
        use Benchmark "cmpthese";
        cmpthese -10, {
            every => sub { for (1..100) { print $zap $_ } },
            tens  => sub {
                my $x;
                for (1..100) {
                    $x .= $_;
                    unless ($x % 10) { print $zap $x; $x=""; }
                }
                print $zap $x if length $x;
            },
        }
        __END__
        # modified code as posted above (5.6.1 on W2k)
                 Rate  tens every
        tens   1159/s    --  -90%
        every 11285/s  874%    --

        # original code (5.6.1 on W2k)
                 Rate  tens every
        tens   1124/s    --  -90%
        every 11219/s  899%    --

        # 5.8.4 (XP)
                 Rate  tens every
        tens   8305/s    --  -73%
        every 30329/s  265%    --

        # 5.8.6 No Implicit Sys (XP)
                 Rate  tens every
        tens   9655/s    --  -68%
        every 29755/s  208%    --

        At least on Win32 it would seem just using print is much faster.... I guess this could be an example of what Steve Hay was talking about with Win32's realloc being crap and that it makes concatenation unnecessarily slow. Update: No, that doesn't make sense, I just tried it with two different perl versions on two different win32 boxes and "every" won every one of them....

        Ignore this for now, it's got a bug: "NUL" not "NULL:", as ysth pointed out. I'll redo it when I have access to all those perl versions again.

        ---
        demerphq