in reply to RE performance

Try it without the () in your first regex. The RE engine has to do a bit of setup work every time it enters or leaves a capturing () set during its passage through the candidate string, so they add a lot of overhead.

Also, your "||" version isn't very fair, since the /est/ always matches, meaning the /\d/ test is never made. But it is interesting how much faster the "constant" match for est is as opposed to the match for est|\d.

Oh, and use Benchmark:

#!/usr/bin/perl -w use strict; use Benchmark qw(cmpthese); my $i=0; sub capture { "test$i"=~/(est|\d)/; } sub noCapture { "test$i"=~/(?:est|\d)/; } sub noParens { "test$i"=~/est|\d/; } sub twoRegexen { "test$i"=~/est/ || "test$i"=~/\d/; } cmpthese(-3, { capture => \&capture, noCapture => \&noCapture, noParens => \&noParens, twoRegexen => \&twoRegexen, } );
Results (with 5.6.1, I'm sure 5.8.0 would be somewhat different):
$ perl testRegexen.pl Benchmark: running capture, noCapture, noParens, twoRegexen, each for +at least 3 CPU seconds... capture: 4 wallclock secs ( 3.18 usr + 0.00 sys = 3.18 CPU) @ 72 +8822.24/s (n=2320570) noCapture: 2 wallclock secs ( 3.10 usr + 0.01 sys = 3.11 CPU) @ 82 +7692.26/s (n=2576606) noParens: 4 wallclock secs ( 3.14 usr + 0.01 sys = 3.15 CPU) @ 87 +1936.55/s (n=2748344) twoRegexen: 4 wallclock secs ( 3.41 usr + -0.00 sys = 3.41 CPU) @ 14 +78048.90/s (n=5047537) Rate capture noCapture noParens twoRegexen capture 728822/s -- -12% -16% -51% noCapture 827692/s 14% -- -5% -44% noParens 871937/s 20% 5% -- -41% twoRegexen 1478049/s 103% 79% 70% --

--
Mike

Edit: Oops, when I added the noParens case I forgot to put it into the cmpthese call here, although I'd put it into my test script.

Replies are listed 'Best First'.
Re: Re: RE performance
by Zaxo (Archbishop) on Sep 04, 2002 at 14:19 UTC

    On perl 5.8.0 i686-linux-thread-multi with debugging:

    $ perl testRegexen.pl Benchmark: running capture, noCapture, noParens, twoRegexen for at lea +st 3 CPU seconds... capture: 4 wallclock secs ( 3.20 usr + 0.00 sys = 3.20 CPU) @ 14 +1750.00/s (n=453600) noCapture: 3 wallclock secs ( 3.04 usr + 0.02 sys = 3.06 CPU) @ 27 +9933.01/s (n=856595) noParens: 1 wallclock secs ( 3.34 usr + 0.00 sys = 3.34 CPU) @ 26 +1686.53/s (n=874033) twoRegexen: 5 wallclock secs ( 3.41 usr + 0.01 sys = 3.42 CPU) @ 54 +0171.35/s (n=1847386) Rate capture noParens noCapture twoRegexen capture 141750/s -- -46% -49% -74% noParens 261687/s 85% -- -7% -52% noCapture 279933/s 97% 7% -- -48% twoRegexen 540171/s 281% 106% 93% -- $

    Update: perl -Dr output, per request:

    After Compline,
    Zaxo

      Thanks!

      Isn't it strange that the noCapture case is faster than the noParens??

      Here's what use re 'debug' gets me for the 4 regex variants:

      Compiling REx `(est|\d)' size 10 first at 3 1: OPEN1(3) 3: BRANCH(6) 4: EXACT <est>(8) 6: BRANCH(8) 7: DIGIT(8) 8: CLOSE1(10) 10: END(0) minlen 1 Compiling REx `(?:est|\d)' size 7 1: BRANCH(4) 2: EXACT <est>(7) 4: BRANCH(6) 5: DIGIT(7) 6: TAIL(7) 7: END(0) minlen 1 Compiling REx `est|\d' size 6 1: BRANCH(4) 2: EXACT <est>(6) 4: BRANCH(6) 5: DIGIT(6) 6: END(0) minlen 1 Compiling REx `est' size 3 first at 1 1: EXACT <est>(3) 3: END(0) anchored `est' at 0 (checking anchored isall) minlen 3 Compiling REx `\d' size 2 first at 1 1: DIGIT(2) 2: END(0) stclass `DIGIT' minlen 1
      The TAIL(7) on the noCapture case looks redundant to me; maybe that's something they fixed in 5.8.0?

      What happens if you look at those regexes under 5.8.0 with -Dr or use re 'debug', please?

      ( Edit: Zaxo did this just above, and got the same results, except he gets offsets. No idea what those mean...)
      --
      Mike

        "TAIL" is the node that defines the end of a non-capturing parenthetical group or an if-then parenthetical group (like (?(1)a|b)). That's about all. It gets optimized out for non-capturing parens if there is no "BRANCH" found (that is, no "|" metacharacter), so that something like /A(?:B)C/ can be optimized to an "EXACT" node matching "ABC".

        Basically, "TAIL" takes the place of having another "indented" layer.

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        I think I got a line on that one (i finally read perldebguts), and I quote (cause I likes to ;)
        
            # Do nothing
            NOTHING     no      Match empty string.
            # A variant of above which delimits a group, thus stops optimizations
            TAIL        no      Match empty string. Can jump here from outside.
        
        So i guess this is because qr{} will generate patterns like that, and you normally wouldn't use /(?:foo|\d)/, you'd do something like /$pat|$QRed2/. This is because if perl sees
        4: BRANCH(6)
        5:   DIGIT(6)
        which means there are no patterns after DIGIT , it can optimize (cause the digit in parentheses is the index to the next pattern, or something along those lines), and that's not neccessarily true if you got (?:). I think it's one of those overlooked optimizations (hopefully japhy will answer concretely ~ i'm not a re => god yet ;D).

        No idea what this offsets thing is, I don't get it , but then I don't have perl compiled with the -DDEBUGGING, so it might relate to that somehow.

        ____________________________________________________
        ** The Third rule of perl club is a statement of fact: pod is sexy.