in reply to Re: RE performance
in thread RE performance

On perl 5.8.0 i686-linux-thread-multi with debugging:

$ perl testRegexen.pl Benchmark: running capture, noCapture, noParens, twoRegexen for at lea +st 3 CPU seconds... capture: 4 wallclock secs ( 3.20 usr + 0.00 sys = 3.20 CPU) @ 14 +1750.00/s (n=453600) noCapture: 3 wallclock secs ( 3.04 usr + 0.02 sys = 3.06 CPU) @ 27 +9933.01/s (n=856595) noParens: 1 wallclock secs ( 3.34 usr + 0.00 sys = 3.34 CPU) @ 26 +1686.53/s (n=874033) twoRegexen: 5 wallclock secs ( 3.41 usr + 0.01 sys = 3.42 CPU) @ 54 +0171.35/s (n=1847386) Rate capture noParens noCapture twoRegexen capture 141750/s -- -46% -49% -74% noParens 261687/s 85% -- -7% -52% noCapture 279933/s 97% 7% -- -48% twoRegexen 540171/s 281% 106% 93% -- $

Update: perl -Dr output, per request:

$ perl -Dr testRegexen.pl Compiling REx `(est|\d)' size 10 Got 84 bytes for offset annotations. first at 3 1: OPEN1(3) 3: BRANCH(6) 4: EXACT <est>(8) 6: BRANCH(8) 7: DIGIT(8) 8: CLOSE1(10) 10: END(0) minlen 1 Offsets: [10] 1[1] 0[0] 1[1] 2[3] 0[0] 5[1] 6[2] 8[1] 0[0] 9[0] Compiling REx `(?:est|\d)' size 7 Got 60 bytes for offset annotations. 1: BRANCH(4) 2: EXACT <est>(7) 4: BRANCH(6) 5: DIGIT(7) 6: TAIL(7) 7: END(0) minlen 1 Offsets: [7] 3[1] 4[3] 0[0] 7[1] 8[2] 9[0] 11[0] Compiling REx `est|\d' size 6 Got 52 bytes for offset annotations. 1: BRANCH(4) 2: EXACT <est>(6) 4: BRANCH(6) 5: DIGIT(6) 6: END(0) minlen 1 Offsets: [6] 0[0] 1[3] 0[0] 4[1] 5[2] 7[0] Compiling REx `est' size 3 Got 28 bytes for offset annotations. first at 1 rarest char s at 1 1: EXACT <est>(3) 3: END(0) anchored `est' at 0 (checking anchored isall) minlen 3 Offsets: [3] 1[3] 0[0] 4[0] Compiling REx `\d' size 2 Got 20 bytes for offset annotations. first at 1 1: DIGIT(2) 2: END(0) stclass `DIGIT' minlen 1 Offsets: [2] 1[2] 3[0] Omitting $` $& $' support. EXECUTING... $

After Compline,
Zaxo

Replies are listed 'Best First'.
Re^3: RE performance
by RMGir (Prior) on Sep 04, 2002 at 14:38 UTC
    Thanks!

    Isn't it strange that the noCapture case is faster than the noParens??

    Here's what use re 'debug' gets me for the 4 regex variants:

    Compiling REx `(est|\d)' size 10 first at 3 1: OPEN1(3) 3: BRANCH(6) 4: EXACT <est>(8) 6: BRANCH(8) 7: DIGIT(8) 8: CLOSE1(10) 10: END(0) minlen 1 Compiling REx `(?:est|\d)' size 7 1: BRANCH(4) 2: EXACT <est>(7) 4: BRANCH(6) 5: DIGIT(7) 6: TAIL(7) 7: END(0) minlen 1 Compiling REx `est|\d' size 6 1: BRANCH(4) 2: EXACT <est>(6) 4: BRANCH(6) 5: DIGIT(6) 6: END(0) minlen 1 Compiling REx `est' size 3 first at 1 1: EXACT <est>(3) 3: END(0) anchored `est' at 0 (checking anchored isall) minlen 3 Compiling REx `\d' size 2 first at 1 1: DIGIT(2) 2: END(0) stclass `DIGIT' minlen 1
    The TAIL(7) on the noCapture case looks redundant to me; maybe that's something they fixed in 5.8.0?

    What happens if you look at those regexes under 5.8.0 with -Dr or use re 'debug', please?

    ( Edit: Zaxo did this just above, and got the same results, except he gets offsets. No idea what those mean...)
    --
    Mike

      "TAIL" is the node that defines the end of a non-capturing parenthetical group or an if-then parenthetical group (like (?(1)a|b)). That's about all. It gets optimized out for non-capturing parens if there is no "BRANCH" found (that is, no "|" metacharacter), so that something like /A(?:B)C/ can be optimized to an "EXACT" node matching "ABC".

      Basically, "TAIL" takes the place of having another "indented" layer.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      I think I got a line on that one (i finally read perldebguts), and I quote (cause I likes to ;)
      
          # Do nothing
          NOTHING     no      Match empty string.
          # A variant of above which delimits a group, thus stops optimizations
          TAIL        no      Match empty string. Can jump here from outside.
      
      So i guess this is because qr{} will generate patterns like that, and you normally wouldn't use /(?:foo|\d)/, you'd do something like /$pat|$QRed2/. This is because if perl sees
      4: BRANCH(6)
      5:   DIGIT(6)
      which means there are no patterns after DIGIT , it can optimize (cause the digit in parentheses is the index to the next pattern, or something along those lines), and that's not neccessarily true if you got (?:). I think it's one of those overlooked optimizations (hopefully japhy will answer concretely ~ i'm not a re => god yet ;D).

      No idea what this offsets thing is, I don't get it , but then I don't have perl compiled with the -DDEBUGGING, so it might relate to that somehow.

      ____________________________________________________
      ** The Third rule of perl club is a statement of fact: pod is sexy.