Re: RE performance

Try it without the () in your first regex. The RE engine has to do a bit of setup work every time it enters or leaves a capturing () set during its passage through the candidate string, so they add a lot of overhead.

Also, your "||" version isn't very fair, since the /est/ always matches, meaning the /\d/ test is never made. But it is interesting how much faster the "constant" match for est is as opposed to the match for est|\d.

Oh, and use Benchmark:

#!/usr/bin/perl -w

use strict;
use Benchmark qw(cmpthese);

my $i=0;

sub capture {
    "test$i"=~/(est|\d)/;
}

sub noCapture {
    "test$i"=~/(?:est|\d)/;
}

sub noParens {
    "test$i"=~/est|\d/;
}

sub twoRegexen {
    "test$i"=~/est/ || "test$i"=~/\d/;
}

cmpthese(-3,
        {
            capture   => \&capture,
            noCapture => \&noCapture,
            noParens => \&noParens,
            twoRegexen => \&twoRegexen,
        }
     );
[download]

Results (with 5.6.1, I'm sure 5.8.0 would be somewhat different):

$ perl testRegexen.pl
Benchmark: running capture, noCapture, noParens, twoRegexen, each for 
+at least 3
 CPU seconds...
   capture:  4 wallclock secs ( 3.18 usr +  0.00 sys =  3.18 CPU) @ 72
+8822.24/s (n=2320570)
 noCapture:  2 wallclock secs ( 3.10 usr +  0.01 sys =  3.11 CPU) @ 82
+7692.26/s (n=2576606)
  noParens:  4 wallclock secs ( 3.14 usr +  0.01 sys =  3.15 CPU) @ 87
+1936.55/s (n=2748344)
twoRegexen:  4 wallclock secs ( 3.41 usr + -0.00 sys =  3.41 CPU) @ 14
+78048.90/s  (n=5047537)
                Rate    capture  noCapture   noParens twoRegexen
capture     728822/s         --       -12%       -16%       -51%
noCapture   827692/s        14%         --        -5%       -44%
noParens    871937/s        20%         5%         --       -41%
twoRegexen 1478049/s       103%        79%        70%         --
[download]

--
Mike

Edit: Oops, when I added the noParens case I forgot to put it into the cmpthese call here, although I'd put it into my test script.

Comment on Re: RE performance Select or Download Code

Replies are listed 'Best First'.
Re: Re: RE performance by Zaxo (Archbishop) on Sep 04, 2002 at 14:19 UTC
On perl 5.8.0 i686-linux-thread-multi with debugging: $ perl testRegexen.pl Benchmark: running capture, noCapture, noParens, twoRegexen for at lea +st 3 CPU seconds... capture: 4 wallclock secs ( 3.20 usr + 0.00 sys = 3.20 CPU) @ 14 +1750.00/s (n=453600) noCapture: 3 wallclock secs ( 3.04 usr + 0.02 sys = 3.06 CPU) @ 27 +9933.01/s (n=856595) noParens: 1 wallclock secs ( 3.34 usr + 0.00 sys = 3.34 CPU) @ 26 +1686.53/s (n=874033) twoRegexen: 5 wallclock secs ( 3.41 usr + 0.01 sys = 3.42 CPU) @ 54 +0171.35/s (n=1847386) Rate capture noParens noCapture twoRegexen capture 141750/s -- -46% -49% -74% noParens 261687/s 85% -- -7% -52% noCapture 279933/s 97% 7% -- -48% twoRegexen 540171/s 281% 106% 93% -- $ [download] Update: perl -Dr output, per request: Read more... (2 kB) After Compline, Zaxo	[reply] [d/l] [select]
Re^3: RE performance by RMGir (Prior) on Sep 04, 2002 at 14:38 UTC
Thanks! Isn't it strange that the noCapture case is faster than the noParens?? Here's what use re 'debug' gets me for the 4 regex variants: Compiling REx `(est\|\d)' size 10 first at 3 1: OPEN1(3) 3: BRANCH(6) 4: EXACT <est>(8) 6: BRANCH(8) 7: DIGIT(8) 8: CLOSE1(10) 10: END(0) minlen 1 Compiling REx `(?:est\|\d)' size 7 1: BRANCH(4) 2: EXACT <est>(7) 4: BRANCH(6) 5: DIGIT(7) 6: TAIL(7) 7: END(0) minlen 1 Compiling REx `est\|\d' size 6 1: BRANCH(4) 2: EXACT <est>(6) 4: BRANCH(6) 5: DIGIT(6) 6: END(0) minlen 1 Compiling REx `est' size 3 first at 1 1: EXACT <est>(3) 3: END(0) anchored `est' at 0 (checking anchored isall) minlen 3 Compiling REx `\d' size 2 first at 1 1: DIGIT(2) 2: END(0) stclass `DIGIT' minlen 1 [download] The TAIL(7) on the noCapture case looks redundant to me; maybe that's something they fixed in 5.8.0? What happens if you look at those regexes under 5.8.0 with -Dr or use re 'debug', please? ( Edit: Zaxo did this just above, and got the same results, except he gets offsets. No idea what those mean...) -- Mike	[reply] [d/l]
Re: Re^3: RE performance by japhy (Canon) on Sep 06, 2002 at 14:38 UTC
"TAIL" is the node that defines the end of a non-capturing parenthetical group or an if-then parenthetical group (like `(?(1)a\|b)`). That's about all. It gets optimized out for non-capturing parens if there is no "BRANCH" found (that is, no "\|" metacharacter), so that something like `/A(?:B)C/` can be optimized to an "EXACT" node matching "ABC". Basically, "TAIL" takes the place of having another "indented" layer. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l] [select]
Re: Re^3: RE performance by PodMaster (Abbot) on Sep 05, 2002 at 06:44 UTC
I think I got a line on that one (i finally read perldebguts), and I quote (cause I likes to ;) # Do nothing NOTHING no Match empty string. # A variant of above which delimits a group, thus stops optimizations TAIL no Match empty string. Can jump here from outside. So i guess this is because qr{} will generate patterns like that, and you normally wouldn't use /(?:foo\|\d)/, you'd do something like /$pat\|$QRed2/. This is because if perl sees 4: BRANCH(6) 5: DIGIT(6) which means there are no patterns after DIGIT , it can optimize (cause the digit in parentheses is the index to the next pattern, or something along those lines), and that's not neccessarily true if you got (?:). I think it's one of those overlooked optimizations (hopefully japhy will answer concretely ~ i'm not a re => god yet ;D). No idea what this offsets thing is, I don't get it , but then I don't have perl compiled with the -DDEBUGGING, so it might relate to that somehow. `____________________________________________________` ** The Third rule of perl club is a statement of fact: pod is sexy.	[reply]