alternation in regexes: to use or to avoid?

http://qs1969.pair.com?node_id=1008105

dk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I've hit a performance issue with using | in regexes. It seems that in some (not-so-degenerated, actually) cases it loses significantly to looping over simple regexes i.e. $str =~ /$_// for @rx is much faster than $str =~ /$rx[0]|$rx[1]|$rx[2]/, which is rather counter-intuitive. Basically, for general cases, it would mean that alterations with grouping should be avoided at all, which is a strong statement and I wouldn't like it that way.

Is this a recognized problem? Is it a problem at all? Does it look like it needs to be reported as a bug? I can't decide myself.

Here's the test code:

use strict;
use warnings;
use Benchmark qw(:all);

my $str = 'a' x 100;

my @matchwords = qw(
aol aachen aaliyah aaron abbas abbasid abbott abby abdul abe abel abel
+ard abelson
aberdeen abernathy abidjan abigail abilene abner abraham abram abrams 
+absalom abuja
abyssinia abyssinian ac acadia acapulco accra acevedo achaean
);

my $q1s = join('|', map { "$_\\s*\\w+" } @matchwords );
my $q1  = qr/$q1s/;

my @q2s = map { "$_\\s*\\w+" } @matchwords;
my @q2  = map { qr/$_/ } @q2s;

my $q3s = join('|', map { "($_)\\s*\\w+" } @matchwords );
my $q3  = qr/$q3s/;

my @q4s = map { "($_)\\s*\\w+" } @matchwords;
my @q4  = map { qr/$_/ } @q4s;

timethese( 100000, {
    'alternation, no grouping' => sub {
        $str =~ /$q1/;
    },
    'loop, no grouping' => sub {
        for my $qr ( @q2 ) {
            $str =~ /$qr/;
        }
    },
    'alternation, grouping' => sub {
        $str =~ /$q3/;
    },
    'loop, grouping' => sub {
        for my $qr ( @q4 ) {
            $str =~ /$qr/;
        }
    },
});
[download]

Here's the output:

Benchmark: timing 100000 iterations ...
alternation, grouping: 12 wallclock secs (11.92 usr +  0.00 sys = 11.9
+2 CPU) @ 8389.26/s (n=100000)
alternation, no grouping:  0 wallclock secs ( 0.19 usr +  0.00 sys =  
+0.19 CPU) @ 526315.79/s (n=100000)
            (warning: too few iterations for a reliable count)
loop, grouping:  2 wallclock secs ( 1.33 usr +  0.00 sys =  1.33 CPU) 
+@ 75187.97/s (n=100000)
loop, no grouping:  1 wallclock secs ( 1.33 usr +  0.00 sys =  1.33 CP
+U) @ 75187.97/s (n=100000)
[download]

Update: got same results on perls 5.10.1, 5.16.0, and 5.17.6

Comment on alternation in regexes: to use or to avoid? Select or Download Code

Replies are listed 'Best First'.

Re: alternation in regexes: to use or to avoid?
by Athanasius (Archbishop) on Dec 10, 2012 at 15:07 UTC

Perhaps the following quote from the Camel Book will shed some light on this question:

Short-circuit alternation is often faster than the corresponding regex. So:
print if /one-hump/ || /two/;
[download]
is likely to be faster than:
print if /one-hump|two/;
[download]
at least for certain values of one-hump and two. This is because the optimizer likes to hoist certain simple matching operations up into higher parts of the syntax tree and do very fast matching with a Boyer-Moore algorithm. A complicated pattern tends to defeat this.
— Tom Christiansen, brian d foy & Larry Wall with Jon Orwant, Programming Perl (4^th Edition, 2012), p. 692.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re^2: alternation in regexes: to use or to avoid?

by dk (Chaplain) on Dec 10, 2012 at 15:14 UTC

A complicated pattern tends to defeat this.

and i'm seeing exactly the opposite. I wish Tom would comment on that :) But thank you for the quote, it helps with understanding why I think that the observed behavior is bad.

Re^3: alternation in regexes: to use or to avoid?

by Anonymous Monk on Dec 10, 2012 at 15:22 UTC

Perhaps read "complicated" as "non-trivial", EG: having alternations

Re^4: alternation in regexes: to use or to avoid?

by balker (Novice) on Dec 10, 2012 at 15:29 UTC

Re: alternation in regexes: to use or to avoid?
by dave_the_m (Monsignor) on Dec 10, 2012 at 17:08 UTC

\s*\w+

my $q1s = join('|', @matchwords);
my $q1  = qr/$q1s\s*\w+/;

my $q3s = join('|', @matchwords);
my $q3  = qr/($q3s)\\s*\\w+/;
[download]

alternation, grouping:  1 wallclock secs ( 0.58 usr +  0.00 sys =  0.5
+8 CPU) @ 1724137.93/s (n=1000000)
alternation, no grouping:  5 wallclock secs ( 4.85 usr +  0.00 sys =  
+4.85 CPU) @ 206185.57/s (n=1000000)
loop, grouping: 24 wallclock secs (23.33 usr +  0.00 sys = 23.33 CPU) 
+@ 42863.27/s (n=1000000)
loop, no grouping: 23 wallclock secs (22.57 usr +  0.00 sys = 22.57 CP
+U) @ 44306.60/s (n=1000000)
[download]

Dave.

[reply]
[d/l]
[select]

Re^2: alternation in regexes: to use or to avoid?

by balker (Novice) on Dec 10, 2012 at 18:42 UTC

But that doesn't leave room for the various match-words to have different "\w+"'s - e.g. /\bFOO:\s*bar(\d+)/ or /\bBAZ:\s*(\w+)/

I guess I'm wondering why the trie-optimization isn't used for fixed string prefixes as well (as I'd assumed before I wrote the code).

Re^3: alternation in regexes: to use or to avoid?

by dave_the_m (Monsignor) on Dec 10, 2012 at 19:40 UTC

are

So, alternation is the fastest, as long as you put any captures outside the alt.

Dave.

Re: alternation in regexes: to use or to avoid?
by ww (Archbishop) on Dec 10, 2012 at 15:27 UTC

"Is this a recognized problem? Yes, see Athanasius', above.
"Is it a problem at all? Yes, but not one that's apt to be resolved other than by careful choice among the alternate approaches.
"Does it look like it needs to be reported as a bug? No; see 1 above
Does this "mean that alterations with grouping should be avoided at all...?" Definitely not; sometimes the difference in speed is too small to make any difference; sometimes the clarity of one approach clearly outweighs any other issues; and sometimes other factors, like personal taste, can be allowed to determine. Just be sure to think carefully about which applies.

Re: alternation in regexes: to use or to avoid?
by RichardK (Parson) on Dec 10, 2012 at 15:31 UTC

It's not clear to me what you are trying to achieve with your regex.

The simple grouping that look like this

aol\s*\w+|aachen\s*\w+|aaliyah\s*\w+|.....

runs quickly, it's only the one with lots of capture groups that is slow. i.e

(aol)\s*\w+|(aachen)\s*\w+|(aaliyah)\s*\w+|....

So maybe there's just a better way to get the result you want, if you'd care to explain what that is?

[reply]
[d/l]
[select]

Re^2: alternation in regexes: to use or to avoid?

by balker (Novice) on Dec 10, 2012 at 15:51 UTC

(I work with dk.)

Another question could be: why is the one with the capture groups so slow, since none of the words match the string?

And in general, why is alternation&capture so much slower than looping&capture + alternation combined?

The reason for the code is to replace code with 60 or so similarly structured regexes in a library used by a couple of legacy applications with an automatically generated regex generated with info from configuration files, for both (potential) performance gains, allowing different behaviour across applications, and definite maintainability gains. The strings replaced all have the structure \bFOO:\s*bar(\d+) or \bBAZ:\s*(\w+) etc.

Suggestions like "Well, don't do that" are likely to go unheard :-)

Re^3: alternation in regexes: to use or to avoid?

by RichardK (Parson) on Dec 10, 2012 at 16:08 UTC

OK then, If you want to use a non-optimal solution for operational reasons, go right ahead :-)

Re^2: alternation in regexes: to use or to avoid?

by dk (Chaplain) on Dec 10, 2012 at 16:07 UTC

Re: alternation in regexes: to use or to avoid?
by space_monk (Chaplain) on Dec 10, 2012 at 14:55 UTC

Trying to sound knowledgeable without knowing why or how you got the results you did, I would suspect that this is something that you can't rely on in every version of Perl, and possibly even system to system.

A Monk aims to give answers to those who have none, and to learn from those who know more.

Re^2: alternation in regexes: to use or to avoid?

by dk (Chaplain) on Dec 10, 2012 at 15:03 UTC

Right, forgot that ... I tried it on 5.10, 5.16, and 5.17.6, with almost identical results. As to a system, I really doubt that it matters.

Re: alternation in regexes: to use or to avoid?
by Anonymous Monk on Dec 10, 2012 at 16:44 UTC

The amount of time the regex takes to execute is insignificant in the long run. What matters is how much I/O the program does or doesn't do, and that includes virtual-memory. Process the data in reasonably sized chunks, applying whatever you might know about which test is most likely to succeed first. Make the whole thing easy to maintain. Don't sweat nanoseconds when it's milliseconds that matter.

Re^2: alternation in regexes: to use or to avoid?

by balker (Novice) on Dec 10, 2012 at 18:46 UTC

Thank you, but your statement is contradicting itself. We're seeing milliseconds wasted in the regex (for a web-app!), which is why we bothered to examine why in the first place.

Back to Seekers of Perl Wisdom