in reply to This regex seems to have splattered non-greedy everywhere

The core of the first regexp looks like this:

( [^']* (stuff)? )*

Because 'stuff' is optional, this becomes an exponential search: /( [^']* )*/x can match n characters 2n ways (as long as none is a quote character).*

The alternate regexp looks like this:

[^']* ( stuff [^']* )*
and now, because 'stuff' is not optional, the search is no longer exponential.

This is the cause of the slowdown with the first version in your non-perl code. The same slowdown doesn't happen in perl because the regexp engine has some special tricksy handling for this situation, and if it sees something looking exponential it turns on some caching behaviour to avoid unnecessary repetition. (If you use re qw/debug/; you'll see this:

Detected a super-linear match, switching on caching...
in the output when it happens.)

While I am not currently aware of bugs in that area, I kinda expect that there are some, and I suspect that's what is biting you here; the fact that the second regexp does not suffer the same problem is strong evidence for this. (I can confirm that the behaviour persists in the latest bleadperl.)

I'd suggest a) sending a perlbug about it, and b) trying real hard to remove exponential behaviour from regexps when you can.

Hugo

*Update: to be more strictly correct, /( [^']+ )*/x can match 2n ways, while /( [^']* )*/x can match infintely many ways. But the regexp engine automatically suppresses irrelevant multiple zero-length matches, so the effect is much the same.

Replies are listed 'Best First'.
Re^2: This regex seems to have splattered non-greedy everywhere
by fizbin (Chaplain) on Aug 11, 2005 at 13:31 UTC
    b) trying real hard to remove exponential behaviour from regexps when you can.
    Yes, well, there's the problem of recognizing this before it bites me in the ass in production. I mean, I guess this particular exponential behavior might be something I'll recognize in the future, but is there some automated tool that can recognize when regular expressions have the potential for exponential behavior? (Or is this one of those problems that get as hard as the decideability problem?)
    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

      The general question is: given a regexp fragment like

      ( prefix (meat)* suffix )*
      is is possible for the prefix and suffix to be simultaneously (and repeatedly) empty? If so, then anything that matches /(meat){n}/ will be able to match at least 2^n different ways.

      I'm sure that at the least a subset of exponential regexps should be discernible. I imagine the first stage of such an approach would be to iteratively strip out anything that can be zero-width, in which case it would be rather harder to detect that eg:

      /( (foo)* (?=b) (bar)? )* baz )/x
      is linear. Ask me again when Perl6 has a full grammar implementation - it might be an interesting project to write such a detector targetting a pluggable regexp grammar.

      Note that one generic approach to fixing such regexps is to use the 'cut' operator (?> ...), but as far as I am aware that operator is available only in perl itself and the ever-faithful PCRE. For example:

      /( (?> [^']* ) ('[^']*')? )* /x
      avoids the exponential behaviour by disallowing backtracking into the [^']*.

      Hugo