in reply to Re^2: A NOT in regular expressions (why [^%>]?)
in thread A NOT in regular expressions

Even if you had gotten that regexp right from the start (you didn't; see the subthread with Abigail for why), you've been using something that's considered a big no-no: (excerpt)
    (?:                 # Match stuff that isn't a closing delim:
        [^%]+           # Things that can't start one.
      | %+[^>]          # Might start one but isn't one.
    )*
You're using the dreaded /(A+|B+)*/, a star on top of a plus. That's considered very bad for regexps, because if the pattern fails for some reason, you'll get lots of unnecessary backtracking. Jeffrey Friedl also discusses this in his book "Mastering Regular Expressions", Chapter 5 p.144 in the 1st edition (which is all I have) under the subtitle "Reality Check".

For it to behave properly, you should lose the pluses.
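As a rough sketch of what dropping the pluses looks like, the snippet below puts the nested form from the excerpt next to a flattened one. The sample string is made up for illustration and is not from the thread. This only addresses the backtracking concern; it does not fix the correctness problem discussed in the subthread with Abigail.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the thread.
my $text = 'a <% foo % bar %> b';

# Nested form from the excerpt: a star on top of pluses.
my $nested = qr/<%(?:[^%]+|%+[^>])*%>/;

# Same alternation with the inner pluses dropped, so each pass
# of the star consumes one or two characters and nothing more.
my $flat = qr/<%(?:[^%]|%[^>])*%>/;

for my $re ($nested, $flat) {
    my ($match) = $text =~ /($re)/;
    print defined $match ? "matched: $match\n" : "no match\n";
}
```

On this input both patterns match the same text, `<% foo % bar %>`. The difference only shows up in how much work the engine does when the closing delimiter is missing.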

Re^4: A NOT in regular expressions (no new features?)
by tye (Sage) on May 14, 2003 at 18:09 UTC

    Thanks for all the replies, everyone. I had considered look-aheads but didn't want to go there because there is supposed to be a way to do this that works well and doesn't require these new regex features.

    How would you get Perl 4 or sed to match C comments? I've read Mastering Regular Expressions but not recently and no longer have a copy.

    I realize that the author doesn't like nesting of quantifiers like that. But avoiding the quantifiers also means that the outer construct has to match more times. Since Perl stops after that happens 32k times, the extra quantifier can make a real improvement. I prefer to prevent rampant back-tracking on failure by making the regex very explicit so that back-tracking won't happen. If I can't do that, then I consider solving the problem in smaller pieces.

                    - tye
      Alternate solutions. The simplest is to use a minimal match: m/<%.*?%>/s. However you can get by with basic features with something like: m/<%[^%]*(%+[^%>][^%]*)*%+>/. That should, possibly with appropriate syntax adjustments, work in any RE engine worthy of the name.
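A quick way to compare the two patterns just mentioned is to run both over the test string used elsewhere in this thread. This is only a sketch: the `X` replacement text and the variable names are arbitrary choices for illustration.

```perl
use strict;
use warnings;

# Test string borrowed from the atomic-group example in this thread.
my $s = '1 <% xxx%%> 2 <%%> 3 <%>%> 4 <% >% %> 5 <%%%% xxx %%%%> 6 ';

# Minimal match: .*? stops at the first %> it can reach.
(my $lazy = $s) =~ s/<%.*?%>/X/sg;

# Basic-features version: non-% text, then runs of % that are not
# followed by >, and finally a run of % that is followed by >.
(my $unrolled = $s) =~ s/<%[^%]*(%+[^%>][^%]*)*%+>/X/g;

print "$lazy\n";
print "$unrolled\n";
```

Both substitutions replace all five delimited sections, so each line printed reads `1 X 2 X 3 X 4 X 5 X 6`.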
Re^4: A NOT in regular expressions (why [^%>]?)
by Aristotle (Chancellor) on May 14, 2003 at 11:04 UTC
    Making the (?: ) into a (?> ) should also fix that without removing the pluses, no?

    Makeshifts last the longest.

      Um, no.

      $s = '1 <% xxx%%> 2 <%%> 3 <%>%> 4 <% >% %> 5 <%%%% xxx %%%%> 6 ';
      $s =~ s/<% (?> [^%]+ | %+ [^>]+ )* %>/!REPLACED!/xg;
      print $s;

      1 !REPLACED!%> 4 <% >% %> 5 <%%%% xxx %%%%> 6

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: Re: Re^2: A NOT in regular expressions (why [^%>]?)
by tilly (Archbishop) on May 14, 2003 at 15:07 UTC
    Perl is smarter than you think. It has a hack that mostly fixes this problem (and does in the above case).