Re^2: A NOT in regular expressions (why [^%>]?)

Someone asked privately if the % in [^%>] was required. That is a good question, so I decided to answer it in public.

Without that %, we get:

    m[
      <%        # Opening delimiter.
      (?:       # Match stuff that isn't a closing delim:
        [^%]+     # Things that can't start one.
      | %+[^>]    # Might start one but isn't one.
      )*        # As many non-closing-delims as you like.
      %>        # Closing delimiter.
    ]x
[download]

which will match as follows:

    "<% %%> %>"
    "<%" matches <%
      " " matches [^%]+ so (?: ... )* has matched once
       "%%" matches %+
         ">" fails on [^>] so we back-track
       "%" now matches %+
        "%" matches [^>] so (?: ... )* has matched twice
         "> " matches [^%]+ so (?: ... )* has matched 3 times
           "%>" matches %> so regex finishes
[download]

so we've matched the whole string when we should have only matched the first part, "<% %%>".

By leaving the % out of [^%>], we've allowed the regex to back-track and match the first character of our delimiter (%) as the tail end of %+[^>].

But I now realize that my regex is also broken because it will never match:

    "<% %%>"
[download]

at all. I'm tempted to fix it with:

    m[
      <%        # Opening delimiter.
      (?:       # Match stuff that isn't a closing delim:
        [^%]+     # Things that can't start one.
      | %+[^%>]   # Might start one but isn't one.
      )*        # As many non-closing-delims as you like.
      %*        # PUNT!
      %>        # Closing delimiter.
    ]x
[download]

but that seems wrong. Think...

Bah, I'm hours later for bed already. Serves my right for "showing off" "unrolling the loop" when I've seen *so many* *really good* regex slingers get this wrong more than once. (:

Unlike the last time I saw this happen, these nodes will *not* be updated to hide the mistakes I've made (that last time the updates were flying really fast and I was extremely frustrated by not being able to learn from the repeated mistakes).

- tye

Comment on Re^2: A NOT in regular expressions (why [^%>]?) Select or Download Code

Replies are listed 'Best First'.
Re: Re^2: A NOT in regular expressions (why [^%>]?) by jryan (Vicar) on May 14, 2003 at 08:35 UTC
The simplest solution is to just remove the quantifier, and add a negative lookahead. Then the regex becomes: `m[ <% # Opening delimiter. (?: [^%]+ # Things that can't start one. \| % (?!>) # % that aren't part of a closer )* # As many non-closing-delims as you like. %> # Closing delimiter. ]x` [download] Backtracking no longer needed. (-:	[reply] [d/l]
Re: Re^2: A NOT in regular expressions (why [^%>]?) by tilly (Archbishop) on May 14, 2003 at 08:36 UTC
Why not this? `m[ <% # Opening delimiter. (?: # Match stuff that isn't a closing delim: [^%] # Can't start one. \| %(?!>) # Might start one but isn't one. )* # As many non-closing-delims as you like. %> # Closing delimiter. ]x` [download]	[reply] [d/l]
Re: Re^2: A NOT in regular expressions (why [^%>]?) by bart (Canon) on May 14, 2003 at 09:11 UTC
Even if you got that regexp right from the start (you didn't, see the subthread with Abigail for why), you've been using that's considered a big no-no: (excerpt) `(?: # Match stuff that isn't a closing delim: [^%]+ # Things that can't start one. \| %+[^>] # Might start one but isn't one. )` [download] You're using the dreaded `/(A+\|B+)/`, a star on top of a plus. That's considered very bad for regexps, because if the pattern fails for some reason, you'll get lots of unnecessary backtracking. Jeffrey Friedl also discusses this in his book "Mastering Regular Expressions", Chapter 5 p.144 in the 1st edition (which is all I have) under the subtitle "Reality Check". For it to behave properly, you should loose the plusses.	[reply] [d/l] [select]
Re^4: A NOT in regular expressions (no new features?) by tye (Sage) on May 14, 2003 at 18:09 UTC
Thanks for all the replies, everyone. I had considered look-aheads but didn't want to go there because there is supposed to be a way to do this that works well and doesn't require these new regex features. How would you get Perl 4 or sed to match C comments? I've read Mastering Regular Expressions but not recently and no longer have a copy. I realize that the author doesn't like nesting of quantifiers like that. But avoiding the quantifiers also means that the outer construct has to match more times. Since Perl stops after that happens 32k times, the extra quantifier can make a real improvement. I prefer to prevent rampant back-tracking on failure by making the regex very explicit so that back-tracking won't happen. If I can't do that, then I consider solving the problem in smaller pieces. - tye	[reply]
Re: Re^4: A NOT in regular expressions (no new features?) by tilly (Archbishop) on May 14, 2003 at 18:40 UTC
Alternate solutions. The simplest is to use a minimal match: `m/<%.?%>/s`. However you can get by with basic features with something like: `m/<%[^%](%+[^%>][^%])%+>/`. That should, possibly with appropriate syntax adjustments, work in any RE engine worthy of the name.	[reply] [d/l] [select]
Re^4: A NOT in regular expressions (why [^%>]?) by Aristotle (Chancellor) on May 14, 2003 at 11:04 UTC
Making the `(?: )` into a `(?> )` should also fix that without removing the pluses, no? Makeshifts last the longest.	[reply]
Re: Re^4: A NOT in regular expressions (why [^%>]?) by BrowserUk (Patriarch) on May 14, 2003 at 11:28 UTC
Um, no. `$s = '1 <% xxx%%> 2 <%%> 3 <%>%> 4 <% >% %> 5 <%%%% xxx %%%%> 6 '; $s =~ s/<% (?> [^%]+ \| %+ [^>]+ )* %>/!REPLACED!/xg; print $s; 1 !REPLACED!%> 4 <% >% %> 5 <%%%% xxx %%%%> 6` [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller	[reply] [d/l]
Re: Re: Re^2: A NOT in regular expressions (why [^%>]?) by tilly (Archbishop) on May 14, 2003 at 15:07 UTC
Perl is smarter than you think. It has a hack that mostly fixes this problem (and does in the above case).	[reply]
Re: Re^2: A NOT in regular expressions (why [^%>]?) by jryan (Vicar) on Oct 20, 2003 at 21:46 UTC
I was looking for an old node of mine, and I came across this node again. I was bored, so I decided to work it out. First, the flow: http://jryan.perlmonk.org/images/uloop.gif A green node in this case means "that char", and a red node means "anything but that char". Green lines mean "yes", Red lines mean "no." So, we can directly translate that into the code: `m[ < % # node 0 -> node 1 -> ( # node 0 -> node 1 -> node 2 (?) -> % # node 0 -> node 1 -> node 2 (yes) -> ( # node 0 -> node 1 -> node 2 (yes) -> node 4 (?) -> [^>] # node 0 -> node 1 -> node 2 (yes) -> node 4 (no) - +> \| [^%] # node 0 -> node 1 -> node 2 (yes) -> node 4 (yes) +-> # node 3 (?) -> ) \| [^%] # node 0 -> node 1 -> node 2 (no) -> node 3 (?) -> )* # (%)+ # node 5 (?) > # node 5 (no) -> node 6 ]x` [download] And, Perl lets us condense that into: `m[ < % (?: [^%]+ \| % [^%>] )* %* # I left it as %* so the "insides" can be easily # grouped & captured % > ]x` [download] So, your quick hack of a fix turns out to be the proper solution after all :)	[reply] [d/l] [select]
Re^4: A NOT in regular expressions (thanks) by tye (Sage) on Oct 23, 2003 at 18:00 UTC
The first one fails by not stopping soon enough for `"<% %%> %>"`. The second fails by not matching `"<% %% %>"`. Thanks for the thoughts. - tye	[reply] [d/l] [select]
Re: Re^4: A NOT in regular expressions (thanks) by idsfa (Vicar) on Nov 02, 2003 at 00:04 UTC
Cribbing from Mastering Regular Expressions' section on removing C-style quotes ... `qr{ <% [^%]* %+ ( [^>%] [^%]* %+ )* > }x;` [download] appears to properly handle every example in this thread, as well as not taking forever on a failed match. Read more... (1162 Bytes) Remember, when you stare long into the abyss, you could have been home eating ice cream.	[reply] [d/l] [select]