A NOT in regular expressions

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: A NOT in regular expressions by halley (Prior) on May 14, 2003 at 02:53 UTC
In your case, don't do a "not", just be less greedy: `$foo =~ s/ \<\% # any left doohickey .*? # minimum span of junk \%\> # first right doohickey //xg;` There is a "negative assertion" which really is a NOT, but it's not as useful here as a simple non-greedy match. Read perlre for more information on the `.*?` non-greedy match, the `(?! )` negative lookahead assertion, and the `/x` modifier for readable patterns. -- `[ e d @ h a l l e y . c c ]`
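To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample string is borrowed from another reply in this thread, since the original question's data is not shown, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Sample input borrowed from another reply in this thread;
# the asker's real data is unknown.
my $text = 'stuff <% junk %> more stuff <% more junk %> end';

# Greedy: .* runs on to the LAST %> in the string,
# so everything between the first <% and the last %> is removed.
(my $greedy = $text) =~ s/<%.*%>//g;

# Non-greedy: .*? stops at the FIRST %> after each <%,
# so each tag is removed separately.
(my $lazy = $text) =~ s/<%.*?%>//g;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy form also eats the "more stuff" between the two tags, which is usually not what is wanted here.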
Re: A NOT in regular expressions (and what else?) by tye (Sage) on May 14, 2003 at 04:15 UTC
If there is nothing else in your regular expression, then you can use the simple solution others have suggested `/<%.*?%>/` but if you have other bits in the regex, then you may need to "unroll the loop" like we did in the bad old regex days before we had non-greedy quantifiers: `m[ <% # Opening delimiter. (?: # Match stuff that isn't a closing delim: [^%]+ # Things that can't start one. | %+[^%>] # Might start one but isn't one. )* # As many non-closing-delims as you like. %> # Closing delimiter. ]x` but it doesn't look like you need to. I just thought you might find the more general case interesting. (: - tye
Re^2: A NOT in regular expressions (why [^%>]?) by tye (Sage) on May 14, 2003 at 08:22 UTC
Someone asked privately if the % in `[^%>]` was required. That is a good question, so I decided to answer it in public. Without that %, we get: `m[ <% # Opening delimiter. (?: # Match stuff that isn't a closing delim: [^%]+ # Things that can't start one. | %+[^>] # Might start one but isn't one. )* # As many non-closing-delims as you like. %> # Closing delimiter. ]x` which will match as follows: `"<% %%> %>" "<%" matches <% " " matches [^%]+ so (?: ... )* has matched once "%%" matches %+ ">" fails on [^>] so we back-track "%" now matches %+ "%" matches [^>] so (?: ... )* has matched twice "> " matches [^%]+ so (?: ... )* has matched 3 times "%>" matches %> so regex finishes` so we've matched the whole string when we should have only matched the first part, `"<% %%>"`. By leaving the % out of `[^%>]`, we've allowed the regex to back-track and match the first character of our delimiter (%) as the tail end of `%+[^>]`. But I now realize that my regex is also broken because it will never match `"<% %%>"` at all. I'm tempted to fix it with: `m[ <% # Opening delimiter. (?: # Match stuff that isn't a closing delim: [^%]+ # Things that can't start one. | %+[^%>] # Might start one but isn't one. )* # As many non-closing-delims as you like. %* # PUNT! %> # Closing delimiter. ]x` but that seems wrong. Think... Bah, I'm hours late for bed already. Serves me right for "showing off" "unrolling the loop" when I've seen so many really good regex slingers get this wrong more than once. (: Unlike the last time I saw this happen, these nodes will not be updated to hide the mistakes I've made (that last time the updates were flying really fast and I was extremely frustrated by not being able to learn from the repeated mistakes). - tye
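The three variants discussed in this reply can be checked side by side with a small script. This is only a sketch: the hash keys and the `first_match` helper are made-up names for illustration, and the patterns are the ones from the post with the `/x` comments stripped.

```perl
use strict;
use warnings;

# The three variants from the post, written without /x comments.
my %pattern = (
    no_percent => qr/<%(?:[^%]+|%+[^>])*%>/,     # % left out of the class
    original   => qr/<%(?:[^%]+|%+[^%>])*%>/,    # as first posted
    punt       => qr/<%(?:[^%]+|%+[^%>])*%*%>/,  # with the extra %* "punt"
);

# Hypothetical helper: return the matched text, or undef on failure.
sub first_match {
    my ($re, $str) = @_;
    return $str =~ /($re)/ ? $1 : undef;
}

for my $input ('<% %%> %>', '<% %%>') {
    for my $name (qw(no_percent original punt)) {
        my $got = first_match($pattern{$name}, $input);
        printf "%-10s on %-12s => %s\n",
            $name, "'$input'", defined $got ? "'$got'" : 'no match';
    }
}
```

Run against the two strings from the post, this should show the first variant overrunning the first closer, the original failing on `<% %%>`, and the "punt" version stopping at the first `%>`.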
Re: Re^2: A NOT in regular expressions (why [^%>]?) by jryan (Vicar) on May 14, 2003 at 08:35 UTC
The simplest solution is to just remove the quantifier, and add a negative lookahead. Then the regex becomes: `m[ <% # Opening delimiter. (?: [^%]+ # Things that can't start one. | % (?!>) # % that aren't part of a closer )* # As many non-closing-delims as you like. %> # Closing delimiter. ]x` Backtracking no longer needed. (-:
Re: Re^2: A NOT in regular expressions (why [^%>]?) by tilly (Archbishop) on May 14, 2003 at 08:36 UTC
Why not this? `m[ <% # Opening delimiter. (?: # Match stuff that isn't a closing delim: [^%] # Can't start one. | %(?!>) # Might start one but isn't one. )* # As many non-closing-delims as you like. %> # Closing delimiter. ]x`
Re: Re^2: A NOT in regular expressions (why [^%>]?) by bart (Canon) on May 14, 2003 at 09:11 UTC
Even if you got that regexp right from the start (you didn't, see the subthread with Abigail for why), you've been using something that's considered a big no-no: (excerpt) `(?: # Match stuff that isn't a closing delim: [^%]+ # Things that can't start one. | %+[^>] # Might start one but isn't one. )` You're using the dreaded `/(A+|B+)*/`, a star on top of a plus. That's considered very bad for regexps, because if the pattern fails for some reason, you'll get lots of unnecessary backtracking. Jeffrey Friedl also discusses this in his book "Mastering Regular Expressions", Chapter 5 p.144 in the 1st edition (which is all I have) under the subtitle "Reality Check". For it to behave properly, you should lose the pluses.
Re^4: A NOT in regular expressions (no new features?) by tye (Sage) on May 14, 2003 at 18:09 UTC
Re: Re^4: A NOT in regular expressions (no new features?) by tilly (Archbishop) on May 14, 2003 at 18:40 UTC
Re^4: A NOT in regular expressions (why [^%>]?) by Aristotle (Chancellor) on May 14, 2003 at 11:04 UTC
Re: Re^4: A NOT in regular expressions (why [^%>]?) by BrowserUk (Patriarch) on May 14, 2003 at 11:28 UTC
Re: Re: Re^2: A NOT in regular expressions (why [^%>]?) by tilly (Archbishop) on May 14, 2003 at 15:07 UTC
Re: Re^2: A NOT in regular expressions (why [^%>]?) by jryan (Vicar) on Oct 20, 2003 at 21:46 UTC
I was looking for an old node of mine, and I came across this node again. I was bored, so I decided to work it out. First, the flow: http://jryan.perlmonk.org/images/uloop.gif A green node in this case means "that char", and a red node means "anything but that char". Green lines mean "yes", red lines mean "no". So, we can directly translate that into the code: `m[ < % # node 0 -> node 1 -> ( # node 0 -> node 1 -> node 2 (?) -> % # node 0 -> node 1 -> node 2 (yes) -> ( # node 0 -> node 1 -> node 2 (yes) -> node 4 (?) -> [^>] # node 0 -> node 1 -> node 2 (yes) -> node 4 (no) -> | [^%] # node 0 -> node 1 -> node 2 (yes) -> node 4 (yes) +-> # node 3 (?) -> ) | [^%] # node 0 -> node 1 -> node 2 (no) -> node 3 (?) -> )* # (%)+ # node 5 (?) > # node 5 (no) -> node 6 ]x` And, Perl lets us condense that into: `m[ < % (?: [^%]+ | % [^%>] )* %* # I left it as %* so the "insides" can be easily # grouped & captured % > ]x` So, your quick hack of a fix turns out to be the proper solution after all :)
Re^4: A NOT in regular expressions (thanks) by tye (Sage) on Oct 23, 2003 at 18:00 UTC
Re: Re^4: A NOT in regular expressions (thanks) by idsfa (Vicar) on Nov 02, 2003 at 00:04 UTC
Re: Re: A NOT in regular expressions (and what else?) by Anonymous Monk on May 14, 2003 at 07:46 UTC
There will most likely be % and > in the resulting string, but there won't be a %> together in the string. Thanks, Bruce
Re^3: A NOT in regular expressions (and what else?) by tye (Sage) on May 14, 2003 at 07:54 UTC
Either solution will handle that. The simple solution will often not work if it is part of a larger regex. For example, `/<%(.*?)%> home <%(.*?)%>/ =~ "Go <% now %> to <% your %> home <% if %> you <% can %>" # ^^(^^^^^^^^^^^^^^^^^)^^^^^^^^^^(^^)^^` will match as shown on that third line. Note that the first part matches too much. This is because a non-greedy part prefers to match sooner at the expense of being "more greedy" than to fail to match or to match later. - tye
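The over-matching described here is easy to see by printing the captures. The sketch below runs the non-greedy pattern from the post and, for comparison, a variant whose body cannot cross a `%>` at all. That variant uses the `%(?!>)` lookahead idea suggested elsewhere in this thread; the `$body` name is just an illustrative choice.

```perl
use strict;
use warnings;

my $str = 'Go <% now %> to <% your %> home <% if %> you <% can %>';

# Non-greedy version from the post: the first group is free to
# run across a %> in order to let the match start as early as possible.
my @lazy = $str =~ /<%(.*?)%> home <%(.*?)%>/;

# Stricter body: any character that is not a %, or a % that is
# not immediately followed by >. It can never contain a closer.
my $body   = qr/(?:[^%]|%(?!>))*/;
my @strict = $str =~ /<%($body)%> home <%($body)%>/;

print "lazy:   [$lazy[0]] [$lazy[1]]\n";
print "strict: [$strict[0]] [$strict[1]]\n";
```

With the stricter body the match has to start at the second tag, so the first capture holds only that tag's contents.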
Re: A NOT in regular expressions by rruiz (Monk) on May 14, 2003 at 02:56 UTC
You didn't post your regular expression for us to take a look at, but I think that you need to use the non-greedy operator '?' in your regex. Like: `#!/usr/bin/perl -wT use strict; my $s = 'stuff <% junk %> more stuff <% more junk %> end'; print "$s\n"; $s =~ s/<%.*?%>//g; print "$s\n";` which makes the regex take the shortest match. HTH God bless you rruiz
Re: A NOT in regular expressions by pzbagel (Chaplain) on May 14, 2003 at 03:02 UTC
That depends, is there any chance that % will appear in between <% and %>? If not, then something as simple as this should work: `s/<%[^%]%>>//g;` However, I must warn you, because you have a 2 character delimiter (<% and %>), if you have input like: `stuff <% junk% %> more stuff` then the substitution will fail.
Re: Re: A NOT in regular expressions by Monky Python (Scribe) on May 14, 2003 at 07:55 UTC
Hi, I think you forgot the +. `s/<%[^%]+%>//g;` MP
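A quick check of the corrected pattern, as a sketch: it removes a tag whose body contains no `%`, and, as warned in the parent reply, leaves the input untouched once a stray `%` appears inside the tag. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $clean = 'stuff <% junk %> more stuff';
my $stray = 'stuff <% junk% %> more stuff';   # a lone % inside the tag

# [^%]+ cannot step over the stray %, so the second
# substitution finds nothing to replace.
(my $clean_out = $clean) =~ s/<%[^%]+%>//g;
(my $stray_out = $stray) =~ s/<%[^%]+%>//g;

print "clean: [$clean_out]\n";
print "stray: [$stray_out]\n";
```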
Re: A NOT in regular expressions by tilly (Archbishop) on May 14, 2003 at 07:25 UTC
And your reason for not looking at the great variety of already built templating mechanisms in Perl is...?
Re: A NOT in regular expressions by zby (Vicar) on May 14, 2003 at 08:57 UTC
Being lazy I would suggest using Regexp::Common::balanced. Taking into account all the problems with writing a regexp for balanced delimiters I wonder how it can work in the general case, but I hope it works. Update: It's not really equivalent to the other proposed solutions. But it has clear semantics that should be applicable in many cases. Here is an example: use Regexp::Common qw /balanced/; $r = qr/$RE{balanced}{-begin => "<%"}{-end => "%>"}/; @list = ("aa<% <% %>bb", "aa<% %> <% %>bb", "aa<% <% %> %>bb", "aa<% >% %>bb"); for (@list){ /($r)/; print "$` \t| $1 \t| $'\n"; } and the output: `aa<% | <% %> | bb aa | <% %> | <% %>bb aa | <% <% %> %> | bb aa | <% >% %> | bb`
Re: A NOT in regular expressions by BrowserUk (Patriarch) on May 14, 2003 at 09:11 UTC
No guarantees that it will catch every situation, especially as the acknowledged experts are tripping up, but this is my attempt. It seems to work for most cases I can think of and seems somewhat simpler than some of the others. `$_= '1 <% xxx%%> 2 <%%> 3 <%>%> 4 <% >% %> 5 <%%%% xxx %%%%> 6 '; s[ <% .*? (?> %> ) ][!REPLACED!]xg; print;` which prints: `1 !REPLACED! 2 !REPLACED! 3 !REPLACED! 4 !REPLACED! 5 !REPLACED! 6` Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: A NOT in regular expressions by Anonymous Monk on May 14, 2003 at 10:24 UTC
Thanks for your help, got it working perfectly. Bruce
Re: A NOT in regular expressions by Abigail-II (Bishop) on May 14, 2003 at 08:26 UTC
`s/<%(?:[^%]+|%[^>])*%>/<%$new_stuff%>/;` Abigail
Re: Re: A NOT in regular expressions by tilly (Archbishop) on May 14, 2003 at 08:34 UTC
You get the string `"<%replace here%%>but not here%>"` wrong.
Re: A NOT in regular expressions by Abigail-II (Bishop) on May 14, 2003 at 08:49 UTC
You are quite right. I should have written: `s/<%(?:[^%]+|%(?!>))*%>/<%$new%>/` Abigail
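Both versions can be run against the counterexample string from the reply above to see the difference. This is a sketch only; the replacement text `NEW` is a placeholder, not something from the original posts.

```perl
use strict;
use warnings;

# Counterexample string quoted in the reply above.
my $input = '<%replace here%%>but not here%>';
my $new   = 'NEW';   # placeholder replacement text (an assumption)

# First attempt: %[^>] happily consumes "%%", which hides the
# real closer and lets the match run on to the final %>.
(my $first = $input) =~ s/<%(?:[^%]+|%[^>])*%>/<%$new%>/;

# Corrected: %(?!>) consumes a single % only when the next
# character is not >, so the first %> always ends the match.
(my $second = $input) =~ s/<%(?:[^%]+|%(?!>))*%>/<%$new%>/;

print "first:  $first\n";
print "second: $second\n";
```

The lookahead consumes nothing, so each `%` is examined on its own, and that is what keeps `%%>` from being misread.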
Re: Re: A NOT in regular expressions by tilly (Archbishop) on May 14, 2003 at 14:51 UTC
Re: A NOT in regular expressions by Abigail-II (Bishop) on May 14, 2003 at 15:03 UTC