in reply to Regex'ing backreferences

I think the problem here is to say that you do NOT want bar in $1. Because if you wanted it, it may be easier :-)

Something like this should hopefully do want you want:

/foo(?!.*?bar.*?)(.*?)foo/;
You first try to find 'foo'. OK. Then you check if the substring between 'foo' and the other 'foo' contains 'bar', but with a zero-width assertion. If that does not contain 'bar', then you can save it into $1.

Thank you :-) thanks to your question I have understood why zero-witdh assertion is really cool!!

Hope this helps. Bye

James

Update: and what about doing:

/foo(?!.*?bar.*?foo)(.*?)foo/;
I think it is better, unless you show me another "worst case" :-))

Replies are listed 'Best First'.
RE (tilly) 2: Regex'ing backreferences
by tilly (Archbishop) on Sep 21, 2000 at 14:37 UTC
    You unfortunately will disallow strings with bar *after* the second foo. What you want is this:
    /foo((?:(?!bar).)*)foo/
    Note that (?:stuff) is a non-grouping set of parens, and (?!stuff) is a negative assertion.

    UPDATE
    I noticed a while ago that my RE was incorrect, decided to leave it for someone else to find and mention. But nobody did so I will mention that I forgot a ?:

    /foo((?:(?!bar).)*?)foo/
      While this method works, it is a bit slow, since it checks for the existence of 'bar' EVERY character. Use the "unrolling the loop" technique, as found in "Mastering Regular Expressions":

      Update: tilly's right, mine fails to match on a specific type of string ("foobabfoo"). Fixed below.
      m{ foo ( # save to $1 [^b]* # 0 or more non-b characters (?: b+ # 1 or more b's (?: a(?!r) # an 'a' NOT followed by an 'r' | # or [^a] # a non-a character ) [^b]* # 0 or more non-b characters )* # that whole b... subset, 0 or more times ) foo # and foo }x;
      The general pattern behind unrolling the loop is to get your regex to look something like OK* ( NOTOK OK* )*.

      $_="goto+F.print+chop;\n=yhpaj";F1:eval
        Unrolling the loop is faster, but it is also more prone to mistakes. My version is the simplest that is legible yet does the specified job. My personal preference would be to wait to unroll it until after I find that I truly need to do so.

        I rarely do. Unlike lookaheads with wildcards in them, the underlying implementation of mine is worst case off by a constant factor from the best possible. (The worst case, of course, being things like "foobababababafoo".) Furthermore my hope is that someday (not today of course) Perl's RE engine won't need such mechanical optimizations since it will do them on the fly.

        And a question for you. While you unrolled it correctly, how many people here do you think would have got it right on their first or second try? (Or ever - most in my bet would have stopped before getting it right.) If your goal is correct code (which mine is) then be *very* careful before going through optimizations like this.

        UPDATE
        You did not, in fact, unroll it correctly. The string I gave above of a "worst case" in fact should match and does not match your regex!

        Your current regex is:
        m{ foo ( # save to $1 [^b]* # 0 or more non-b characters (?: b+ # 1 or more b's (?: a(?!r) # an 'a' NOT followed by an 'r' | # or [^a] # a non-a character ) [^b]* # 0 or more non-b characters )* # that whole b... subset, 0 or more times ) foo # and foo }x;
        That incorrectly matches "foobbarfoo".

        Again, optimization is the root of all evil. :-)

RE (tilly) 2 (still wrong): Regex'ing backreferences
by tilly (Archbishop) on Sep 21, 2000 at 18:27 UTC
    What about "foohellofoobarfoo"? You should be matching and putting "hello" into $1, but your new version won't.