Melly has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monkees

I got a very useful regex in answer to a question yesterday, but I'm confused as to why it's not a greedy regex.

Consider the following (similar) regexes - the first one is the one that's puzzling me. The second is greedy as I'd expect; the third is non-greedy as I'd expect, but I don't understand why the first one is non-greedy as well.

$_ = 'fooxbarbar';

# Not greedy - why not?
if(/(foo)(((?!bar).){1,5})bar/){
    print "1 Matched:$1 $2\n";
}
else{
    print "1 Not Matched\n";
}

# Greedy - as we'd expect
if(/(foo)(?!bar)((.){1,5})bar/){
    print "2 Matched:$1 $2\n";
}
else{
    print "2 Not Matched\n";
}

# Not Greedy - as we'd expect
if(/(foo)(?!bar)((.){1,5}?)bar/){
    print "3 Matched:$1 $2\n";
}
else{
    print "3 Not Matched\n";
}

Any explanation most appreciated...

Tom Melly, tom@tomandlu.co.uk

Replies are listed 'Best First'.
Re: Why isn't this regex greedy?
by demerphq (Chancellor) on Mar 17, 2006 at 10:02 UTC

    The ((?!bar).){1,5} says: match a run of up to 5 characters, so long as none of those characters is the start of a "bar" string. It will match as many of those characters (up to the max) as it can, which means it is greedy. The zero-width assertion prevents the greediness from overrunning. It's pretty well the same thing as:

    /(foo)(.{1,5}?)bar/

    Except I'd expect the latter to be more efficient.
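
A minimal sketch of the point above, assuming Perl 5 and a made-up input 'fooxyzbar' that is not in the original post. Dropping the trailing bar from both patterns makes the difference visible: the lookahead-guarded quantifier still takes as much as it can, while the lazy one takes as little as it can. The helper names are hypothetical, and the inner group is written as non-capturing (?:...) only to keep the capture numbering simple.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns the text captured after "foo", or undef.
sub guarded      { my ($s) = @_; return $s =~ /foo((?:(?!bar).){1,5})/    ? $1 : undef }
sub lazy         { my ($s) = @_; return $s =~ /foo(.{1,5}?)/              ? $1 : undef }
sub guarded_tail { my ($s) = @_; return $s =~ /foo((?:(?!bar).){1,5})bar/ ? $1 : undef }
sub lazy_tail    { my ($s) = @_; return $s =~ /foo(.{1,5}?)bar/           ? $1 : undef }

my $s = 'fooxyzbar';    # made-up input, not from the thread

# Without a trailing "bar" the quantifiers show their true nature.
print "guarded, no tail: ", guarded($s), "\n";    # greedy: runs up to the "b" of "bar"
print "lazy, no tail:    ", lazy($s), "\n";       # lazy: stops after one character

# With the trailing "bar" both are forced to the same capture.
print "guarded + bar:    ", guarded_tail($s), "\n";
print "lazy + bar:       ", lazy_tail($s), "\n";
```

With the trailing bar restored, both patterns end up capturing the same text for this input, which is why the guarded version can look non-greedy at first glance.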

    ---
    $world=~s/war/peace/g

      It's pretty well the same thing as
      Well, you might want to point out how it's different. Laziness is a tendency, not a mandate. Adding a trailing "Q" to both regexes shows the difference:
      "fooXbarYbarQ" =~ /(foo)(.{1,5}?)barQ/ # match entire string
      will match, skipping over the first bar because it's not followed by Q. However, the previous regex, followed by a Q will fail:
      "fooXbarYbarQ" =~ /(foo)(((?!bar).){1,5})barQ/ # won't match
      because it can't "skip over" the first bar to get to the second one.

      So, while lazy is good, it's not the only game in town, and you have to consider the rest of the regex before you know you can get away with lazy instead of inchworm.
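
The two "Q" cases above can be run side by side. This is only a sketch, assuming Perl 5; the variable names are made up, and the capture groups are simplified (non-capturing inner group, no group around foo) compared with the patterns in the posts.

```perl
use strict;
use warnings;

my $s = 'fooXbarYbarQ';

# Lazy: keeps expanding one character at a time, walking past the
# first "bar" until "barQ" finally matches.
my ($lazy) = $s =~ /foo(.{1,5}?)barQ/;

# Guarded ("inchworm"): can never consume the "b" that starts the first
# "bar", so it cannot reach the "barQ" further along.
my ($guarded) = $s =~ /foo((?:(?!bar).){1,5})barQ/;

print "lazy:    ", (defined $lazy    ? $lazy    : '(no match)'), "\n";
print "guarded: ", (defined $guarded ? $guarded : '(no match)'), "\n";
```

The lazy capture here includes the first bar itself, which is exactly the "skipping over" that the guarded form rules out.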

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

      Aha!

      I thought it must be something like that - what I was uncertain about was what the neg-lookahead applied to. It would seem it applies to the .{1,5} rather than the 'foo'.

      Slightly non-intuitive IMHO, but I've got it now...

      Tom Melly, tom@tomandlu.co.uk

        "Lookahead" implies "characters following"; if it were "lookbehind" it would be "characters preceding". "Negative lookahead" means that the following characters cannot match. "Negative lookbehind" would mean that the preceding characters cannot match.

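A small illustration of the two directions, assuming Perl 5. The subject strings 'foobaz' and 'bazbar' and the helper hits are made up for this sketch and do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the pattern matches the subject, else 0.
sub hits { my ($subject, $re) = @_; return $subject =~ $re ? 1 : 0 }

my @cases = (
    # Negative lookahead: what FOLLOWS "foo" must not be "bar".
    [ 'foobar', qr/foo(?!bar)/  ],
    [ 'foobaz', qr/foo(?!bar)/  ],
    # Negative lookbehind: what PRECEDES "bar" must not be "foo".
    [ 'foobar', qr/(?<!foo)bar/ ],
    [ 'bazbar', qr/(?<!foo)bar/ ],
);

for my $case (@cases) {
    my ($subject, $re) = @$case;
    printf "%-8s %-22s %s\n", $subject, $re, hits($subject, $re) ? 'match' : 'no match';
}
```

Either assertion is checked at the position where it appears in the pattern, so placing (?!bar) inside a repeated group means it is re-checked before every character the group consumes.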
        ---
        $world=~s/war/peace/g

Re: Why isn't this regex greedy?
by GrandFather (Saint) on Mar 17, 2006 at 10:24 UTC

    (?!bar) is a negative lookahead assertion. It is a zero-width assertion, so it anchors the match but doesn't use any characters. (?!bar). matches a character which is not the start of a bar sequence.

    So, with that in mind: ((?!bar).){1,5} matches as many characters as it can (up to 5, and at least 1), none of which is the start of the character sequence bar.
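
The "up to 5, and at least 1" bounds can be checked with a short sketch, assuming Perl 5. Only 'fooxbarbar' comes from the thread; the other inputs and the helper between are made up for illustration, and the inner group is non-capturing to keep the numbering simple.

```perl
use strict;
use warnings;

# Hypothetical helper: the text captured between "foo" and "bar", or undef.
sub between {
    my ($s) = @_;
    return $s =~ /foo((?:(?!bar).){1,5})bar/ ? $1 : undef;
}

for my $s ('fooxbarbar', 'fooabcdebar', 'fooabcdefbar', 'foobar') {
    my $got = between($s);
    printf "%-14s %s\n", $s, defined $got ? "captured '$got'" : 'no match';
}
```

Five characters before bar still fit, six exceed the {1,5} limit, and foobar fails because the very first character after foo already starts a bar, leaving nothing for the required first iteration.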


    DWIM is Perl's answer to Gödel
Re: Why isn't this regex greedy?
by kettle (Beadle) on Mar 17, 2006 at 10:33 UTC
    I may be totally off my rocker here but I think that perl will interpret:

    $_ = 'fooxbarbar';

    # Not greedy - why not?
    if(/(foo)((?!bar).){1,5}bar/){
        print "1 Matched:$1 $2\n";
    }
    else{
        print "1 Not Matched\n";
    }


    exactly the same way it interprets your code from expression #1:

    $_ = 'fooxbarbar';

    # Not greedy - why not?
    if(/(foo)(((?!bar).){1,5})bar/){
        print "1 Matched:$1 $2\n";
    }
    else{
        print "1 Not Matched\n";
    }


    (note the missing set of parentheses on the top, modified version of your expression)

    The output of both is, in any case, the same. The salient point being that perl is not including the {1,5} in the backreference - thus making the expression non-greedy...?
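
As far as I can tell, the two versions only print the same thing for 'fooxbarbar' because a single character sits between foo and bar. A hedged sketch, assuming Perl 5 and a made-up input 'fooxybar': with the outer parentheses $2 spans every repetition, and without them $2 holds only the last repetition. In both cases the quantifier itself stays greedy; only what gets captured changes.

```perl
use strict;
use warnings;

my $s = 'fooxybar';    # made-up input with two characters before "bar"

# With the outer group: $2 wraps the whole {1,5} repetition.
my @with_outer = $s =~ /(foo)(((?!bar).){1,5})bar/;

# Without it: $2 is the repeated group itself, so it keeps only
# the text from its final iteration.
my @without_outer = $s =~ /(foo)((?!bar).){1,5}bar/;

print "with outer group:    \$2 = $with_outer[1]\n";
print "without outer group: \$2 = $without_outer[1]\n";
```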

    There is a great little blurb on positive and negative lookahead that you might take a look at:

    http://www.regular-expressions.info/lookaround.html

    wonder if that makes sense....