davido has asked for the wisdom of the Perl Monks concerning the following question:

While working on Re: string substitution a thought came to me. Perhaps it is more of a meditation than a question, but I was intimidated by the thought of posting such trivia next to Abigail-II's The N-queens problem using pure regexes.

If \b, a zero-width assertion, doesn't care whether it's placed before or after the word it's guarding (so to speak), why should (?= ) care that it is for lookahead only? What is it that makes (?<= ) necessary, and (?= ) unable to fill both niches?

Then the concept comes to mind of the ^ and $ anchors. They're zero-width, and unlike \b, they care about which one gets used, and where. But maybe through habbit, that idiom seems to make sense and lend clarity. In other regexp flavors we see things like \< and \> to represent \b in a more placement-sensitive way. Perl seems fine with the simple \b. Wouldn't Perl be "just fine" with (?= ) filling the roll as both a zero width lookahead and a zero width lookbehind?

Then there's the other question.... why does (?<= ) have to contain fixed-length matches (ie, no quantifiers like + or *, and no lopsided alternation), while (?= ) is allowed to contain quantifiers or lopsided alternations (in other words, doesn't have to be fixed length)?

I've dug into Camel II, Owls, perlre, and perlretut but haven't found the answer. Perhaps the answer is just "that's the way it is." Is it? ;)


Dave


"If I had my life to do over again, I'd be a plumber." -- Albert Einstein
  • Comment on Why do zero width assertions care about lookahead/behind?

Replies are listed 'Best First'.
•Re: Why do zero width assertions care about lookahead/behind?
by merlyn (Sage) on Oct 08, 2003 at 19:33 UTC
    That's \b is defined as both lookahead and lookbehind, but the explicit operators give you each option separately. To create \b, you need something like:
    /(?: (?<!\w)(?=\w) | (?<=\w)(?!\w) )/x
    In other words, either lookbehind and don't find a \w, then lookahead and do, or look behind and find a \w, and lookahead and don't find one.

    So the fact that it can be placed before or after a "word" is because it has an implicit or in the definition.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: Why do zero width assertions care about lookahead/behind?
by liz (Monsignor) on Oct 08, 2003 at 19:00 UTC
    If you think a bit about what variable length look behind means for backtracking, you may realize why it's not in Perl right now.

    If you really want variable length look behind and only need fixed length look ahead, reverse your string and adapt your regex accordingly.

    Liz

      OK, I thought about what variable-length look behind means for backtracking. The answer is that if your regular expression engine has a mode where it scans through the string backwards that it goes into at the right time, then variable-length look behind is the same impact as variable-length look ahead.

      Yes, implementing this is a pain. I am not willing to implement it. So I am not willing to complain that it doesn't exist in Perl.

      However I will admit to being irritated that Java's RE engine supports variable length look behind while Perl's doesn't.

        ...has a mode where it scans through the string backwards...

        The way I see it, is that the whole regular expression engine should be able to go either forward or backward. At any point of its execution. To be really able to have variable length look-behind.

        (?<=foo.*?) # if there is a "foo" before the match
        is but the beginning. How about:
        <?<=foo.*?(?=.*?bar)) # if there is a "foo" before, with a "bar" afte +r it

        To do this properly, you would need to rewrite the whole regex engine, I'm afraid (and what I remember from one of MJD's talks about how he hacked the regular expression engine). This may/will happen for Perl6. So I guess we'll have to hold our breath until then.

        Meanwhile, I wonder whether you wouldn't be able to hack something with (?{code}) and/or (??{code}), grabbing the string before the match so far, reversing it and feeding that a reversed regex. Possibly in combination with a source filter.

        Hmmm... maybe I shouldn't have these evil thoughts this early in the morning. ;-)

        Liz

Re: Why do zero width assertions care about lookahead/behind?
by Abigail-II (Bishop) on Oct 08, 2003 at 22:50 UTC
    If \b, a zero-width assertion, doesn't care whether it's placed before or after the word it's guarding (so to speak), why should (?= ) care that it is for lookahead only? What is it that makes (?<= ) necessary, and (?= ) unable to fill both niches?

    You seem to be under the assumption that (?= ) and (?<= ) match the same thing. That isn't true, as the program below shows:

    #!/usr/bin/perl use strict; use warnings; no warnings qw /syntax/; $_ = "---abc"; print "Lookahead matches\n" if /(?=abc)\w/; print "Lookbehind matches\n" if /(?<=abc)\w/; __END__ Lookahead matches

    See, if (?= ) and (?<= ) are different, no construct could replace them both.

    Abigail

      Hi Abigail, I am a bit confused with the look behind match in your example, I believe that /(?<=abc)\w/ should never match because the look behind does not include the character at the cursor. Does this means that if I want to do a look behind match, I have to know more precisely about the position that I look behind from? I have changed the look behind match in your example to /(?<=abc)\b/, which seems to match "---abc" in the example you have given.

        Both /(?=abc)\b/ and /(?<=abc)\b/ match the string "---abc", but not at the same position, as the program below shows:
        #!/usr/bin/perl use strict; use warnings; $_ = "---abc"; /(?=abc)\b/ and print "Lookahead \$` = '$`'; \$' = '$''\n"; /(?<=abc)\b/ and print "Lookbehind \$` = '$`'; \$' = '$''\n"; __END__ Lookahead $` = '---'; $' = 'abc' Lookbehind $` = '---abc'; $' = ''
        /(?=abc)\b/ matches at a word boundary that is followed by "abc", hence after "---", but before "abc". /(?<=abc)\b/ matches at a word boundary that is preceded by "abc", hence after "abc", and right before the end of the string.

        Abigail

Re: Why do zero width assertions care about lookahead/behind?
by QM (Parson) on Oct 08, 2003 at 21:31 UTC
    Your description of \b is flawed.

    \b matches only if each side is different from the other (or the trivial case where there is only one side -- but note that it won't match the empty string). It's already looking on both sides!

    On the other hand, if you only want \W\b\w, but not \w\b\W, you'll have to do that yourself. [I suppose someone could come up with the backslash equivalents for these.]

    In a complementary vein, if you want (?=...) to look both ways, how do you ask for only lookahead, or only lookbehind?

    -QM
      Yes, I noticed the error in my description of \b after having posted, and despite recent discussions, the ability for laypeople to edit parent nodes still doesn't exist. It would have been more accurate for me to say that \b doesn't care about whether it's being used at the beginning of a word or at the end of a word. And my point was, why should (?=...) care whether it's being used at the beginning or the end of the text it anchors its assertion to.

      You are accurate that \b looks at both sides. I think that merlyn provided the perfect clarification; \b has alternation built into it. It either looks like (?<=\W)(?=\w) or like (?<=\w)(?=\W) depending on whether it's being used at the beginning or the end of a word.

      So \b is not a simple lookahead or lookbehind assertion, it is a complex lookbehind/ahead in alternation with an opposing lookbehind/ahead.

      Similar assertions could be custom written too. Say for example I wanted to create \x (my new metacharacter that means boundry between space and nonspace). Well, I can't name it \x, but I suppose I could name it $x. But whatever I call it, the definition would be: (?:(?<=\s)(?=\S))|(?:(?<=\S)(?=\s))

      As far as what direction (?=...) looks, I didn't really want it to look both ways at once. I just thought it a little confusing that it could only look ahead. Without the prior benefit of merlyn's reply, thus not fully understanding the \b example, it seemed odd that (?=...) should be incapable of being used for lookbehind just as easily as lookahead. I understand that distinction now.

      As for my other comment regarding the fact that (?<=...) must be fixed-width, I understand that as liz stated, it would be a backtracking nightmare if such were not the case, but still don't fully understand why that is so. I'll have to re-read the section in the Owls book about DFA engines and backtracking. Eventually it will sink in. ;)


      Dave


      "If I had my life to do over again, I'd be a plumber." -- Albert Einstein
Re: Why do zero width assertions care about lookahead/behind?
by BrowserUk (Patriarch) on Oct 09, 2003 at 02:15 UTC

    I think a better explanation of why \b can be used as both a "lookahead" and a "lookbehind" assertion, is that not only is it zero-width in terms of the number of places it advances the current pointer as is the case with (?=), (?!), (?<=) & (?<!).

    It is also "zero-width" in terms of the number of characters it matches, which the others aren't.

    That is to say, \b doesn't match characters, it matches the transition between two characters. And the transition being measured is always a straight comparison between the character before the pointer and the character after, only the sense of the comparison is changes.

    Thus, \b is neither a lookahead nor a lookbehind assertion, it's more a "lookhere" or maybe "lookbetween" assertion.

    As an interesting exercise to prove that \b is neither a lookahead nor a lookbehind assertion, try creating one of either than can be used as a substitute for \b.

    I realise that my explanation will probably not fit anyone elses mental model, but it works for me:)


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

      As an interesting exercise to prove that \b is neither a lookahead nor a lookbehind assertion, try creating one of either than can be used as a substitute for \b.
      I did, proving your assertion wrong.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        merlyn++. Neat! It's even a lot quicker than I thought it would be.

        $text = do{ local $/; open W, '<dutch.words' or warn $!; <W> }; print length $text; 2603063 use Benchmark qw[ cmpthese ]; cmpthese( 1, { std=> q[ @words1 = $text =~ m[\b(\w+)\b]g; ], merlyn=> q[ @words2 = $text =~ m[ (?: (?<!\w)(?=\w) | (?<=\w)(?!\w) ) (\w+) (?: (?<!\w)(?=\w) | (?<=\w)(?!\w) ) ]xg ] }); (warning: too few iterations for a reliable count) (warning: too few iterations for a reliable count) s/iter merlyn std merlyn 25.9 -- -50% std 13.0 98% -- print scalar @words1, ' : ', scalar @words2; 227248 : 227248
        ...proving your assertion wrong.

        It could be argued that my assertion that "that \b is neither a lookahead nor a lookbehind assertion, try creating one of either" is correct... in that.

        • It isn't either, it's both!
        • And your neat construct is actually two of each:) and I said "...one of either...".

        but that would probably be a fruitless argument, and takes nothing away from your construction. Once again, it's neat :)


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        Hooray!