in reply to Why is variable length lookahead implemented while lookbehind is not?

When you write a backtracking regex engine like the one in Perl 5, a full (variable length) look-ahead is pretty easy to implement. You just have to a normal match, and if the match succeeds, you reset the match position to where the look-ahead started off. And one can mark the look-ahead as not needing to backtrack.

On the other hand a look-behind requires either a backwards search, or some sufficiently advanced magic that resets the match position to the start of the string, and then matches something like (?s:.*)$look_behind\K, and anchors the end of that match to the position where the look-behind started, and then resets all the match variables as if that match has never happened, execept the captures from within $look_behind.

And even when you do that, the performance is dependent on the length of the string up to the current position, instead of just depending on the string in the vicinity of the current match position; to do it properly, you'd really need to search backwards.

Replies are listed 'Best First'.
Re^2: Why is variable length lookahead implemented while lookbehind is not?
by ikegami (Patriarch) on Sep 07, 2011 at 20:04 UTC

    When you write a backtracking regex engine like the one in Perl 5, a full (variable length) look-ahead is pretty easy to implement.

    It seems to me that the easy solution would produce bad performance. The easy solution requires matching each lookbehind subpattern at pos $current_pos - $max_match_length, which is 0 if /.*/ is used!

    'bbbbbb' =~ /(?<=[^z]*a)b/

    requires checking for "z" and "a" a total of 72 times each before failing even though the string is only 6 chars long!

    Oops, I guess you covered this at the end, but it seems to me that it's a foremost concern.

    It also forces

    'xiyjykz' =~ /(?<=x(.*)y(.*))z/s
    to produce
    $1 = 'iyj'; $2 = 'k'
    but one might expect
    $1 = 'i'; $2 = 'jyk'
      When you write a backtracking regex engine like the one in Perl 5, a full (variable length) look-ahead is pretty easy to implement.

      It seems to me that the easy solution would produce bad performance. The easy solution requires matching each lookbehind subpattern at pos $current_pos - $max_match_length

      (emphasis mine)

      It seems we're talking about two different things here. The simple approach for look-ahead is not slow (at least if you take care not to backtrack the look-ahead). And that's probably the reason it's implemented in Perl, and most advanced regex engines.

      In the case of look-behind, we're in violent agreement :-)

      My first thought for an "easy solution" is to generate a reversed version of the target string if there is a variable length look behind, then use the usual look ahead match processing on the reversed string starting at the "reversed" anchor point.

      True laziness is hard work
        That is what is discussed at Variable-length lookbehind at dev.perl.org. The suggestion to write the variable-length lookbehind in right-to-left fashion was rejected as being obscure. The Perl regex-engine will have to do this reversing itself, which is not always possible or easy to do.

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        Reversing the string requires reversing the regex pattern too, and that's not trivial. For example, (?<=(.).*\1) would try to match \1.*(.), but \1 hasn't been populated yet.

        Update:

        ... What am I saying! Just have people write (.).*\1. That's what they'd have to do with a right-to-left parser too. I suspect all issues can be solved with proper documentation.