in reply to Regex code block executes twice per match using look-arounds

That is actually the usual behavior. The regex is usually implemented as a DFA (Finite State Machine), and occasionally they have to backtrack to complete the match. Your code is executed at the time the DFA moves through your (?{}) block even if it doesn't make it to the end which is when the DFA "matches."

In this case, it is your lookbehind that's causing the DFA to remember something, wander forward to see if it's relevant, and wander back when it isn't... If you put your (?{}) block at the very end it would only print when it actually finds a match.

-Paul

  • Comment on Re: Regex code block executes twice per match using look-arounds

Replies are listed 'Best First'.
Re^2: Regex code block executes twice per match using look-arounds
by johngg (Canon) on Jul 12, 2007 at 13:14 UTC
    Thank you for the reply. I'm obviously not understanding something as I thought my (?{...}) block was at the very end of the regular expression. I didn't think it would be encountered and triggered unless both look-behind and look-ahead succeeded. I expected the matching to go something like this

    Initialise pointer to beginning of string test look-behind, nothing preceding so fails advance pointer one place test look-behind, '{' so fails advance pointer one place test look-behind, 'x' so fails advance pointer one place test look-behind, '1' so fails advance pointer one place test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute advance pointer one place test look-behind, '[' so fails ...

    If the DFA had to look past my code block to check something else and then backtrack, the behaviour would make sense. However, I can't see how that is happening in this case.

    I'm sure it must be me failing to get my head around something fundamental. Please could you point out where my understanding is lacking.

    Cheers,

    JohnGG

      test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute advance pointer one place

      That last step is wrong. It should be

      advance pointer the length of the match

      The length of the match is zero in the case where the lookahead and lookbehind is used. Since the pointer is not advanced, the regexp matches everything twice. Only a final check prevents the regexp from returning an identical match.

      test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute same match? no, continue advance pointer the length of the match (0) test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute same match? yes, backtrack

      More on this in Re: Regex code block executes twice per match using look-arounds.

        advance pointer the length of the match

        That was the fundamental piece of the puzzle that I was missing. Indeed, after jettero's 2nd reply I started to investigate with use re 'debug'; and could see the mechanism you describe.

        I changed the regex so that it was using the look-behind but the look-ahead was replaced with a simple capture

        ... my $rxBetween = qr {(?x) (?<=($rxClose)) ($rxOpen) (?{print qq{Match @{ [++ $count] }: on left $1, on right $2\n}} +) }; ... $string =~ s{$rxBetween}{+$2}g; ...

        and that stopped the double execution. It also seemed clear from the debug output that using both look-arounds was making the engine do a lot more work.

        Thank you for your replies and the insights they have given.

        Cheers,

        JohnGG

      Yes, well, the zero-width of the look behind and look ahead is clearly causing it to do something other than the obvious. I suspect the NFA (thanks sgt, interesting) is looking for something after the zero width thingies... Perhaps it would behave less oddly if you put a '.' before your code? Who really knows what the *FA is really doing?

      Any way you look at it, your regex is unusual since all of the expressions are zero-width, so they kinda match? And the when of code embedded in the regex isn't all that well defined I bet since the perlre page calls it experimental...

      I don't think it's changed in the last 5 years. I wonder when it'll stop being experimental.

      -Paul

Re^2: Regex code block executes twice per match using look-arounds
by sgt (Deacon) on Jul 12, 2007 at 15:02 UTC

    Just a minor pick: perl (ir)regular expressions are usually put in the NFA category meaning essentially some pathological (and not-so pathological) patterns can run for a long time (like double loops: '(blah.*)+'). This is a trade-off for getting more useful features.

    The Owl book (Mastering Regular Expressions) talks about 3 categories: DFA, NFA, POSIX NFA. In an alternation pattern, perl uses the first alternation to match, without checking for a possible longer match, making it non POSIX.

    There was an interesting discussion in p5p a few months ago motivated by the following article regex matching can be simple and fast but... One conclusion was that using some hybrid DFA/NFA scheme like the one used by gawk which uses GNU regex library could be nice, when you want to garantee a decent running time and are not using features that force the NFA engine to kick in. You'll get the best of all worlds with 5.10 that allows pluggable regex libraries :)

    cheers --stephan Note: the compiled form of a regex is usually quite different in both schemes, furthermore the mathematical equivalence between NFA and DFA possibly useful for simple models of regex (few operators) is not used.