Re: Regex code block executes twice per match using look-arounds

Replies are listed 'Best First'.
Re^2: Regex code block executes twice per match using look-arounds by johngg (Canon) on Jul 12, 2007 at 13:14 UTC
Thank you for the reply. I'm obviously not understanding something as I thought my `(?{...})` block was at the very end of the regular expression. I didn't think it would be encountered and triggered unless both look-behind and look-ahead succeeded. I expected the matching to go something like this `Initialise pointer to beginning of string test look-behind, nothing preceding so fails advance pointer one place test look-behind, '{' so fails advance pointer one place test look-behind, 'x' so fails advance pointer one place test look-behind, '1' so fails advance pointer one place test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute advance pointer one place test look-behind, '[' so fails ...` [download] If the DFA had to look past my code block to check something else and then backtrack, the behaviour would make sense. However, I can't see how that is happening in this case. I'm sure it must be me failing to get my head around something fundamental. Please could you point out where my understanding is lacking. Cheers, JohnGG	[reply] [d/l] [select]
Re^3: Regex code block executes twice per match using look-arounds by ikegami (Patriarch) on Jul 12, 2007 at 18:33 UTC
`test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute advance pointer one place` [download] That last step is wrong. It should be `advance pointer the length of the match` [download] The length of the match is zero in the case where the lookahead and lookbehind is used. Since the pointer is not advanced, the regexp matches everything twice. Only a final check prevents the regexp from returning an identical match. `test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute same match? no, continue advance pointer the length of the match (0) test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute same match? yes, backtrack` [download] More on this in Re: Regex code block executes twice per match using look-arounds.	[reply] [d/l] [select]
Re^4: Regex code block executes twice per match using look-arounds by johngg (Canon) on Jul 12, 2007 at 19:20 UTC
`advance pointer the length of the match` That was the fundamental piece of the puzzle that I was missing. Indeed, after jettero's 2nd reply I started to investigate with `use re 'debug';` and could see the mechanism you describe. I changed the regex so that it was using the look-behind but the look-ahead was replaced with a simple capture `... my $rxBetween = qr {(?x) (?<=($rxClose)) ($rxOpen) (?{print qq{Match @{ [++ $count] }: on left $1, on right $2\n}} +) }; ... $string =~ s{$rxBetween}{+$2}g; ...` [download] and that stopped the double execution. It also seemed clear from the debug output that using both look-arounds was making the engine do a lot more work. Read more... (11 kB) Thank you for your replies and the insights they have given. Cheers, JohnGG	[reply] [d/l] [select]
Re^3: Regex code block executes twice per match using look-arounds by jettero (Monsignor) on Jul 12, 2007 at 15:43 UTC
Yes, well, the zero-width of the look behind and look ahead is clearly causing it to do something other than the obvious. I suspect the NFA (thanks sgt, interesting) is looking for something after the zero width thingies... Perhaps it would behave less oddly if you put a '.' before your code? Who really knows what the FA is really doing? Any way you look at it, your regex is unusual since all of the expressions are zero-width, so they kinda* match? And the when of code embedded in the regex isn't all that well defined I bet since the perlre page calls it experimental... I don't think it's changed in the last 5 years. I wonder when it'll stop being experimental. -Paul	[reply]
Re^2: Regex code block executes twice per match using look-arounds by sgt (Deacon) on Jul 12, 2007 at 15:02 UTC
Just a minor pick: perl (ir)regular expressions are usually put in the NFA category meaning essentially some pathological (and not-so pathological) patterns can run for a long time (like double loops: '(blah.*)+'). This is a trade-off for getting more useful features. The Owl book (Mastering Regular Expressions) talks about 3 categories: DFA, NFA, POSIX NFA. In an alternation pattern, perl uses the first alternation to match, without checking for a possible longer match, making it non POSIX. There was an interesting discussion in p5p a few months ago motivated by the following article regex matching can be simple and fast but... One conclusion was that using some hybrid DFA/NFA scheme like the one used by gawk which uses GNU regex library could be nice, when you want to garantee a decent running time and are not using features that force the NFA engine to kick in. You'll get the best of all worlds with 5.10 that allows pluggable regex libraries :) cheers --stephan Note: the compiled form of a regex is usually quite different in both schemes, furthermore the mathematical equivalence between NFA and DFA possibly useful for simple models of regex (few operators) is not used.	[reply]