AnomalousMonk has asked for the wisdom of the Perl Monks concerning the following question:

Here's something I don't understand and would appreciate enlightenment about. The (?=(pattern)) construct is a standard idiom for extracting overlapping sub-strings (character pairs in the example below). However, when it is combined with an embedded (?{ code }) block, the code is, for some reason unknown to me, executed twice per match. This is the case whether the code block is inside or outside of the look-ahead group. Note that the output list of extracted matches is as expected.

The example code behaves identically under ActiveState 5.8.9 and Strawberry 5.10.1.5 and 5.12.3.0, and also behaves identically if $1 is used instead of $^N.

Many thanks in advance for any insight on this question.

>perl -wMstrict -le
"my $s = 'abcd';
 ;;
 my @pairs1 = $s =~ m{ (?= (..)) (?{ printf qq{'$^N' } }) }xmsg;
 print '';
 printf qq{:$_: } for @pairs1;
 print '';
 ;;
 my @pairs2 = $s =~ m{ (?= (..) (?{ printf qq{'$^N' } })) }xmsg;
 print '';
 printf qq{:$_: } for @pairs2;
 print '';
"
'ab' 'ab' 'bc' 'bc' 'cd' 'cd'
:ab: :bc: :cd:
'ab' 'ab' 'bc' 'bc' 'cd' 'cd'
:ab: :bc: :cd:

Re: Regex: Overlapping Matches: Double Execution of Eval-ed Code (pos==pos && len!=len)
by tye (Sage) on Jan 15, 2012 at 20:16 UTC

    It is one of the things I consider to be a bug in Perl's regex engine. Zero-width matches can cause Perl to consider multiple matches starting at the same point. Perl eventually rejects identical matches in order to prevent an infinite loop. Your (?{ ... }) block catches Perl having a go at matching something different by trying the match a second time at each starting point.

    You can work around this simply enough:

    #!perl -wl
    use strict;

    my $s = 'abcd';
    my @pairs1 = $s =~ m{ (?= (..)).? (?{ printf qq{ '$^N'} }) }xmsg;
    #                              ^^
    print '';
    printf qq{ :$_:} for @pairs1;
    print '';

    my @pairs2 = $s =~ m{ (?= (..) (?{ printf qq{ '$^N'} })).? }xmsg;
    #                                                       ^^
    print '';
    printf qq{ :$_:} for @pairs2;
    print '';
    __END__
     'ab' 'bc' 'cd'
     :ab: :bc: :cd:
     'ab' 'bc' 'cd'
     :ab: :bc: :cd:

    (Update: Pasted the wrong code for a few seconds.)

    - tye        
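
One way to see the effect of tye's .? is to count how often the code block runs with and without it. The sketch below is an illustration, not code from the thread; the counter names $plain and $fixed are made up for the example. On the Perls reported in the question the zero-width form runs the block twice per match, but that count depends on engine internals, so the script prints it instead of assuming it.

```perl
use strict;
use warnings;

my $s = 'abcd';

# Package variables, so the code blocks see them on older Perls too.
# Both names are hypothetical, chosen for this illustration.
our ($plain, $fixed) = (0, 0);

# Zero-width pattern: every successful match has length 0.
my @pairs_plain = $s =~ m{ (?= (..) ) (?{ $plain++ }) }xmsg;

# tye's variant: .? consumes one character, so no match is zero-width.
my @pairs_fixed = $s =~ m{ (?= (..) ) .? (?{ $fixed++ }) }xmsg;

print "plain: @pairs_plain (block ran $plain times)\n";
print "fixed: @pairs_fixed (block ran $fixed times)\n";
```

Both forms return the same pairs. With .? each match ends one character past its start, so the engine never retries at the same position and the block runs once per pair.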

      ... work around ...

      I also came up with a work-around similar to JavaFan's, although whether it is better than yours is another question, as it needs the Special Backtracking Control Verbs of 5.10+. This was in the course of fiddling with a reply to Re^5: dice's coefficient. The idea was to avoid list generation by the regex (and it also, coincidentally, works without the /g modifier), but the results were not particularly noteworthy in terms of speed.

Re: Regex: Overlapping Matches: Double Execution of Eval-ed Code
by BrowserUk (Patriarch) on Jan 15, 2012 at 20:32 UTC

    Simplistically, it is caused by the required backtracking. How many times the code block gets called also depends on its position within the pattern:

    #! perl -sw
    use strict;
    use re 'eval';

    my $s = 'abcdef';
    our $n;

    $n = 0;
    my @groups = $s =~ m[(?{ print ++$n, ' '; })(?=(..))]g;
    print + $/;

    $n = 0;
    @groups = $s =~ m[(?=((?{ print ++$n, ' '; })..))]g;
    print + $/;

    $n = 0;
    @groups = $s =~ m[(?=(.(?{ print ++$n, ' '; }).))]g;
    print + $/;

    $n = 0;
    @groups = $s =~ m[(?=(..(?{ print ++$n, ' '; })))]g;
    print + $/;

    $n = 0;
    @groups = $s =~ m[(?=(..))(?{ print ++$n, ' '; })]g;
    print + $/;
    __END__
    C:\test>junk57
    1 2 3 4 5 6 7 8 9 10 11 12
    1 2 3 4 5 6 7 8 9 10 11 12
    1 2 3 4 5 6 7 8 9 10 11
    1 2 3 4 5 6 7 8 9 10
    1 2 3 4 5 6 7 8 9 10
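
The counts above say how often the block runs but not where. A small variation, sketched here as an illustration and not taken from the thread, records pos() inside the block; perlre documents that pos() can be used inside (?{ ... }) to find the current matching position. The array name @seen is an assumption made for the example.

```perl
use strict;
use warnings;

my $s = 'abcdef';

# Positions at which the code block ran (hypothetical name for this sketch).
our @seen;

# The block sits before the look-ahead, so it runs on every attempt,
# including attempts that the engine goes on to reject.
my @groups = $s =~ m[(?{ push @seen, pos() })(?=(..))]g;

print "matches:   @groups\n";
print "block ran: @seen\n";
```

Comparing the two printed lines shows which starting positions were visited more than once, and which were visited even though no pair could start there.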

Re: Regex: Overlapping Matches: Double Execution of Eval-ed Code
by JavaFan (Canon) on Jan 15, 2012 at 21:12 UTC
    I would write that as:

    use 5.010;    # say and (*FAIL) both need 5.10+
    "abcd" =~ /(..) (?{push @pairs, $1}) (*FAIL)/xs;
    say "@pairs";
    __END__
    ab bc cd
    Note the absence of zero-width matches; together with the (*FAIL), this causes the regexp engine to restart with the next character. The zero-width technique relies on the engine making an exception: a second successive zero-width match at the same position is rejected -- however, in order to determine that the second match has zero length, the engine has to execute the block.
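
The exception JavaFan mentions is described in perlre under "Repeated Patterns Matching a Zero-length Substring". The example from that section, reproduced below as a runnable script (the variable names are mine), shows the engine finding both an empty and a non-empty match at the same starting position.

```perl
use strict;
use warnings;

my $str = 'bar';

# \w?? prefers to match nothing. After an empty match at a position, a
# second empty match there is rejected, so the engine backtracks and
# lets \w?? take one character instead.
(my $marked = $str) =~ s/\w??/<$&>/g;

print "$marked\n";    # <><b><><a><><r><>
```

Each <> is a zero-length match, and each <b>, <a>, <r> is the second attempt at the same position, forced to be non-empty.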