Variable-Width Lookbehind (hacked via recursion)

Warning: Since this uses recursion, it is horribly inefficient and may easily blow up on longer strings. If you think you need this for variable-width lookbehind, first think about how you might solve the problem with other techniques, such as lookahead (which is variable-width out of the box) or simply multiple regular expressions.

The following is presented as a curiosity that came out of the discussion here - thank you LanX and QM for providing the inspiration :-)

Zero-width Lookaround Assertions are incredibly useful, but unfortunately the lookbehind assertions (?<=pattern) and (?<!pattern) are restricted to fixed-length patterns, and sometimes you just really want to be able to say something like (?<=ab+.*)c. With the following technique, you can emulate these kinds of variable-width lookbehind assertions.
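To make the restriction concrete, here is a small sketch (the sample string and variable names are mine, not from the discussion). The pattern is compiled from a string at run time so that the compile error can be caught with eval; as far as I know, even newer Perls that experimentally accept bounded variable-length lookbehind still reject an unbounded one like this. The second half shows the "multiple regular expressions" fallback mentioned in the warning: match the c first, then test the text before it separately.

```perl
use strict;
use warnings;

# Compile at run time so the error is catchable instead of fatal at compile time.
my $pattern  = '(?<=ab+.*)c';
my $compiled = eval { qr/$pattern/ };
my $error    = $@;
print "compile failed: $error" unless defined $compiled;

# Fallback with two regexes: find each "c", then check the prefix on its own.
my $str = "xc abbzc c";
my @positions;
while ($str =~ /c/g) {
    my $start  = $-[0];                     # offset where this "c" begins
    my $prefix = substr($str, 0, $start);   # everything before it
    push @positions, $start if $prefix =~ /ab+.*\z/;
}
print "c preceded by ab+.* at offsets: @positions\n";   # 7 9
```

The first c (offset 1) is skipped because nothing matching ab+ comes before it.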

The basic principle is to use repeated, recursive lookbehind operations, each of which is fixed-width (either one or two characters, depending on the situation), thus "looping" backwards through the string character by character, and using zero-width lookahead assertions at each position, checking whether a match occurs or not.
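The same walk can be written as an ordinary loop, which may make the principle easier to see before reading the regexes. This is only an illustration in plain Perl, not the recursive technique itself; the helper name preceded_by and the sample data are made up for this sketch. It steps back from the target one character at a time and tries an anchored match at each position, which is the job the zero-width lookahead does inside the regexes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the node): true if $look matches starting
# at some position before offset $pos in $str.
sub preceded_by {
    my ($str, $pos, $look) = @_;
    for (my $i = $pos - 1; $i >= 0; $i--) {              # step back one character
        return 1 if substr($str, $i) =~ /\A(?:$look)/;   # "lookahead" at position $i
    }
    return 0;                                            # reached the start, no match
}

my $str = "x2 abbbc x4 abc5 x1 x42";
my @found;
while ($str =~ /(x\d+)/g) {
    my ($target, $start) = ($1, $-[1]);   # save before running other matches
    push @found, $target if preceded_by($str, $start, qr/ab+c/);
}
print "@found\n";   # x4 x1 x42
```

Each target costs one pass back over the string, so the loop is quadratic in the worst case; the recursive regexes do the same amount of walking, plus the overhead of recursion.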

The examples are a bit contrived and are only intended to demonstrate the idea. One thing to notice is that, although you can match multiple targets in a single string via /g, there are extra capture groups that you need to ignore (I'd suggest using the long forms and named capture groups to avoid confusion). Another thing to note is that the following examples scan backwards all the way to the beginning of the string (or until a match is found), without anchoring the "recursive lookbehind" at the current match position.

Update: I just tested this across different versions of Perl, and it (currently) only works correctly on Perl v5.20 and above.
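Since the long and short forms differ mainly in how they refer to groups, here is the same contrast on a simpler and well-known recursive pattern, balanced parentheses. The pattern is not from this node; it is only here to show the notation. (?&name) recurses into a named group, while (?-1) recurses into the most recently opened capture group, counting backwards from that point.

```perl
use strict;
use warnings;

# Long form: named group, recursion by name.
my $balanced_long = qr{
    (?<paren>
        \(
        (?: [^()]++ | (?&paren) )*
        \)
    )
}x;

# Short form: numbered group, relative recursion.
my $balanced_short = qr/(\((?:[^()]++|(?-1))*\))/;

for my $str ("f(a(b)c)", "no parens", "g((x)") {
    my $long  = $str =~ $balanced_long  ? $+{paren} : "(none)";
    my $short = $str =~ $balanced_short ? $1        : "(none)";
    print "$str => long: $long, short: $short\n";
}
```

With named groups the result is read from %+ by name, so adding or removing a helper group does not shift the numbering, which is the reason for preferring the long forms.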

This matches the pattern x\d+, but only when it is preceded by the pattern ab\d+ with the same \d+ part (hence the \d+(?!\d)).

my $re1 = qr{
    (?<target> x (?<digits> \d+ ) (?!\d) )
    (?= (?<lookback>
        (?<=
            (?! (?<match> ab \g{digits} (?!\d) ) ) .
            (?=(?&lookback)) .
        |
            (?=(?&match)) . .
        )
    ) )
}msx;
my $re1_short = qr
    / (x(\d+)(?!\d)) (?=((?<=(?!( ab\2(?!\d) )).(?=(?-2)).
        |(?=(?-1))..)))/sx;
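For comparison, here is how the same question could be answered with the "multiple regular expressions" approach recommended in the warning: match each x\d+ first, then search the text before it for ab followed by the same digits. This is a sketch of the alternative, not a drop-in replacement for the regex above; the variable names are mine, and the sample string is the one used in the test script further down.

```perl
use strict;
use warnings;

my $str = "ab1 ab32 x2 ab42 x3 ab3 ab4x4ab5x1x42x45";
my @results;
while ($str =~ /(x(\d+)(?!\d))/g) {
    my ($target, $digits, $start) = ($1, $2, $-[0]);
    my $before = substr($str, 0, $start);   # text preceding this target
    # Same digits, and not followed by a further digit.
    push @results, $target if $before =~ /ab\Q$digits\E(?!\d)/;
}
print "@results\n";   # x4 x1 x42
```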

This matches any character that has already appeared earlier in the string, i.e. the inverse of the original problem here.

my $re2 = qr{
    (?<char> . )
    (?= (?<lookback>
        (?<=
            (?! \g{char} ) .
            (?=(?&lookback)) .
        |
            (?= \g{char} ) . .
        )
    ) )
}msx;
my $re2_short = qr
    /(.)(?=((?<=(?!\1).(?=(?-1)).|(?=\1)..)))/sx;
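Outside of a regex, the usual way to get the same answer is a %seen hash. The sketch below is that plain-Perl equivalent (my own variable names), run on the string from the test script, so the two approaches can be compared directly.

```perl
use strict;
use warnings;

my $str = "abcdefbdfbb";
my (%seen, @results);
for my $char (split //, $str) {
    # Report a character only if it already occurred earlier in the string.
    push @results, $char if $seen{$char}++;
}
print "@results\n";   # b d f b b
```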

If the string you're looking for doesn't depend on the match, you can write it in the following order; the following matches any x\d+ that is preceded by ab+c.

my $re3 = qr{
    (?= (?<lookback>
        (?<=
            (?! (?<match> ab+c ) )
            (?=(?&lookback)) .
        |
            (?=(?&match)) .
        )
    ) )
    (?<target> x \d+ )
}msx;
my $re3_short = qr
    /(?=((?<=(?!( ab+c ))(?=(?-2)).|(?=(?-1)).))) (x\d+) /sx;

Things get a little bit shorter for single characters; this matches any \d. that is preceded by an a (anywhere in the string before it, as I said above - otherwise you could just use (?<=a)).

my $re4 = qr{
    (?= (?<lookback>
        (?<=
            a
        |
            (?=(?&lookback)) .
        )
    ) )
    (?<target> \d . )
}msx;
my $re4_short = qr
    /(?=((?<= a |(?=(?-1)).))) (\d.) /sx;
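In this single-character case the non-regex alternative is especially simple, because "preceded by an a anywhere" only depends on where the first a is. A minimal sketch with my own variable names, using the string from the test script:

```perl
use strict;
use warnings;

my $str = "x2 a4 x3a55aaa1";
my @results;
my $first_a = index($str, "a");   # -1 if there is no "a" at all
if ($first_a >= 0) {
    # Every \d. that starts after the first "a" has an "a" somewhere before it.
    @results = substr($str, $first_a + 1) =~ /\d./gs;
}
print join("|", @results), "\n";   # 4 |3a|55
```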

Probably someone can find a way to improve on this even more :-)

Here's the code I used to test the above regexen:

#!/usr/bin/env perl
use warnings;
use strict;
use Test::More;

# regexen here...

for my $regex ($re1,$re1_short) {
    unlike "foo",            $regex;
    unlike "x5",             $regex;
    unlike "ab5",            $regex;
    unlike "ab5 x4",         $regex;
    unlike "x5 ab5",         $regex;
      like "ab5 x5",         $regex;
      like "ab5x5",          $regex;
      like "ab51 x51",       $regex;
      like "ab51 ab4 x5 x4", $regex;
    my @results;
    while ("ab1 ab32 x2 ab42 x3 ab3 ab4x4ab5x1x42x45" =~ /$regex/g)
        { push @results, $+{target} // $1 }
    is_deeply \@results, ["x4","x1","x42"];
}
for my $regex ($re2,$re2_short) {
    unlike "fo",       $regex;
    unlike "x5",       $regex;
    unlike "ab5",      $regex;
    unlike "ab5 x4",   $regex;
      like "foo",      $regex;
      like "x5 ab5",   $regex;
      like "ab5 x5",   $regex;
      like "ab55",     $regex;
      like "ab51 x51", $regex;
    my @results;
    while ("abcdefbdfbb" =~ /$regex/g)
        { push @results, $+{char} // $1 }
    is_deeply \@results, ["b","d","f","b","b"];
}
for my $regex ($re3,$re3_short) {
    unlike "foo",            $regex;
    unlike "x5",             $regex;
    unlike "ab5",            $regex;
    unlike "x5 abc5",        $regex;
    unlike "ab x4",          $regex;
      like "abc x4",         $regex;
      like "abc x5",         $regex;
      like "abbbcx5",        $regex;
      like "abbc51 x51",     $regex;
      like "abc51 ab x5 x4", $regex;
    my @results;
    while ("x2 abbbc x4 abc5 x1 x42" =~ /$regex/g)
        { push @results, $+{target} // $3 }
    is_deeply \@results, ["x4","x1","x42"];
}
for my $regex ($re4,$re4_short) {
    unlike "fo",        $regex;
    unlike "x5",        $regex;
    unlike "5ab",       $regex;
    unlike "x5 ab5",    $regex;
      like "ab5 x4",    $regex;
      like "x5 ab5 x2", $regex;
    my @results;
    while ("x2 a4 x3a55aaa1" =~ /$regex/g)
        { push @results, $+{target} // $2 }
    is_deeply \@results, ["4 ","3a","55"];
}

done_testing;

Minor edits for clarification.

Re: Variable-Width Lookbehind (hacked via recursion) by vr (Curate) on Oct 25, 2017 at 18:39 UTC
Nice trick :-). Perhaps looking ahead only to immediately look back isn't necessary? (Same for other regexen, + all tests are OK):

my $re1 = qr{
    (?<target> x (?<digits> \d+ ) (?!\d) )
    (?<lookback>
        (?<=
            (?! (?<match> ab \g{digits} (?!\d) ) ) .
            (?=(?&lookback)) .
        |
            (?=(?&match)) . .
        )
    )
}msx;
Re^2: Variable-Width Lookbehind (hacked via recursion) by haukex (Archbishop) on Oct 26, 2017 at 05:04 UTC
Excellent point, thank you for spotting that! I can confirm that the `(?= )` around `(?<lookback> )` can be removed in all cases in the root node (since `(?<= )` is already zero-width). Makes the regexes even shorter! :-)

It's probably a vestige from the negative case like here or in the following, where the `(?! (?<lookback> ... ) )` is needed.

# Match any /\d./ that is not* preceded by an /a/
my $re5 = qr{
    (?! (?<lookback>
        (?<=
            a
        |
            (?=(?&lookback)) .
        )
    ) )
    (?<target> \d . )
}msx;
my $re5_short = qr
    /(?!((?<= a |(?=(?-1)).))) (\d.) /sx;
for my $regex ($re5,$re5_short) {
    unlike "fo",        $regex;
    unlike "x5",        $regex;
    unlike "ab5 x4",    $regex;
      like "5ab",       $regex;
      like "x5 ab5",    $regex;
      like "x5 ab5 x2", $regex;
    my @results;
    while ("x2 4x3a55aaa1" =~ /$regex/g)
        { push @results, $+{target} // $2 }
    is_deeply \@results, ["2 ","4x","3a"];
}

* Update: Hmm, actually, it turns out this seems to work too... (although putting the exact explanation of why into words is eluding me at the moment...)

my $re5 = qr{
    (?<lookback>
        (?<!
            a
        |
            (?!(?&lookback)) .
        )
    )
    (?<target> \d . )
}msx;
my $re5_short = qr
    /((?<! a |(?!(?-1)).)) (\d.) /sx;
Re^3: Variable-Width Lookbehind (hacked via recursion) by haukex (Archbishop) on Oct 30, 2017 at 17:20 UTC
"putting the explanation of why into words..."

So the two key things to note are:

- The pattern `(?<!X)` (for any character `X`) matches at the beginning of the string (because there is no preceding character), and
- the double negation of `(?<! (?! ) )` means that whatever the inner call to `(?&lookback)` returns (match/no match) is what the outer `(?<lookback> )` will return. So what the last, innermost (furthest left) `lookback` returns is what the whole, outermost `lookback` will return.

So for the regex in question it boils down to two cases:

- If there is no preceding "`a`", then the regex will recurse all the way to the beginning of the string, where `lookback` will match.
- If there is a preceding "`a`", then `(?<!a)` will cause the match to fail.

Minor edit for clarification.
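The first of these points is easy to check on its own with an ordinary fixed-width lookbehind. A small sketch, with strings I made up for the purpose:

```perl
use strict;
use warnings;

# A negative lookbehind succeeds at offset 0: nothing precedes it.
my $at_start = "abc" =~ /^(?<!a)/ ? 1 : 0;

# Directly after an "a", the same assertion fails.
my $after_a = "ab" =~ /(?<!a)b/ ? 1 : 0;

# After some other character it succeeds again.
my $after_x = "xb" =~ /(?<!a)b/ ? 1 : 0;

print "at start: $at_start, after a: $after_a, after x: $after_x\n";
# at start: 1, after a: 0, after x: 1
```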

