Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Variable-Width Lookbehind (hacked via recursion)

by haukex (Archbishop)
on Oct 24, 2017 at 17:43 UTC ( [id://1201976]=perlmeditation: print w/replies, xml ) Need Help??

Warning: Since this uses recursion it is horribly inefficient and may easily blow up on longer strings. If you think you need this for variable-width lookbehind, then first think about how you might solve this with other techniques like lookahead, which is variable-width out of the box, or simply with multiple regular expressions. /Warning The following is presented as a curiosity as the result of the discussion here - thank you LanX and QM for providing the inspiration :-)

Zero-width Lookaround Assertions are incredibly useful, but unfortunately the lookbehind assertions (?<=pattern) and (?<!pattern) are restricted to fixed length lookbehinds, and sometimes you just really want to be able to say something like e.g. (?<=ab+.*)c. With the following technique, you can emulate these kinds of variable-width lookbehind assertions.

The basic principle is to use repeated, recursive lookbehind operations, each of which is fixed-width (either one or two characters, depending on the situation), thus "looping" backwards through the string character by character, and using zero-width lookahead assertions at each position, checking whether a match occurs or not.

The examples are a bit contrived, and are just intended to demonstrate this idea. One thing to notice is that, although you can match multiple targets in a single string via /g, there are extra capture groups that you need to ignore (I'd suggest using the long forms and named capture groups to avoid confusion). Another thing to note is that the following examples scan backwards all the way to the beginning of the string (or until a match is found), without anchoring the "recursive lookbehind" at the current match position. <update> Also, I just tested this across different versions of Perl, and it (currently) only works correctly on Perl v5.20 and above. </update>

This matches the pattern x\d+, but only when it is preceded by the pattern ab\d+ with the same \d+ part (hence the \d+(?!\d)).

my $re1 = qr{ (?<target> x (?<digits> \d+ ) (?!\d) ) (?= (?<lookback> (?<= (?! (?<match> ab \g{digits} (?!\d) ) ) . (?=(?&lookback)) . | (?=(?&match)) . . ) ) ) }msx; my $re1_short = qr / (x(\d+)(?!\d)) (?=((?<=(?!( ab\2(?!\d) )).(?=(?-2)). |(?=(?-1))..)))/sx;

This matches any duplicated characters, so the inverse of the original problem here.

my $re2 = qr{ (?<char> . ) (?= (?<lookback> (?<= (?! \g{char} ) . (?=(?&lookback)) . | (?= \g{char} ) . . ) ) ) }msx; my $re2_short = qr /(.)(?=((?<=(?!\1).(?=(?-1)).|(?=\1)..)))/sx;

If the string you're looking for doesn't depend on the match, you can write it in the following order; the following matches any x\d+ that is preceded by ab+c.

my $re3 = qr{ (?= (?<lookback> (?<= (?! (?<match> ab+c ) ) (?=(?&lookback)) . | (?=(?&match)) . ) ) ) (?<target> x \d+ ) }msx; my $re3_short = qr /(?=((?<=(?!( ab+c ))(?=(?-2)).|(?=(?-1)).))) (x\d+) /sx;

Things get a little bit shorter for single characters; this matches any \d. that is preceded by an a (anywhere in the string before it, as I said above - otherwise you could just use (?<=a)).

my $re4 = qr{ (?= (?<lookback> (?<= a | (?=(?&lookback)) . ) ) ) (?<target> \d . ) }msx; my $re4_short = qr /(?=((?<= a |(?=(?-1)).))) (\d.) /sx;

Probably someone can find a way to improve on this even more :-)

Here's the code I used to test the above regexen:

#!/usr/bin/env perl use warnings; use strict; use Test::More; # regexen here... for my $regex ($re1,$re1_short) { unlike "foo", $regex; unlike "x5", $regex; unlike "ab5", $regex; unlike "ab5 x4", $regex; unlike "x5 ab5", $regex; like "ab5 x5", $regex; like "ab5x5", $regex; like "ab51 x51", $regex; like "ab51 ab4 x5 x4", $regex; my @results; while ("ab1 ab32 x2 ab42 x3 ab3 ab4x4ab5x1x42x45" =~ /$regex/g) { push @results, $+{target} // $1 } is_deeply \@results, ["x4","x1","x42"]; } for my $regex ($re2,$re2_short) { unlike "fo", $regex; unlike "x5", $regex; unlike "ab5", $regex; unlike "ab5 x4", $regex; like "foo", $regex; like "x5 ab5", $regex; like "ab5 x5", $regex; like "ab55", $regex; like "ab51 x51", $regex; my @results; while ("abcdefbdfbb" =~ /$regex/g) { push @results, $+{char} // $1 } is_deeply \@results, ["b","d","f","b","b"]; } for my $regex ($re3,$re3_short) { unlike "foo", $regex; unlike "x5", $regex; unlike "ab5", $regex; unlike "x5 abc5", $regex; unlike "ab x4", $regex; like "abc x4", $regex; like "abc x5", $regex; like "abbbcx5", $regex; like "abbc51 x51", $regex; like "abc51 ab x5 x4", $regex; my @results; while ("x2 abbbc x4 abc5 x1 x42" =~ /$regex/g) { push @results, $+{target} // $3 } is_deeply \@results, ["x4","x1","x42"]; } for my $regex ($re4,$re4_short) { unlike "fo", $regex; unlike "x5", $regex; unlike "5ab", $regex; unlike "x5 ab5", $regex; like "ab5 x4", $regex; like "x5 ab5 x2", $regex; my @results; while ("x2 a4 x3a55aaa1" =~ /$regex/g) { push @results, $+{target} // $2 } is_deeply \@results, ["4 ","3a","55"]; } done_testing;

Minor edits for clarification.

Replies are listed 'Best First'.
Re: Variable-Width Lookbehind (hacked via recursion)
by vr (Curate) on Oct 25, 2017 at 18:39 UTC

    Nice trick :-). Perhaps looking ahead only to immediately look back isn't necessary? (Same for other regexen, + all tests are OK):

    my $re1 = qr{ (?<target> x (?<digits> \d+ ) (?!\d) ) (?<lookback> (?<= (?! (?<match> ab \g{digits} (?!\d) ) ) . (?=(?&lookback)) . | (?=(?&match)) . . ) ) }msx;

      Excellent point, thank you for spotting that! I can confirm that the (?= ) around (?<lookback> ) can be removed in all cases in the root node (since (?<= ) is already zero-width). Makes the regexes even shorter! :-)

      It's probably a vestige from the negative case like here or in the following, where the (?! (?<lookback> ... ) ) is needed*.

      # Match any /\d./ that is *not* preceded by an /a/ my $re5 = qr{ (?! (?<lookback> (?<= a | (?=(?&lookback)) . ) ) ) (?<target> \d . ) }msx; my $re5_short = qr /(?!((?<= a |(?=(?-1)).))) (\d.) /sx; for my $regex ($re5,$re5_short) { unlike "fo", $regex; unlike "x5", $regex; unlike "ab5 x4", $regex; like "5ab", $regex; like "x5 ab5", $regex; like "x5 ab5 x2", $regex; my @results; while ("x2 4x3a55aaa1" =~ /$regex/g) { push @results, $+{target} // $2 } is_deeply \@results, ["2 ","4x","3a"]; }

      * Update: Hmm, actually, it turns out this seems to work too... (although putting the exact explanation of why into words is eluding me at the moment...)

      my $re5 = qr{ (?<lookback> (?<! a | (?!(?&lookback)) . ) ) (?<target> \d . ) }msx; my $re5_short = qr /((?<! a |(?!(?-1)).)) (\d.) /sx;
        putting the explanation of why into words...

        So the two key things to note are:

        • The pattern (?<!X) (for any character X) matches at the beginning of the string (because there is no preceding character), and
        • the double negation of (?<! (?! ) ) means that whatever the inner call to (?&lookback) returns (match/no match) is what the outer (?<lookback> ) will return. So what the last, innermost (furthest left) lookback returns is what the whole, outermost lookback will return.

        So for the regex in question it boils down to two cases:

        • If there is no preceding "a", then the regex will recurse all the way to the beginning of the string, where lookback will match.
        • If there is a preceding "a", then (?<!a) will cause the match to fail.

        Minor edit for clarification.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://1201976]
Approved by kcott
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (4)
As of 2024-04-19 03:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found