
Lately I've been experimenting again with using Perl regexes more like grammars, i.e. parsing inputs via a single big regex that involves lots of branching, instead of the traditional approach of parsing inputs via imperative "spaghetti code" that sequentially matches lots of small regexes.

However, I quickly ran into two limitations relating to regex quantifiers (* + {}). Here's a write-up of the solutions/workarounds I found, both for my own benefit (so I can refer back to them), and in case others might find it interesting.

Also, I'd love to hear the opinions of other monks on which of these techniques should be used in real code, and whether it would be worth adding new Perl 5 core features to make them obsolete.

TOC:

A) Variable quantifier counts
B) Preserving capture results from all repetitions

Note: I used dd for dumping data structures, so imagine a use Data::Dump; in front of every Perl code listing below.

A) Variable quantifier counts

In the regex snippet .{4}, the token . is quantified with count 4 – i.e. it has to match four times in a row.
Unfortunately, the count has to be specified as a concrete number known at regex compile time - it cannot be a variable or expression that is read each time that part of the regex is entered. But sometimes, such a dynamic count is in fact what you need.
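To make the limitation concrete, here is a minimal sketch (the variable names are mine, not from any library). A count held in a variable does work, but only because Perl interpolates it into the pattern text before the regex is compiled. So it cannot depend on something captured earlier in the same match.

```perl
use strict;
use warnings;

# $n is interpolated into the pattern text first, so the regex engine
# only ever sees the constant quantifier .{4}
my $n = 4;
my ($fixed) = "abcdefgh" =~ / ( .{$n} ) /x;

print "$fixed\n";   # abcd
```

The sections below are about the other case, where the count is only known once part of the input has already been matched.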

1. Using multiple chained match operations

The traditional solution would be to split it up into two regexes, using the /g modifier on the first regex to make sure the end position of its match is stored, and the \G assertion in the second regex to make sure it continues matching from that position:

$_ = "04abcdefgh";
my $result;

if (/ (\d\d) /gx) {
    $result .= $&;
    my $count = 0 + $1;
    
    if (/ \G .{$count} /x) { $result .= $&; }
    else { $result = undef }
}

dd $result;  #-> "04abcd"

Note that we need to manually coerce the count to a plain integer, because .{04} is invalid - it has to be .{4} (at least in Perl 5.22).

This approach is very flexible and battle-tested, but has a number of disadvantages. For one thing, it requires us to manually assemble the result string. Also, if this is supposed to be part of a larger and/or re-usable regex, having to break it up into imperative code with multiple regexes like this can be quite inconvenient - dreadfully so if backtracking is involved.
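If the chained approach is used anyway, some of the bookkeeping can be hidden in a small subroutine. The following is only a sketch: parse_counted_field is a hypothetical helper name, and it assumes the field starts at the beginning of the input. It uses pos() and substr instead of concatenating $& by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the length-prefixed field at the start of
# $input, or undef if there is no two-digit prefix or the input is too short.
sub parse_counted_field {
    my ($input) = @_;

    # /g records the end of the match in pos($input); /c keeps it on failure
    $input =~ / \A (\d\d) /gcx or return;
    my $count = 0 + $1;                        # "04" -> 4

    # \G continues exactly where the prefix ended
    $input =~ / \G .{$count} /gcx or return;

    # everything consumed so far, prefix included
    return substr($input, 0, pos($input));
}

print parse_counted_field("04abcdefgh"), "\n";   # 04abcd
```

This still cannot be embedded in a larger regex, which is the real limitation of this approach.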

2. Using a dynamically compiled subpattern

An alternative approach is to exploit the fact that when a (??{ }) block is encountered in a regex, its contents are executed as Perl code and whatever string it returns is compiled as a regex and matched against in-place:

"04abcdefgh" =~ / (\d\d)
                  (??{ ".{".(0 + $^N)."}" }) /x;

dd $&;  #-> "04abcd"

$^N ("last capture group") is used instead of $1 to make it more generic - i.e. if it's part of a larger regex and more capture groups are added at the beginning, we won't have to re-number.

This solution has some disadvantages as well though:

We can't put only the quantifier itself inside the (??{" "}) – the whole part that is being quantified has to be included as well, because dynamically compiled fragments have to be complete valid subpatterns. In this example it's just . so that doesn't matter much, but in more complex cases that can be quite unwieldy.
It's inefficient if that portion of the regex is reached multiple times, because the subpattern needs to be recompiled each time.
We still need to manually coerce the count to a plain integer.
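The recompilation cost might be reduced by returning a precompiled qr// object from the (??{ }) block instead of a string, and caching one object per count. This is a sketch under that assumption. The counted_dot helper, the %subpattern_for cache and the $compilations counter are all names I made up for illustration.

```perl
use strict;
use warnings;

my %subpattern_for;      # hypothetical cache: count => compiled qr// object
my $compilations = 0;    # only here to make the caching visible

sub counted_dot {
    my ($count) = @_;
    return $subpattern_for{$count} //= do {
        $compilations++;
        qr/.{$count}/;
    };
}

my @matched;
for my $input ("04abcdefgh", "04ijklmnop", "02qrstuvwx") {
    # the helper is called with the numified last capture, as before
    push @matched, $& if $input =~ / (\d\d) (??{ counted_dot(0 + $^N) }) /x;
}

print "@matched\n";                       # 04abcd 04ijkl 02qr
print "compiled $compilations times\n";   # compiled 2 times
```

The quantified part still has to live inside the dynamically generated subpattern, so the first disadvantage above remains.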

3. Using conditional subpattern recursion

A third approach is to combine the following four advanced regex features...

(?> PATTERN): define a pattern that may not be backtracked into
(?{ EXPR }): assign the result of EXPR to the special variable $^R
(?(?{ EXPR }) PATTERN): match against PATTERN only if EXPR returns true
(?-1): recurse to the last opened capture group

...like so:

"04abcdefgh" =~ / (\d\d)
                  (?>
                    (?{ $^N })                # initialize $^R to $1
                    ( .
                      (?(?{ --$^R }) (?-1))   # recurse while --$^R is non-zero
                    )
                  ) /x;

dd $&;  #-> "04abcd"

Since the $^R variable gets appropriately localized during regex execution, this should work fine in regexes that do backtracking, but please test it thoroughly for your particular use-case before relying on that.

The only disadvantage I see compared to the previous approach is increased verbosity.

4. In Perl 6 and hypothetical future Perl 5

What would the ideal solution look like?

In Perl 6, you can simply put a code block where the quantifier count is expected (note that the quantifier syntax has been changed from .{4} to . ** 4):

"04abcdefgh" ~~ / (\d\d) . ** { $0 } /;

"04abcdefgh" ~~ / (\d\d) . ** { $/[*-1] } /;

This feature could conceivably be added to Perl 5 as well, where it would look like this:

"04abcdefgh" =~ / (\d\d) .{ (?{ $1 }) } /x;

"04abcdefgh" =~ / (\d\d) .{ (?{ $^N }) } /x;

There's precedent for allowing (?{ }) code blocks in special places in Perl 5 regexes: They can be used as the (condition) of a (?(condition)yes-pattern|no-pattern) conditional (like the one used in section A.3 above).

B) Preserving capture results from all repetitions

Consider the following regex match, where the second and third capture group are inside a quantified group:

":aa2bb4cc6dd8" =~ / (:)
                     (?: (\w\w) (\d) )* /x;

dd $&;  #-> ":aa2bb4cc6dd8"
dd $1;  #-> ":"
dd $2;  #-> "dd"
dd $3;  #-> 8

As you can see, $2 only contains the last value matched by the second capture group - the "aa", "bb", "cc" values captured during prior iterations of the quantifier are lost. Ditto for $3.

What if we need all of the captured values though?

1. Using multiple chained match operations

The traditional solution combines /g regexes with manual loop logic:

$_ = ":aa2bb4cc6dd8";
my @result;

if (/ : /gx) {
    $result[0] .= $&;
    
    while (/ \G (\w\w) (\d) /gx) {
        $result[0] .= $&;
        push @{$result[1]}, $1;
        push @{$result[2]}, $2;
    }
}

dd $result[0];  #-> ":aa2bb4cc6dd8"
dd $result[1];  #-> ["aa", "bb", "cc", "dd"]
dd $result[2];  #-> [2, 4, 6, 8]

The disadvantages are the same as those listed in section A.1.
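A two-pass variant of the chained approach can keep the overall shape check in a single regex. Note that this is a different technique from the listing above, and the variable names are mine. The first match captures the whole repeated part as one string, and a second //g loop then collects the individual captures from it.

```perl
use strict;
use warnings;

my $input = ":aa2bb4cc6dd8";
my (@words, @digits);

# Pass 1: check the overall shape and capture the repeated part whole
if ($input =~ / \A : ( (?: \w\w \d )* ) /x) {
    my $repeated = $1;

    # Pass 2: walk the repeated part with //g, keeping every capture
    while ($repeated =~ / \G (\w\w) (\d) /gx) {
        push @words,  $1;
        push @digits, $2;
    }
}

print "@words\n";    # aa bb cc dd
print "@digits\n";   # 2 4 6 8
```

The repeated subpattern has to be written twice, so this only pays off when the outer regex is the complicated part.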

2. Using embedded code to propagate results through $^R

An alternative approach is to use embedded (?{ code }) blocks to store the captured values in the special variable $^R:

":aa2bb4cc6dd8" =~ / (:)
                     (?{ [[], []] })   # initialize $^R
                     (?:
                         (\w\w) (\d)
                         
                         # add captures to $^R:
                         (?{ [[@{$^R->[0]}, $2], [@{$^R->[1]}, $3]] })
                     )*
                   /x;

dd $&;       #->  ":aa2bb4cc6dd8"
dd $1;       #->  ":"
dd $^R->[0]; #->  ["aa", "bb", "cc", "dd"]
dd $^R->[1]; #->  [2, 4, 6, 8]

Or if you want the results to be grouped by iteration rather than capture group:

":aa2bb4cc6dd8" =~ / (:)
                     (?{ [] })   # initialize $^R
                     (?:
                         (\w\w) (\d)
                         
                         # add captures to $^R:
                         (?{ [@{$^R}, [$2, $3]] })
                     )*
                   /x;

dd $&;   #->  ":aa2bb4cc6dd8"
dd $1;   #->  ":"
dd $^R;  #->  [["aa", 2], ["bb", 4], ["cc", 6], ["dd", 8]]

I'm not sure just how well this works together with backtracking. Test thoroughly before relying on that.

3. In Perl 6, CPAN, and hypothetical future Perl 5

In Perl 6, quantified captures cause array match results:

":aa2bb4cc6dd8" ~~ / (":")
                     [ (\w\w) (\d) ]* /;

dd $0.Str;   #-> ":"
dd $1».Str;  #-> ("aa", "bb", "cc", "dd")
dd $2».Int;  #-> (2, 4, 6, 8)

In Perl 5 CPAN land, Damian Conway's Regexp::Grammars also provides a mechanism to capture repeated subrules, but it requires you to express your regex in a special grammar form.

If direct support for multiple capture results is ever added to Perl 5 core, it would probably have to be opt-in via the re pragma. It might look like this:

use re 'multi_captures';

":aa2bb4cc6dd8" =~ / (:)
                     (?: (\w\w) (\d) )* /x;

dd $&;  #-> ":aa2bb4cc6dd8"
dd $1;  #-> ":"
dd $2;  #-> ["aa", "bb", "cc", "dd"]
dd $3;  #-> [2, 4, 6, 8]

Alternatively, it could be implemented as a regex modifier (the letters b f h j k q t v w y z are still up for grabs), or as a special capture group syntax such as (?@ PATTERN ).

What do you think?

In reply to Advanced techniques with regex quantifiers by smls
