Advanced techniques with regex quantifiers

Lately I've been experimenting again with using Perl regexes more like grammars, i.e. parsing inputs via a single big regex that involves lots of branching, instead of the traditional approach of parsing inputs via imperative "spaghetti code" that sequentially matches lots of small regexes.

However, I quickly ran into two limitations relating to regex quantifiers (* + {}). Here's a write-up of the solutions/workarounds I found, both for my own benefit (so I can refer back to them), and in case others might find it interesting.

Also, I'd love to hear the opinions of other monks on which of these techniques should be used in real code, and if it would be worth adding new Perl 5 core featues to make them obsolete.

TOC:

A) Variable quantifier counts
B) Preserving capture results from all repetitions

Note: I used dd for dumping data structures, so imagine a use Data::Dump; in front of every Perl code listing below.

A) Variable quantifier counts

In the regex snippet .{4}, the token . is quantified with count 4 – i.e. it has to match four times in a row.
Unfortunately, the count has to be specified as a concrete number known at regex compile time - it cannot be a variable or expression that is read each time that part of the regex is entered. But sometimes, such a dynamic count is in fact what you need.

1. Using multiple chained match operations

The traditional solution would be to split it up into two regexes, using the /g modifier on the first regex to make sure the end position of its match is stored, and the \G assertion in the second regex to make sure it continues matching from that position:

$_ = "04abcdefgh";
my $result;

if (/ (\d\d) /gx) {
    $result .= $&;
    my $count = 0 + $1;
    
    if (/ \G .{$count} /x) { $result .= $&; }
    else { $result = undef }
}

dd $result;  #-> "04abcd"
[download]

Note that we need to manually coerce the count to a plain integer, because .{04} is invalid - it has to be .{4} (at least in Perl 5.22).

This approach is very flexible and battle-tested, but has a number of disdvantages. For one thing, it requires us to manually assemble the result string. Also, if this is supposed to be part of a larger and/or re-usable regex, having to break it up into imperative code with multiple regexes like this can be quite inconvenient - dreadfully so if backtracking is involved.

2. Using a dynamically compiled subpattern

An alternative approach is to exploit the fact that when a (??{ }) block is encountered in a regex, its contents are executed as Perl code and whatever string it returns is compiled as a regex and matched against in-place:

"04abcdefgh" =~ / (\d\d)
                  (??{ ". {".(0 + $^N)."}" }) /x;

dd $&;  #-> "04abcd"
[download]

$^N ("last capture group") is used instead of $1 to make it more generic - i.e. if it's part of a larger regex and more capture groups are added at the beginning, we won't have to re-number.

This solution has some disadvantages as well though:

We can't put only the quantifier itself inside the (??{" "}) – the whole part that is being quantified has to be included as well, because dynamically compiled fragments have to be complete valid subpatterns. In this example it's just . so that doesn't matter much, but in more complex cases that can be quite unwieldy.
It's inefficient if that portion of the regex is reached multiple times, because the subpattern needs to be recompiled each time.
We still need to manually coerce the count to a plain integer.

3. Using conditional subpattern recursion

A third approach is to combine the following four advanced regex features...

(?> PATTERN)— define a pattern that may not be backtracked into
(?{ EXPR })— assign the result of EXPR to the special variable $^R
(?(?{ EXPR }) PATTERN)— match against PATTERN only if EXPR returns true
(?-1)— recurse to the last opened capture group

...like so:

"04abcdefgh" =~ / (\d\d)
                  (?>
                    (?{ $^N })                # initialize $^R to $1
                    ( .
                      (?(?{ --$^R }) (?-1))   # recurse if --$^R > 0
                    )
                  ) /x;

dd $&;  #-> "04abcd"
[download]

Since the $^R variable gets appropriately localized during regex execution, this should work fine in regexes that do backtracking, but please test it thoroughly for your particular use-case before relying on that.

The only disadvantage I see compared to the previous approach, is increased verbosity.

4. In Perl 6 and hypothetical future Perl 5

What would the ideal solution look like?

In Perl 6, you can simply put a code block where the quantifier count is expected (note that the quantifier syntax has been changed from .{4} to . ** 4):

"04abcdefgh" ~~ / (\d\d) . ** { $0 } /;
[download]

"04abcdefgh" ~~ / (\d\d) . ** { $/[*-1] } /;
[download]

This feature could conceivably be added to Perl 5 as well, where it would look like this:

"04abcdefgh" =~ / (\d\d) .{ (?{ $1 }) } /x;
[download]

"04abcdefgh" =~ / (\d\d) .{ (?{ $^N }) } /x;
[download]

There's precedent for allowing (?{ }) code blocks in special places in Perl 5 regexes: They can be used as the (condition) of a (?(condition)yes-pattern|no-pattern) conditional (like the one used in section A.3 above).

B) Preserving capture results from all repetitions

Consider the following regex match, where the second and third capture group are inside a quantified group:

":aa2bb4cc6dd8" =~ / (:)
                     (?: (\w\w) (\d) )* /x;

dd $&;  #-> ":aa2bb4cc6dd8"
dd $1;  #-> ":"
dd $2;  #-> "dd"
dd $3;  #-> 8
[download]

As you can see, $2 only contains the last value matched by the second capture group - the "aa", "bb", "cc" values that were captured during prior iterations of the quantifier, are lost. Ditto for $3.

What if we need all of the captured values though?

1. Using multiple chained match operations

The traditional solution combines /g regexes with manual loop logic:

$_ = ":aa2bb4cc6dd8";
my @result;

if (/ : /gx) {
    $result[0] .= $&;
    
    while (/ \G (\w\w) (\d) /gx) {
        $result[0] .= $&;
        push @{$result[1]}, $1;
        push @{$result[2]}, $2;
    }
}

dd $result[0];  #-> ":aa2bb4cc6dd8"
dd $result[1];  #-> ["aa", "bb", "cc", "dd"]
dd $result[2];  #-> [2, 4, 6, 8]
[download]

The disadvantages are the same as those listed in section A.1.

2. Using embedded code to propagate results through $^R

An alternative approach is to use embedded (?{ code }) blocks to store the captured values in the special variable $^R:

":aa2bb4cc6dd8" =~ / (:)
                     (?{ [[], []] })   # initialize $^R
                     (?:
                         (\w\w) (\d)
                         
                         # add captures to $^R:
                         (?{ [[@{$^R->[0]}, $2], [@{$^R->[1]}, $3]] })
                     )*
                   /x;

dd $&;       #->  ":aa2bb4cc6dd8"
dd $1;       #->  ":"
dd $^R->[0]; #->  ["aa", "bb", "cc", "dd"]
dd $^R->[1]; #->  [2, 4, 6, 8] }
[download]

Or if you want the results to be grouped by iteration rather than capture group:

":aa2bb4cc6dd8" =~ / (:)
                     (?{ [] })   # initialize $^R
                     (?:
                         (\w\w) (\d)
                         
                         # add captures to $^R:
                         (?{ [@{$^R}, [$2, $3]] })
                     )*
                   /x;

dd $&;   #->  ":aa2bb4cc6dd8"
dd $1;   #->  ":"
dd $^R;  #->  [["aa", 2], ["bb", 4], ["cc", 6], ["dd", 8]]
[download]

I'm not sure just how well this works together with backtracking. Test thoroughly before relying on that.

3. In Perl 6, CPAN, and hypothetical future Perl 5

In Perl 6, quantified captures cause array match results:

":aa2bb4cc6dd8" ~~ / (":")
                     [ (\w\w) (\d) ]* /;

dd $0.Str;   #-> ":"
dd $1ť.Str;  #-> ("aa", "bb", "cc", "dd")
dd $2ť.Int;  #-> (2, 4, 6, 8)
[download]

In Perl 5 CPAN land, Damian Conway's Regexp::Grammars also provides a mechanism to capture repeated subrules, but it requires you to express your regex in a special grammar form.

If direct support for multiple capture results is ever added to Perl 5 core, it would probably have to be opt-in via the re pragma. It might look like this:

use re 'multi_captures';

":aa2bb4cc6dd8" =~ / (:)
                     (?: (\w\w) (\d) )* /x;

dd $&;  #-> ":aa2bb4cc6dd8"
dd $1;  #-> ":"
dd $2;  #-> ["aa", "bb", "cc", "dd"]
dd $3;  #-> [2, 4, 6, 8]
[download]

Alternatively, it could be implemented as a regex modifier (the letters b f h j k q t v w y z are still up for grabs), or as a special capture group syntax such as (?@ PATTERN ).

What do you think?

Comment on Advanced techniques with regex quantifiers Select or Download Code

Replies are listed 'Best First'.
Re: Advanced techniques with regex quantifiers by AnomalousMonk (Archbishop) on Jul 19, 2015 at 14:54 UTC
... I've been experimenting again with using Perl regexes more like grammars ... My kneejerk reaction (and, I suspect, that of many others) on reading something like this is, "Why not just use a grammar?" At least in Perl 5, regexes can be quite useful when applied to limited grammar parsing tasks, but will quickly run out of steam and fall over (or else become monstrously unwieldy) when pushed beyond a certain limit. I suspect the situation may be different with Perl 6, but I must confess ignorance here. In any event, my impression is that the Perl 6 regex compiler/engine is so radically different from that of Perl 5 that feature back-ports can only be piecemeal and incremental. Further, It's easy to come up with limited approaches to deal with oddball problems, e.g., dynamically varying quantifier counts: `c:\@Work\Perl>perl -wMstrict -MData::Dump -le "my $s = '04abcdefgh06ABCDEFGH05tuvwxyz'; my @caps = map join('', m{ \A (\d\d) }xms, m{ ([[:alpha:]]{$1}) }xms), $s =~ m{ \G \d\d [[:alpha:]]+ }xmsg ; dd @caps; " ("04abcd", "06ABCDEF", "05tuvwx")` [download] Something like this can take care of an immediate problem, but obviously isn't going to integrate well into an even moderately large parser application — and then we're back to the "get a real parser" stance. Please don't take these remarks as an attack on your interest in the problem. I share and applaud that interest and will upvote the OP as soon as I finish posting this. I love regexes too, but we must learn to recognize the limitations even of those we love most. Give a man a fish: `<%-(-(-(-<`	[reply] [d/l] [select]
Re^2: Advanced techniques with regex quantifiers [example use-case] by smls (Friar) on Jul 19, 2015 at 23:04 UTC
Good points. It's just that when balancing elegance & performance & dependencies, in some cases I end up favouring the "one big regex" approach - assuming I can work around its technical limitations. For example, in one current use-case I want to search a large input stream for occurrences of any of several "things", and extract relevant information from each match. This is not exactly a "parser" use-case (since I don't care about most of the input stream, and don't want to build a unified AST from it) but it shares some similarities. In my prototype design (which seems to be working out fine so far), I do it by defining an array entry for each "thing" like so: `my @things = ( { parse => qr/ regex0 /sx, action => sub { ... } }, { parse => qr/ regex1 /sx, action => sub { ... } }, ... );` [download] And then when the program starts, it dynamically compiles all of those individual regexes into a big branching one of the form: `qr{ (?\| regex0 (MARK:0) \| regex1 (MARK:1) \| regex2 (MARK:2) \| ... ) }sx` [download] ...and starts matching this big regex against the input stream using a sliding window approach inspired by this old post by BrowserUk. And whenever a match is found, it calls the corresponding action function roughly like so: `$things[$main::REGMARK]{action}->();` [download] This approach is attractive to me because: Defining each thing's parsing and processing code separately but closely together, provides the best kind of encapsulation* IMO. It is efficient. Especially if each branch starts with a literal part (rather than a capture etc.) – apparently the regex compiler applies an optimization in that case which allows it to blaze through parts of the input stream with no matches. It is cleanly extensible - I can easily add more entries to `@things`. It has no non-core dependencies. Each branch/thing can have its own capture groups etc, and process them in its action function. However, if their parsing regexes need more complex quantified parts, then the advanced techniques I described above become necessary in order to be able to keep this design. If those techniques were built-in instead of requiring scary embedded code blocks, I think I would use this kind of approach more often.	[reply] [d/l] [select]
Re: Advanced techniques with regex quantifiers by AnomalousMonk (Archbishop) on Jul 19, 2015 at 12:57 UTC
A quick thought... B) Preserving capture results from all repetitions My favorite trick for this is: `c:\@Work\Perl>perl -wMstrict -MData::Dump -le "my @captures = grep defined, ':aa2bb4cc6dd8' =~ m{ \G (?<! \A) (\w\w) (\d) \| \A (:) }xmsg; dd \@captures; " [":", "aa", 2, "bb", 4, "cc", 6, "dd", 8]` [download] which even works with Perl version 5.8 and without scary `(?{ code })` blocks. Of course, output array organization is not the same as any of those shown in your examples, but this surely falls within the realm of "implementation detail"! Give a man a fish: `<%-(-(-(-<`	[reply] [d/l] [select]
Re: Advanced techniques with regex quantifiers by AnomalousMonk (Archbishop) on Jul 19, 2015 at 15:27 UTC
Note that we need to manually coerce the count to a plain integer, because `.{04}` is invalid - it has to be `.{4}`. I don't see this in either 5.8.9 or 5.14.4. I thought the '010' might look like octal, but apparently not (actually, this surprises me a bit). `c:\@Work\Perl>perl -wMstrict -MData::Dump -le "my $s = '004abcdefgh009ABCDEFGHIJKLM010nopqrstuvwxyz'; ;; my @captures = $s =~ m{ (\d\d\d) ((??{ qr{.{$^N}}xms })) }xmsg; dd @captures; " ("004", "abcd", "009", "ABCDEFGHI", "010", "nopqrstuvw")` [download] (The A.1 example code also doesn't seem to need coercion of `$1` to an integer.) So while it's reasonable to think it might be true, it apparently isn't. What led you to think it was? Give a man a fish: `<%-(-(-(-<`	[reply] [d/l] [select]
Re^2: Advanced techniques with regex quantifiers by smls (Friar) on Jul 19, 2015 at 21:52 UTC
For me your code fails with: `Invalid quantifier in {,} in regex; marked by <-- HERE in m/.{ <-- HER +E 004}/ at -e line 1.` [download] Similarly for my own example code if I leave out the integer coercion. This is with Perl 5.22. Is this a regression then? Update: I annotated the quoted sentence in the article with a link to this discussion.	[reply] [d/l]
Re^3: Advanced techniques with regex quantifiers by AnomalousMonk (Archbishop) on Jul 20, 2015 at 00:21 UTC
For me your code fails […] with Perl 5.22. Is this a regression then? So it would seem. Apparently I need to take a look at the deltas. Give a man a fish: `<%-(-(-(-<`	[reply] [d/l]
Re^4: Advanced techniques with regex quantifiers by Athanasius (Archbishop) on Jul 20, 2015 at 04:16 UTC
Re: Advanced techniques with regex quantifiers by Discipulus (Canon) on Jul 20, 2015 at 07:27 UTC
i think this is worth to posted in the Tutorials section. Normally meditations aimed to be tutorials have the `[RFC]` tag in the title. Then you msg to Pedagogues. Thanks for your interesting tecniques! L* There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.	[reply] [d/l] [select]
Re^2: Advanced techniques with regex quantifiers by smls (Friar) on Jul 20, 2015 at 09:52 UTC
What about the two "In Perl 6 and hypothetical future Perl 5" sections though, aren't they inappropriate for a tutorial? How should I handle them when converting the post to a tutorial?	[reply]
Re^3: Advanced techniques with regex quantifiers by Anonymous Monk on Jul 20, 2015 at 10:20 UTC
How should I handle them when converting the post to a tutorial? There is no "converting" :) leave original intact and Start a new node (reply ok) with "RFC: Advanced techniques with regex quantifiers" in title Make it into a good tutorial (clear purpose, clear example, clear caveats , clearly digestible ... ) Advertise in chatterbox, get commentary on it, improve/revise If you've got positive response, post (minus RFC in title) under Tutorials and wait :) See How does editing work in the Tutorials section? Compare to these examples both short Common Regex Gotchas and long Parsing with Regexes and Beyond Also avoid `<readmore>` tags in tutorials, its visual interference :)	[reply] [d/l]
Re: Advanced techniques with regex quantifiers by uhClem (Scribe) on Jul 22, 2015 at 14:09 UTC
This is the very issue I was trying to get at recently with my initiatory node validate variable-length lines in one regex?, especially part B. Approach 1 (A or B) is what I was trying to avoid (and, in the end, have had to resort to anyway as this project lacks the time for me to work up any cleverness against it); approach 2 is what I thought might exist and I'm very glad for this exposition; approach 3A seems very excellent and it'll be many projects before I'll have the fluency to attempt anything of the kind. What my problem had gotten me curious about though was the deeper and much less useful question of whether, in principle, "pure" regex can handle these situations -- whatever "pure" might mean. Solutions 2A and 2B are disqualified because including perl code (or I think any sort of eval?) injects an additional rules-world, patching the incompleteness. I lack the theory to figure out whether 3A's Advanced Features Regex counts as what I'm talking about. I scarcely have the theory to know what I'm talking about. Everyone else might not care about this; it just brings out the sophomore in me, is all. Seems as though I should look into that Regexp::Grammars. Anyway, thanks very much, smls; I like it.	[reply]