Lately I've been experimenting again with using Perl regexes more like grammars, i.e. parsing inputs via a single big regex that involves lots of branching, instead of the traditional approach of parsing inputs via imperative "spaghetti code" that sequentially matches lots of small regexes.
However, I quickly ran into two limitations relating to regex quantifiers (* + {}). Here's a write-up of the solutions/workarounds I found, both for my own benefit (so I can refer back to them), and in case others might find it interesting.
Also, I'd love to hear the opinions of other monks on which of these techniques should be used in real code, and whether it would be worth adding new Perl 5 core features to make them obsolete.
Note: I used dd for dumping data structures, so imagine a use Data::Dump; in front of every Perl code listing below.
In the regex snippet .{4}, the token . is quantified with count 4 – i.e. it has to match four times in a row.
Unfortunately, the count has to be specified as a concrete number known at regex compile time - it cannot be a variable or expression that is read each time that part of the regex is entered. But sometimes, such a dynamic count is in fact what you need.
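To make the limitation concrete, here's a minimal sketch (the variable names are just for illustration). A count held in a variable works as long as its value is known before matching starts, because it simply gets interpolated into the pattern text. What it can't do is depend on something that was matched earlier in the same regex.

```perl
use strict;
use warnings;

# Count known before matching starts: it is interpolated into the
# pattern text, so the regex engine just sees .{4}
my $n = 4;
my ($fixed) = "abcdefgh" =~ / ^ ( .{$n} ) /x;
print "$fixed\n";    # abcd

# Count only known during the match: the "04" prefix says how many
# characters should follow, but a plain quantifier cannot read it back,
# so the second group just grabs everything that is left.
my ($prefix, $rest) = "04abcdefgh" =~ / ^ (\d\d) (.*) /x;
print "prefix=$prefix rest=$rest\n";    # prefix=04 rest=abcdefgh
```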
The traditional solution would be to split it up into two regexes, using the /g modifier on the first regex to make sure the end position of its match is stored, and the \G assertion in the second regex to make sure it continues matching from that position:
$_ = "04abcdefgh";
my $result;
if (/ (\d\d) /gx) {
    $result .= $&;
    my $count = 0 + $1;
    if (/ \G .{$count} /x) { $result .= $&; }
    else                   { $result = undef }
}
dd $result;  #-> "04abcd"
Note that we need to manually coerce the count to a plain integer, because .{04} is invalid - it has to be .{4} (at least in Perl 5.22).
This approach is very flexible and battle-tested, but has a number of disadvantages. For one thing, it requires us to manually assemble the result string. Also, if this is supposed to be part of a larger and/or re-usable regex, having to break it up into imperative code with multiple regexes like this can be quite inconvenient - dreadfully so if backtracking is involved.
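For what it's worth, the two-regex approach can at least be tucked away in a sub. The following is only a sketch: parse_counted is a made-up helper name, and it uses substr with pos() instead of concatenating $& by hand, which is my own variation rather than part of the listing above.

```perl
use strict;
use warnings;

# parse_counted is a hypothetical helper, not an existing API.
# Returns the matched text (length prefix plus payload), or undef on failure.
sub parse_counted {
    my ($input) = @_;

    # First regex: read the two-digit count. /g records the end position
    # in pos($input); /c keeps that position even if a later match fails.
    $input =~ / \G (\d\d) /gcx or return undef;
    my $count = 0 + $1;    # "04" -> 4

    # Second regex: continue from pos($input) and consume $count characters.
    $input =~ / \G .{$count} /gcx or return undef;

    return substr($input, 0, pos($input));
}

my $ok  = parse_counted("04abcdefgh");
my $bad = parse_counted("09abc");      # only 3 characters after the prefix
print "$ok\n";                                      # 04abcd
print defined($bad) ? "match\n" : "no match\n";     # no match
```

This hides the bookkeeping from the caller, but it doesn't help if the counted part has to live inside a larger regex.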
An alternative approach is to exploit the fact that when a (??{ }) block is encountered in a regex, its contents are executed as Perl code and whatever string it returns is compiled as a regex and matched against in-place:
"04abcdefgh" =~ / (\d\d) (??{ ". {".(0 + $^N)."}" }) /x; dd $&; #-> "04abcd"
$^N ("last capture group") is used instead of $1 to make it more generic - i.e. if it's part of a larger regex and more capture groups are added at the beginning, we won't have to re-number.
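To illustrate the point about $^N, here's a small sketch where an extra capture group has been added in front. The count is now in $2, but since $^N refers to whichever group closed most recently, the code block doesn't need to change. (The input string and the $matched variable are just made up for this example.)

```perl
use strict;
use warnings;

# An extra group (x) now comes first, so the count lives in $2.
# $^N still refers to the group that closed most recently, i.e. (\d\d).
my $matched;
if ("x04abcdefgh" =~ / (x) (\d\d) (??{ ".{" . (0 + $^N) . "}" }) /x) {
    $matched = $&;
}
print "$matched\n";    # x04abcd
```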
This solution has some disadvantages as well though:
A third approach is to combine the following four advanced regex features...
...like so:
"04abcdefgh" =~ / (\d\d) (?> (?{ $^N }) # initialize $^R to $1 ( . (?(?{ --$^R }) (?-1)) # recurse if --$^R > 0 ) ) /x; dd $&; #-> "04abcd"
Since the $^R variable gets appropriately localized during regex execution, this should work fine in regexes that do backtracking, but please test it thoroughly for your particular use-case before relying on that.
The only disadvantage I see compared to the previous approach is increased verbosity.
What would the ideal solution look like?
In Perl 6, you can simply put a code block where the quantifier count is expected (note that the quantifier syntax has been changed from .{4} to . ** 4):
"04abcdefgh" ~~ / (\d\d) . ** { $0 } /;
"04abcdefgh" ~~ / (\d\d) . ** { $/[*-1] } /;
This feature could conceivably be added to Perl 5 as well, where it would look like this:
"04abcdefgh" =~ / (\d\d) .{ (?{ $1 }) } /x;
"04abcdefgh" =~ / (\d\d) .{ (?{ $^N }) } /x;
There's precedent for allowing (?{ }) code blocks in special places in Perl 5 regexes: They can be used as the (condition) of a (?(condition)yes-pattern|no-pattern) conditional (like the one used in section A.3 above).
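For anyone who hasn't used that form of the conditional before, here's a minimal standalone sketch (the inputs and the %result hash are invented for illustration). The code block is evaluated at that point in the match, and its truth value decides which branch has to match next.

```perl
use strict;
use warnings;

# After the digit, require "big" if the digit is greater than 5,
# otherwise require "small".
my %result;
for my $input ("7big", "3small", "7small", "3big") {
    my $ok = $input =~ / ^ (\d) (?(?{ $1 > 5 }) big | small ) $ /x;
    $result{$input} = $ok ? 1 : 0;
    printf "%-7s %s\n", $input, $ok ? "matches" : "no match";
}
# 7big    matches
# 3small  matches
# 7small  no match
# 3big    no match
```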
":aa2bb4cc6dd8" =~ / (:) (?: (\w\w) (\d) )* /x; dd $&; #-> ":aa2bb4cc6dd8" dd $1; #-> ":" dd $2; #-> "dd" dd $3; #-> 8
As you can see, $2 only contains the last value matched by the second capture group - the "aa", "bb", "cc" values that were captured during prior iterations of the quantifier are lost. Ditto for $3.
What if we need all of the captured values though?
The traditional solution combines /g regexes with manual loop logic:
$_ = ":aa2bb4cc6dd8";
my @result;
if (/ : /gx) {
    $result[0] .= $&;
    while (/ \G (\w\w) (\d) /gx) {
        $result[0] .= $&;
        push @{$result[1]}, $1;
        push @{$result[2]}, $2;
    }
}
dd $result[0];  #-> ":aa2bb4cc6dd8"
dd $result[1];  #-> ["aa", "bb", "cc", "dd"]
dd $result[2];  #-> [2, 4, 6, 8]
The disadvantages are the same as those listed in section A.1.
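As with the dynamic count, the loop can be wrapped in a sub to keep the calling code tidy. This is just a sketch: parse_pairs and the hash keys it returns are made-up names, and I've added the /c modifier so that pos() survives the final failed iteration and can be used to recover the overall match.

```perl
use strict;
use warnings;

# parse_pairs is a hypothetical helper, not an existing API.
# Returns a hashref with the overall match and one array per capture group,
# or undef if the leading ":" is missing.
sub parse_pairs {
    my ($input) = @_;
    my (@words, @digits);

    $input =~ / \G : /gcx or return undef;

    # Each iteration continues where the previous one stopped (\G).
    # /c keeps pos($input) intact when the loop condition finally fails.
    while ($input =~ / \G (\w\w) (\d) /gcx) {
        push @words,  $1;
        push @digits, $2;
    }

    return {
        matched => substr($input, 0, pos($input)),
        words   => \@words,
        digits  => \@digits,
    };
}

my $r = parse_pairs(":aa2bb4cc6dd8");
print "$r->{matched}\n";                   # :aa2bb4cc6dd8
print join(",", @{ $r->{words} }),  "\n";  # aa,bb,cc,dd
print join(",", @{ $r->{digits} }), "\n";  # 2,4,6,8
```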
An alternative approach is to use embedded (?{ code }) blocks to store the captured values in the special variable $^R:
":aa2bb4cc6dd8" =~ / (:) (?{ [[], []] }) # initialize $^R (?: (\w\w) (\d) # add captures to $^R: (?{ [[@{$^R->[0]}, $2], [@{$^R->[1]}, $3]] }) )* /x; dd $&; #-> ":aa2bb4cc6dd8" dd $1; #-> ":" dd $^R->[0]; #-> ["aa", "bb", "cc", "dd"] dd $^R->[1]; #-> [2, 4, 6, 8] }
Or if you want the results to be grouped by iteration rather than capture group:
":aa2bb4cc6dd8" =~ / (:) (?{ [] }) # initialize $^R (?: (\w\w) (\d) # add captures to $^R: (?{ [@{$^R}, [$2, $3]] }) )* /x; dd $&; #-> ":aa2bb4cc6dd8" dd $1; #-> ":" dd $^R; #-> [["aa", 2], ["bb", 4], ["cc", 6], ["dd", 8]]
I'm not sure just how well this works together with backtracking. Test thoroughly before relying on that.
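One way to probe the backtracking question is a pattern that forces the quantifier to give back its last iteration. In the sketch below (input and variable names invented for the test), the greedy loop first swallows "cc6" as a third pair, then has to back off so the literal cc6 can match. perlre documents the assignment to $^R as localized and restored on backtracking, so I would expect only two pairs to survive, but treat this as a check to run on your own Perl rather than as a guarantee.

```perl
use strict;
use warnings;

my @pairs;
if (":aa2bb4cc6" =~ /
        (:)
        (?{ [] })
        (?: (\w\w) (\d) (?{ [ @{$^R}, [$2, $3] ] }) )*
        cc6
    /x) {
    @pairs = @{$^R};
}

# If the third pair had leaked through, this would print three items.
print join(" ", map { "$_->[0]$_->[1]" } @pairs), "\n";    # aa2 bb4
```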
":aa2bb4cc6dd8" ~~ / (":") [ (\w\w) (\d) ]* /; dd $0.Str; #-> ":" dd $1».Str; #-> ("aa", "bb", "cc", "dd") dd $2».Int; #-> (2, 4, 6, 8)
In Perl 5 CPAN land, Damian Conway's Regexp::Grammars also provides a mechanism to capture repeated subrules, but it requires you to express your regex in a special grammar form.
If direct support for multiple capture results is ever added to Perl 5 core, it would probably have to be opt-in via the re pragma. It might look like this:
use re 'multi_captures';
":aa2bb4cc6dd8" =~ / (:) (?: (\w\w) (\d) )* /x;
dd $&;  #-> ":aa2bb4cc6dd8"
dd $1;  #-> ":"
dd $2;  #-> ["aa", "bb", "cc", "dd"]
dd $3;  #-> [2, 4, 6, 8]
Alternatively, it could be implemented as a regex modifier (the letters b f h j k q t v w y z are still up for grabs), or as a special capture group syntax such as (?@ PATTERN ).
What do you think?