A short time ago I posted a question concerning combining regexes dynamically. I'd like to report on what I ended up doing, with some observations.

The problem is "grammar-like" in that it parses all the possible constructs simultaneously and acts on the first one that occurs. So, all the rules are joined together with |'s.

I'm holding the rules in an array, and here are some examples from it:

push @parts, [ 80, qr{ \{{2} $named_link \}{2} (*:image) }xs ];
push @parts, [ 30, q{~ (?<body> (?&link)|.|\Z ) (*:escape)} ];
More complete details are in the follow-up post. You can see that each element is a tuple containing a number and a single rule. There are built-in rules, some that are generated from other configuration parameters, and finally it is extensible in that new rules can be added to the object. Adding rules means adding to this list. The number is a sort key, so you can insert your rule in the necessary location before/after others. In particular, since the "parser" doesn't do anything like longest-token-first prioritization, you have to order the rules yourself if one matches a subset of another.
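As a toy illustration of the sort key (the rule names here are invented, not from the real rule set), a new rule numbered between two existing ones lands between them:

```perl
use strict;
use warnings;

# Hypothetical sketch: the first tuple element is a sort key that decides
# where a newly added rule lands relative to the built-in ones.
my @parts = ( [ 30, 'escape' ], [ 80, 'image' ] );
push @parts, [ 50, 'my_extension' ];   # sorts between 30 and 80

my @ordered = map { $_->[1] } sort { $a->[0] <=> $b->[0] } @parts;
# @ordered is now ('escape', 'my_extension', 'image')
```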

The rule part can be either a string or a qr object. The user could add a qr if it's just plain simpler, or a string if it can't stand in isolation. The code that joins them together adds non-capturing parens around the strings, but knows the qr's are already self-contained.
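The joining step can be sketched like this (the rule bodies here are invented); the point is that a bare | inside a string rule would otherwise leak into the surrounding alternation, while a qr object is already self-delimiting:

```perl
use strict;
use warnings;

# Minimal sketch, assuming two toy rules: strings get wrapped in
# non-capturing parens; qr objects are used as-is.
my @rules = (
    qr{ foo (*:foo) }x,         # qr object: already self-contained
    q{ bar | baz (*:bar) },     # string: the bare | needs (?: ... )
);
my $joined = join "\n | ", map { ref($_) ? $_ : "(?: $_ )" } @rules;
my $re = qr{ $joined }x;
```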

Now when the monster joined-together regex finds "something", the code determines which branch matched by using the fairly new (*MARK:NAME) feature. Looking at $REGMARK, the proper Action is fired, and it can use the contents of the named captures.
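A self-contained miniature of this dispatch (with two invented toy rules, not the real Creole ones) looks like:

```perl
use strict;
use warnings;
our $REGMARK;   # set by the regex engine when a (*:NAME) mark is passed

# Hypothetical miniature of the dispatch loop: every branch ends in a
# (*:NAME) mark, and $REGMARK reports which branch won.
my $parser = qr{ (?<prematch>.*?)
    (?: \*{2} (?<body>.*?) \*{2} (*:bold)   # **bold**
      | ~ (?<body>.) (*:escape)             # ~x escapes x
      | \z (*:nada)                         # must be last: end of input
    ) }xs;

sub mini_format {
    my ($line) = @_;
    my @out;
    while ($line =~ /$parser/g) {
        my %cap  = %+;        # save captures before the engine runs again
        my $mark = $REGMARK;
        push @out, $cap{prematch} if length $cap{prematch};
        last if $mark eq 'nada';
        push @out, $mark eq 'bold' ? "<b>$cap{body}</b>" : $cap{body};
    }
    return join '', @out;
}
```

Note that %+ and $REGMARK must be copied out immediately, since the next trip through the while loop reuses the engine and clobbers them.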

But, it's not really a grammar. The match can be recursive and find the correct closing token when other stuff is nested inside it, but only the outermost thing is "found". The $REGMARK only shows the top-level production that was taken, and only the captures from that top-level production are populated. The recursive grammar recognizes the productions, but does not perform Actions in a bottom-up manner.

Instead, the top Action that did fire knows to call the parser again on the contents. So the stuff inside is parsed twice: the first time just to locate the end of the enclosing construct, and the second time to actually find and act on the inner items.
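The two-pass idea can be sketched in miniature (the function name and rules are invented for illustration; the real code re-invokes the full parser rather than a substitution):

```perl
use strict;
use warnings;

# Hypothetical sketch of the two-pass approach: the outer bold rule only
# locates its closing **, then its action re-parses the body, so the
# inner escape is only acted on during the second pass.
sub format_inline {
    my ($text) = @_;
    if ($text =~ /\A(.*?)\*\*(.*?)\*\*(.*)\z/s) {
        my ($pre, $body, $post) = ($1, $2, $3);
        # second pass over the contents of the bold construct
        return format_inline($pre) . '<b>' . format_inline($body) . '</b>'
             . format_inline($post);
    }
    $text =~ s/~(.)/$1/gs;    # stand-in for the escape rule
    return $text;
}
```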

I looked at Regexp::Grammars but it didn't really fit my needs. (Although after doing this, I think my code is evolving towards a grammar-based system that could use it.) It might be interesting to learn how it transformed the input into a regexp that did capture all the nested elements. Perhaps making the Action a CODE call rather than a label would give me full bottom-up grammar parsing. But part of my experiment was to use "approved" new Regexp features and not use the "dire warning" ones.

As it is, I have limited need for recursion in the constructs, and I'm not limited to strictly repeating the grammar when I process the inner part: I have a cross between a formal grammar and a recursive descent parser.

One issue with using CODE calls as the Actions is that the Regexp engine is not re-entrant, so the Action code would not be able to do much. Hence Damian's approach of simply saving all the captures from the recursed branches and letting the program process the whole list after the parsing is done. But that's not suitable for making big decisions as it goes, to decide what to parse next or how to process the input.

That's all.
—John

Re: Some Results on Composing Complex Regexes
by BrowserUk (Patriarch) on May 08, 2011 at 10:23 UTC

    Any chance of you posting some samples of what you needed to parse and the regex you came up with to do so?

    Without it, your meditation is interesting, but rather too abstract to really get an insight from.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Some Results on Composing Complex Regexes
by John M. Dlugosz (Monsignor) on May 10, 2011 at 03:56 UTC
    If you want to know more about this parser, this note includes some details. The project can be found at https://github.com/jdlugosz/Text--Creole. This will be a library for parsing Wiki Creole and producing xhtml or other output.

    The "inline formatting" (as opposed to block-level constructs) is what this grammar is for. It has to handle things like //italic// and **bold**, and understand that '//' might be part of an implicit link (URL) and not an italic open/close indicator. Similarly, ~ is used for escape and also can be in front of something that looks like a URL to prevent it from being auto-link-ified.

    The monster regexp is built with this code:

    my $named_link= q{ \s* (?<linkspec>[^|]*?) \s* (?: (?<pipe>\|) (?<name>.*?) \s* )? };

    sub build_parser_rules {
        my @parts;
        push @parts, [ 70, q{ \{{3} \s* (?<body>.*?) \s* \}{3} (*:nowiki) } ];
        push @parts, [ 80, qr{ \{{2} $named_link \}{2} (*:image) }xs ];
        push @parts, [ 90, qr{ \<{3} \s* (?<body>.*?) \s* \>{3} (*:placeholder) }xs ];
        # special rules for //, skip any links in body.
        push @parts, [ 40, q{// \s* (?<body>(?: (?&link) | . )*?) \s* (?: (?: (?<!~)//) | \Z)(*:italic) } ];
        push @parts, [ 30, q{~ (?<body> (?&link)|.|\Z ) (*:escape)} ];
        push @parts, [ 60, qr{\\\\ (*:break)}x ];
        return \@parts;
    }

    method formulate_link_rule {
        # formulate the 'link' rule, which includes link_prefixes which are set after construction.
        # So this is used at the last moment before the parser is created.
        my $linkprefix= join "|", map { quotemeta($_) } @{$self->link_prefixes};
        my $blend= $self->get_parse_option('blended_links')
            ? q{ (?(<pipe>)|(?<blendsuffix> \w+)? ) }
            : '';
        my $link= qr{ (?<link>
            (?: \[{2} $named_link \]{2}  # explicit use of brackets
                $blend                   # blend suffix
            )
            |
            (?: (?<linkspec>(?: $linkprefix )://\S+?)
                (?= [,.?!:;"']?  # “Single punctuation characters (,.?!:;"') at the end of URLs should not be considered part of the URL.”
                    (?: \Z|\s )
                )  # since I used a lazy quantifier to allow the trailing punctuation, need to know how to end.
            )
            )(*:link) }x;
        return [ 10, $link ];
    }

    method formulate_simples_rule {
        # formulate the 'simples' rule, which includes simple_format_tags which are set after construction.
        # simple_format_tags are qw[** //] for standard Creole, and can have extensions such as
        # qw/__ ## ^^/ for underlined, monospaced, superscript, etc.
        # "simple" means same open and close and maps to a html tag.
        # So this is used at the last moment before the parser is created.
        my $simples= join "|", map { $_ eq '//' ? () : quotemeta($_) } (keys %{$self->simple_format_tags});
        return [ 50, q{(?<simple> (?:} . $simples . q{))\s*(?<body>.*?) \s* (?: (?: (?<!~)\k<simple>) | \Z) (*:simple)} ];
    }

    method get_final_parser_rules {
        my $parser_rules= $self->parser_rules;
        push @$parser_rules, $self->formulate_link_rule, $self->formulate_simples_rule;
        return $parser_rules;
    }

    method _build_parser_spec {
        my $parser_rules= $self->get_final_parser_rules;
        my $branches_string= join "\n | ",
            map { my $x= $$_[1];  ref $x ? $x : "(?: $x )" }
            (sort { $a->[0] <=> $b->[0] } @$parser_rules);
        my $ps= qr{(?<prematch>.*?)
            (?: $branches_string
              | \Z (*:nada)  # must be the last branch
            ) }xs;
        return $ps;
    }
    The _build_parser_spec is a lazy Moose Builder. It splices together all the rules and inserts the result into the rest of the expression. build_parser_rules is also a Builder; it just puts together the array, which can be extended by add-ins.

    The function that uses the resulting regexp is:

    method do_format (Str $line) {
        my @results;
        my $ps= $self->_parser_spec;
        while ($line =~ /$ps/g) {
            my %captures= %+;
            my $regmark= $REGMARK;
            my $prematch= $captures{prematch};
            push @results, $self->escape($self->filter($prematch))
                unless length($prematch)==0;
            unless ($regmark eq 'nada') {
                my $meth= "grammar_branch_$regmark";
                push @results, $self->$meth(\%captures);
            }
        }
        return join('', @results);
    }
    This uses $REGMARK to know which branch was taken and converts that to an Action in $meth. The %+ hash contains the captures, which need to be saved off before the regexp engine is used again.

    Finally, the resulting text is gathered as individually generated pieces into an array and joined at the very end. I think it is more efficient to join once I know the final length than to keep appending small parts; there is no reason to keep it as a contiguous string as I go.