A short time ago I posted a question concerning combining regexes dynamically. I'd like to report on what I ended up doing, with some observations.

The problem is "grammar like" in that it parses all the possible constructs simultaneously and notices the first one that occurs. So, all the rules are joined together with |'s.

I'm holding the rules in an array, and here are some examples from it:

 push @parts, [ 80, qr{ \{{2} $named_link \}{2} (*:image)   }xs ];
 push @parts, [ 30, q{~ (?<body> (?&link)|.|\Z  ) (*:escape)} ];

More complete details are in the follow-up post. As you can see, each element is a tuple that contains a number and a single rule. Some rules are built in, some are generated from other configuration parameters, and the list is extensible: new rules can be added to the object, which simply means adding to this list. The number is a sort order, so you can insert your rule at the necessary position before or after others. In particular, since the "parser" doesn't do anything like longest-token-first priority, you have to put the rules in order yourself if one contains another as a subset.

The rule part can be either a string or a qr object. The user could add a qr if that's just plain simpler, or a string if the rule can't exist in isolation (for example, because it refers to a named group such as (?&link) that only exists in the combined regex). The code that joins them together adds non-capturing parens around the strings, but knows that the qr's are already self-contained.
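A minimal sketch of that joining step, under stated assumptions: `combine_rules` and the two sample rules are hypothetical names made up for illustration, not the actual code from my module. It sorts by the number, wraps strings in non-capturing parens, and leaves qr objects alone.

```perl
use strict;
use warnings;

# Hypothetical rule list in the [ sort order, rule ] shape described above.
my @parts = (
    [ 50, q{ \* (?<body> [^*]+ ) \* (*:bold) } ],   # plain string
    [ 30, qr{ ~ (?<body> . ) (*:escape) }xs ],      # already compiled
);

# Sort by the leading number, then join the alternatives with |.
sub combine_rules {
    my @rules = @_;
    my @alternatives =
        # A qr stringifies to a self-contained (?^...:...) group; a plain
        # string gets non-capturing parens. The newline before the closing
        # paren keeps a trailing /x comment from swallowing it.
        map  { ref($_) eq 'Regexp' ? "$_" : "(?:$_\n)" }
        map  { $_->[1] }
        sort { $a->[0] <=> $b->[0] } @rules;
    my $joined = join "\n|", @alternatives;
    return qr/$joined/xs;
}

my $combined = combine_rules(@parts);
```

Because the sort is by number only, the rule with order 30 is tried before the one with order 50 at every position, which is exactly why overlapping rules have to be numbered with care.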

Now when the monster joined-together regex finds "something", the code determines what was found by using the fairly new (*MARK:NAME) feature. It looks at $REGMARK and fires the proper Action, which can then use the contents of the named captures.
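Here is a small sketch of that dispatch, assuming a hypothetical two-rule pattern and a hypothetical `%action` table keyed by mark name (neither is taken from the real code). Note that $REGMARK is an ordinary package variable, so it needs an `our` declaration under strict.

```perl
use strict;
use warnings;

our $REGMARK;    # set by the regex engine in the package that ran the match

# Hypothetical combined pattern; each alternative ends in its own mark.
my $rules = qr{
      ~  (?<body> . )          (*:escape)
    | \* (?<body> [^*]+ ) \*   (*:bold)
}xs;

# Hypothetical Action table: mark name => code that gets the named captures.
my %action = (
    escape => sub { my ($cap) = @_; return $cap->{body} },
    bold   => sub { my ($cap) = @_; return "<b>$cap->{body}</b>" },
);

# Find the first construct in $text and fire the Action named by $REGMARK.
sub first_construct {
    my ($text) = @_;
    return unless $text =~ $rules;
    my %cap = %+;    # copy the named captures before anything else matches
    return $action{$REGMARK}->(\%cap);
}
```

With these sample rules, `first_construct("say *hi*")` hands the bold Action a `body` of `hi`, and `first_construct("a ~* b")` picks the escape rule because it matches earlier in the string.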

But, it's not really a grammar. The match can be recursive and find the correct closing token when other stuff is nested inside it, but only the outermost thing is "found". The $REGMARK only shows the top-level production that was taken, and only the captures from that top-level production are populated. The recursive grammar recognizes the productions, but does not perform Actions in a bottom-up manner.

Instead, the top Action that did fire knows to call the parser again on the contents. So the stuff inside is parsed twice: the first time just to locate the end of the enclosing construct, and the second time to actually find and act on the inner items.
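To make the two passes concrete, here is a hedged sketch with made-up rules: a `{...}` group that may nest and a `*bold*` span. The names `render`, `%action`, and the output tags are all hypothetical. The recursion via (?&group) only locates the matching close brace; the group Action then calls `render` again on the body to act on what is inside.

```perl
use strict;
use warnings;

our $REGMARK;

# Hypothetical rules. (?&group) recurses so nested braces balance, but
# only the outermost construct sets the mark and the named captures.
my $rules = qr{
      (?<group> \{ (?<body> (?: (?&group) | [^{}] )* ) \} )  (*:group)
    | \* (?<body> [^*{}]+ ) \*                               (*:bold)
}xs;

# The group Action re-runs the parser on its contents (the second pass).
my %action = (
    group => sub { my ($cap) = @_; return '<div>' . render($cap->{body}) . '</div>' },
    bold  => sub { my ($cap) = @_; return "<b>$cap->{body}</b>" },
);

sub render {
    my ($text) = @_;
    my ($out, $pos) = ('', 0);
    while ($text =~ /$rules/g) {
        # Copy the match results first: the Action may run the regex again.
        my ($mark, $start, $end) = ($REGMARK, $-[0], $+[0]);
        my %cap = %+;
        $out .= substr($text, $pos, $start - $pos);   # plain text before the match
        $out .= $action{$mark}->(\%cap);
        $pos = $end;
    }
    return $out . substr($text, $pos);
}
```

The inner `*c*` in `a {b *c* {d}} e` is scanned once as ordinary characters while the outer braces are being balanced, and only becomes a bold span when the group Action renders the body.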

I looked at Regexp::Grammars but it didn't really fit my needs. (Although after doing this, I think my code is evolving towards a grammar-based system that could use it.) It might be interesting to learn how it transformed the input into a regexp that did capture all the nested elements. Perhaps making the Action a CODE call rather than a label would give me full bottom-up grammar parsing. But part of my experiment was to use "approved" new Regexp features and not use the "dire warning" ones.

As it is, I have limited need for recursion in the constructs, and I'm not limited to strictly repeating the grammar when I process the inner part: I have a cross between a formal grammar and a recursive descent parser.

One issue with using CODE calls as the Actions is that the Regexp engine is not reentrant, so the Action code would not be able to do much. That explains Damian's approach of just saving all the captures from the recursed branch and letting the program process the whole list after the parsing is done. But that approach is not suitable for making big decisions as it goes, to decide what to parse next or how to process the input.

That's all.
—John

In reply to Some Results on Composing Complex Regexes by John M. Dlugosz
