Re^2: Simplifying regexes

Parse::RecDescent does take some time to get to know, and I wish that the tutorials were better. But the key concept is that it is very useful when you have a complex, naturally structured input: one in which it is easy to describe the pieces of the greater whole in the form of regexes, but you lack a tool that will “piece them all together.” A parser is such a tool.

For example, consider the task of building a single regex that validates an arithmetic expression such as 1 * 2 + ( 3 * 4 ). If you attempt to do this in one regex (and I have seen it done, e.g. in RFC: Perl regex to validate arithmetic expressions, written four years ago), you quickly run into the problem that an arithmetic expression has a syntactic structure: operators need operands, and parentheses must nest and balance. It is not simply a stream of characters. (For instance, 1 ) 2 * + 3 4 ( * is not a valid expression, even though it consists of the same nine so-called “tokens.”)

A parser-driven approach would decompose the problem into two or more stages. Regexes would be used to describe the individual tokens that make up the expression. (There are nine tokens in this example.) Then, the grammar would define how the tokens may legitimately appear together in a “valid” sequence.
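
To make the two stages concrete, here is a minimal sketch, written in Python rather than Perl purely to keep it self-contained; it is not Parse::RecDescent and not how that module works internally. The grammar (expr, term, factor), the token names, and the helper names tokenize and is_valid are all my own assumptions for illustration. Stage one uses a regex to describe the tokens; stage two is a small hand-written recursive-descent recognizer that says how the tokens may follow one another.

```python
import re

# Stage 1: one regex describes every individual token (names are illustrative).
TOKEN_RE = re.compile(
    r"\s*(?:(?P<NUM>\d+)|(?P<OP>[-+*/])|(?P<LPAREN>\()|(?P<RPAREN>\)))"
)

def tokenize(text):
    """Split text into (kind, value) pairs; raise ValueError on stray characters."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

# Stage 2: the grammar says which token sequences are valid.
#   expr   := term   (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUM | '(' expr ')'
def is_valid(text):
    """Return True if text is a well-formed arithmetic expression."""
    try:
        tokens = tokenize(text)
    except ValueError:
        return False
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def expr():
        nonlocal pos
        if not term():
            return False
        while peek() in (("OP", "+"), ("OP", "-")):
            pos += 1
            if not term():
                return False
        return True

    def term():
        nonlocal pos
        if not factor():
            return False
        while peek() in (("OP", "*"), ("OP", "/")):
            pos += 1
            if not factor():
                return False
        return True

    def factor():
        nonlocal pos
        kind, _ = peek()
        if kind == "NUM":
            pos += 1
            return True
        if kind == "LPAREN":
            pos += 1
            if not expr():
                return False
            if peek()[0] != "RPAREN":
                return False
            pos += 1
            return True
        return False

    # Valid only if the grammar matched and consumed every token.
    return expr() and pos == len(tokens)

print(is_valid("1 * 2 + ( 3 * 4 )"))   # the valid example from above
print(is_valid("1 ) 2 * + 3 4 ( *"))   # same tokens, scrambled
```

Both inputs get through the tokenizer, since each contains the same nine legal tokens; only the grammar stage tells them apart. A parser generator automates the second stage: you write the three grammar rules and it produces the equivalent of expr, term, and factor for you.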

Parse::RecDescent takes an input which consists of, among other things, a grammar for your language and Perl source code (the “actions”) that is to be textually included in the generated parser. From this it builds executable Perl code behind the scenes, which becomes the complete recognizer, or parser, for your language. So you get a working pure-Perl parser that you did not have to write entirely from scratch.

Nearly every language-processing system ultimately uses this multi-level, lexer/parser-driven approach on its front end. Perl, for example, uses (I think ...) a parser generated by YACC (“Yet Another Compiler-Compiler”) as the first thing that it unleashes against your source code. At strategic points, the YACC-generated parser calls other routines within Perl that build the system’s “understanding” of what your source code says. This is Magickally Transformed into what ultimately drives the runtime language system ... which is (also) an automaton. Structurally speaking, regex evaluation proceeds the same way: the pattern is first parsed and compiled, and the compiled form is then run against the input, although the same tools are not typically used.

Useful pages:

http://biteresources.com/resources/computing/A2/regular_expressions.pdf
http://osteele.com/tools/reanimator/ (requires Flash)
http://www.tattvum.com/regular-expressions-and-compilers
and, certainly not least, https://en.wikipedia.org/wiki/Regular_expression, which has an entire section on implementations and running-times.