Re: Simplifying regexes

It can be very instructive to look behind the scenes of any regular-expression engine implementation ... or, for that matter, into the lexer/parser engines which drive the initial stages of any compiler or interpreter (including Perl). The front end is a finite automaton of some sort, driven by a grammar. Its job is to produce an internal data structure (e.g. an AST, an Abstract Syntax Tree, of some kind), which is then transformed into the data structures that drive the actual engine of the thing, which is another (but very different) automaton. You might see any and all of the stages found in a “bigger” interpreter/compiler, including optimization. (I even encountered one voodoo implementation that spat out a compiled machine-code subroutine for über-fast performance, back in the days when computers were a lot smaller and slower than they are now.)
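If you want to peek behind the curtain in Perl itself, the core `re` pragma is one way in. What follows is only a minimal sketch: the patterns and the sample string are made up for illustration, and the debug dump (which goes to STDERR) varies between Perl versions, so I make no claim about what it will look like on your machine.

```perl
use strict;
use warnings;
use re qw(regmust);    # helper function exported by the core 're' module

# Ask the optimizer which fixed substrings it decided a match must contain.
# The pattern is an arbitrary illustration.
my $qr = qr/here .* there/x;
my ($anchored, $floating) = regmust($qr);
print 'anchored: ', $anchored // '(none)', "\n";
print 'floating: ', $floating // '(none)', "\n";

{
    # Lexically scoped: within this block, Perl dumps the compiled regex
    # program and a trace of the match attempt to STDERR.
    use re 'debug';
    my $matched = 'abcabd' =~ /ab(?:c|d)$/;
}
```

The dump shows the little "program" that the pattern was compiled into, which is the second automaton mentioned above, not the text you typed.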

The main driver of these systems is a set of precedence rules and other heuristics which determine exactly how the source-language string is decomposed into an AST, and from there how the AST is consumed to produce the final road map for the execution engine. These rules do not particularly correspond to mathematical laws or algebraic identities, even though there is a superficial similarity (especially in the case of regular expressions, which at first glance look like equations). Regexes are not equations: they are computer programs expressed in a very, very terse form. They are source code. And so the actual path between “the source code as written” and the eventual behavior of the execution engine is the total behavior of a multi-step (albeit very small) interpreter having both compile-time and run-time stages. Although you might be able to point to apparent mathematical identities, what you are really looking at is the precedence rules ... and you are comparing apples to oranges. This is a concern of grammars, not of algebra.
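A small illustration of why the "algebra" is only skin deep (the sample string and patterns are my own, not from the original question): as sets of strings, `foo|foobar` and `foobar|foo` describe the same language, yet Perl's engine tries the alternatives left to right and keeps the first one that lets the match succeed, so the two "equivalent" patterns behave differently.

```perl
use strict;
use warnings;

my $text = 'foobar';

# Same alternatives, different order. Alternation is tried left to right,
# so the order is part of the program, not a detail you can commute away.
my ($first)  = $text =~ /(foo|foobar)/;
my ($second) = $text =~ /(foobar|foo)/;

print "foo|foobar captured: $first\n";     # foo
print "foobar|foo captured: $second\n";    # foobar
```

Swapping the operands of `|` would be a harmless identity in an equation; here it changes what gets captured.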

In passing, I would also opine that people sometimes ask “a single regex” to do too much. Sometimes what you really need is more than one regex, used in the context of a language parser of your very own, which (yay!) you don’t have to write from scratch. For example, there is Parse::RecDescent, an amazing (not so ...) little piece of software that I have used with great success to do things that just didn’t seem possible ... quickly and efficiently. Many things that would otherwise produce “a write-only regex” are easily and efficiently(!) handled by a P::RD grammar, and the work is done at runtime by a Perl subroutine which P::RD builds on-the-fly. Check it out ... it is definitely a Perl package that you should make it your business to know because, once you do, you’ll be using it a lot.
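To give a flavor of it, here is a minimal sketch of a P::RD grammar. The grammar, the rule names (`pairs`, `pair`, `key`, `value`) and the input string are all hypothetical, invented for this example; Parse::RecDescent itself comes from CPAN and is not part of core Perl.

```perl
use strict;
use warnings;
use Parse::RecDescent;    # CPAN module, must be installed separately

# Hypothetical grammar: comma-separated key=value pairs.
# Each terminal is a small, readable regex instead of one giant pattern.
my $grammar = q{
    pairs : pair(s /,/) /\z/   { $return = { map { @$_ } @{ $item[1] } } }
    pair  : key '=' value      { $return = [ $item{key}, $item{value} ] }
    key   : /[A-Za-z_]\w*/
    value : /\d+/ | /"[^"]*"/
};

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

# Each rule becomes a method on the parser; it returns undef on failure.
my $config = $parser->pairs('width=80, title="hello", depth=3')
    or die "Input did not parse\n";

print "$_ => $config->{$_}\n" for sort keys %$config;
```

Note that a regex terminal returns the whole matched text, so the quoted value keeps its quotes in this sketch; stripping them would be one more line in the `value` action.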

Replies are listed 'Best First'.
Re^2: Simplifying regexes by ExReg (Priest) on Oct 26, 2015 at 16:04 UTC
I love what you are saying. I only wish I could fully comprehend all of what you said. If I could, I think it would suffice. As far as asking a regex to do too much, yes I often do. It is such a neat tool to use that I often forget the Swiss Army chainsaw of the rest of Perl. I also have recently installed Parse::RecDescent at home. Wish I could get it at work.