Simplifying regexes

ExReg has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Simplifying regexes
by Laurent_R (Canon) on Oct 26, 2015 at 16:02 UTC

You might find an interesting discussion (and quite a bit of information) about that (and about NFA vs. DFA regex engines and many other things) in Jeffrey Friedl's very good book, Mastering Regular Expression (O' Reilly), especially Chapter 4, "The Mechanics of Expression Processing".

[reply]

Re^2: Simplifying regexes

by ExReg (Priest) on Oct 26, 2015 at 16:07 UTC

Yes, I have read that portion several times. It has much good stuff in it, and I just need to read it some more. Perhaps I will be able to make it through it some day without my eyes glazing over first.

[reply]

Re: Simplifying regexes
by AppleFritter (Vicar) on Oct 26, 2015 at 16:46 UTC

As other monks have pointed out, Perl's regular expressions are indeed anything but regular in the formal sense. Formal regular expression use only the following operations:

Concatenation: if a and b are regular expressions, then so is ab.
Alternation: if a and b are regular expressions, then so is (a|b).
The Kleene star: if a is a regular expression, then so is a*.

If you want to wrap your head around what these regular expressions really mean, though, it's best to start with formal languages, specifically regular languages. Regular expressions describe these languages in a very natural and obvious manner.

That said, while fascinating, none of this will help a lot with simplifying Perl regular expressions in a Perl program. For that, you'll need a good intuition of how regular expressions work in Perl, and for that you'll simply have to use them, time and again.

If you're not getting along with Mastering Regular Expressions, BTW, the chapter on regular expressions in Programming Perl is eminently readable and accessible.

[reply]

Re^2: Simplifying regexes

by ExReg (Priest) on Oct 26, 2015 at 17:23 UTC

Reading the replies above, I get the impression that looking at the theoretical Regular Expression material, while probably good to learn, is limited in its applicability to Perl regular expressions. It would appear that the path to enlightenment is through practice, and not pedagoguery.

[reply]

Re^3: Simplifying regexes

by Laurent_R (Canon) on Oct 26, 2015 at 17:56 UTC

Reading the replies above, I get the impression that looking at the theoretical Regular Expression material, while probably good to learn, is limited in its applicability to Perl regular expressions.

[reply]

Re^3: Simplifying regexes

by AnomalousMonk (Archbishop) on Oct 27, 2015 at 08:55 UTC

It would appear that the path to enlightenment is through practice, and not pedagoguery.

This has been my experience. "Regular expressions," as they have evolved (into ir-regular expressions, as others have noted), are the most counterintuitive things I have encountered in programming.

Here's my favorite example of this conceptual orneriness. What will the regex /(b*)/ capture when run against the string 'aaaaabbbb' and where (i.e., at what character position offset)? Knowing as we do that * matches "as much as possible" of something, I'd almost be willing to bet money that even the most regex-savvy will experience at least a minor knee-jerk twitch in the general direction of "it matches 'bbbb' at offset 5." But surprise, surprise:

c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = 'aaaaabbbb';
 print qq{matches '$1' at offset $-[1]} if $s =~ /(b*)/;
"
matches '' at offset 0
[download]

'b'

The answer to the puzzle is that "as much as possible" cheats and leaves out the leftmost and equally important part of the "leftmost longest" incantation that should properly be used^* to describe all regex matching. Bottom line: No amount of theoretic or pedagogic vaccination can build up your natural antibody resistance to this sort of thought-bug better than daily exposure to a wide variety of regex challenges. Good luck in your experimentation, and may you make many mistakes, for I know no better way to learn this stuff.

* Old Whateley and his son Wilbur knew the dangers of incomplete and possibly maliciously altered incantations.

Give a man a fish: <%-{-{-{-<

[reply]
[d/l]
[select]

Re^2: Simplifying regexes

by locked_user sundialsvc4 (Abbot) on Oct 26, 2015 at 17:27 UTC

Parse::RecDescent does take some time to get to know, and I wish that the tutorials were better. But the key concept is that it is very useful when you have a complex, naturally structured input. An input in which it is easy to describe, in the form of regexes, the pieces of the greater whole, but now you lack a tool that will “piece them all together.” A parser is such a tool.

For example, consider the task of trying to build a single regex that will validate an arithmetic expression such as 1 * 2 + ( 3 * 4 ). If you attempted to do this in one regex (and I have seen it done, e.g. as in RFC: Perl regex to validate arithmetic expressions, written four years ago), you quickly run into the problem that an arithmetic expression has a semantic structure. It is not simply a stream of characters. (For instance, 1 ) 2 * + 3 4 ( * is not a valid expression, even though it consists of the same nine so-called “tokens.”)

A parser-driven approach would decompose the problem into two or more stages. Regexes would be used to describe the individual tokens that make up the expression. (There are nine tokens in this example.) Then, the grammar would define how the tokens may legitimately appear together in a “valid” sequence.

Parse::RecDescent takes an input which consists of, among other things, a grammar for your language and source-code that is to be textually included into the parser subroutine. This is used to create an executable Perl subroutine behind-the-scenes which becomes the complete recognizer, or parser, for your language. So you get the efficiency of a lean-and-mean Pure Perl subroutine that you did not have to entirely write from scratch.

Every language-processing system ultimately uses this multi-level, lexer/parser driven approach on its front-end. Perl, for example, uses (I think ...) the YACC = Yet Another Compiler-Compiler toolset as the first thing that it unleashes against your source-code. At strategic points, the YACC-generated parser calls other routines within Perl that build the system’s “understanding” of what your source-code says. This is Magickally Transformed into what ultimately drives the runtime language system ... which is (also) an automaton. Structurally speaking, regex evaluation proceeds the same way, although the same tools are not typically used.

Useful pages:

http://biteresources.com/resources/computing/A2/regular_expressions.pdf
http://osteele.com/tools/reanimator/ (requires Flash)
http://www.tattvum.com/regular-expressions-and-compilers
and, certainly not least, https://en.wikipedia.org/wiki/Regular_expression, which has an entire section on implementations and running-times.

Re: Simplifying regexes
by Athanasius (Archbishop) on Oct 26, 2015 at 15:33 UTC

Hello ExReg,

what does ɛ correspond to in our usage?

In formal language theory as applied to regular expressions, ε (epsilon) denotes the empty string. Hence ε corresponds to "" in our (Perl) usage.

See Regular_expression#Formal_definition (which incorrectly, IMO, defines ε as the set containing the empty string) and, more accurately, Empty_string.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]

Re^2: Simplifying regexes

by ExReg (Priest) on Oct 26, 2015 at 15:53 UTC

Thank you. I have not looked at this obvious site in several years, and it has a nice summary of the kind of stuff I am looking for.

[reply]

Re: Simplifying regexes
by GotToBTru (Prior) on Oct 26, 2015 at 15:14 UTC

I seem to recall some discussions here on deviations between the math/compsci concept of Regular Expressions, and the actual implementation in the Perl engine. It seems like you have already hit upon a source to describe the former; for the latter, I suggest:

Tutorials here to learn (and also Super Search for more examples)
This site to experiment with what you've learned.

Update: example of one such discussion.

Dum Spiro Spero

[reply]

Re^2: Simplifying regexes

by AnomalousMonk (Archbishop) on Oct 26, 2015 at 16:09 UTC

This site (above) is written and maintained by davido. See his personal node for another link in case you're ever at a loss for this one.

Give a man a fish: <%-{-{-{-<

[reply]
[d/l]

Re^3: Simplifying regexes

by ExReg (Priest) on Oct 26, 2015 at 17:26 UTC

Bookmarked. Thanks!

[reply]

Re^2: Simplifying regexes

by ExReg (Priest) on Oct 26, 2015 at 15:24 UTC

Thank you for the links to the various tutorials and discussions. I have read all those over the many years I have used regexes. I feel fairly confident using regexes, including lookarounds, although the dynamic types are still fairly new to me. I am looking for something a little more abstract and theoretical that would probably get into NFA and DFA. I have read the section in Mastering Regular Expressions towards the end that introduces the concepts, but would like more info.

[reply]

Re: Simplifying regexes
by BrowserUk (Patriarch) on Oct 26, 2015 at 15:16 UTC

I have been attempting to get a firm grasp on the ins and outs of simplifying regular expressions.

What is your goal here?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]

Re^2: Simplifying regexes

by ExReg (Priest) on Oct 26, 2015 at 15:34 UTC

In the tools that I now work with that have been developed over many years, there are regexes that can be rather abstruse. Going through them with a /x can help. I have run across some regexes that are either recursive or part of a circular definition. I am looking for insight on how to refactor them into non self referential forms so that they can work, and am hoping to become more enlightened on Regexes. I never had courses on these kind of topics, so I have a huge void to fill. I have a lot of the practical down. Now I would like to get a bit of the theoretical.

[reply]

Re^3: Simplifying regexes

by BrowserUk (Patriarch) on Oct 26, 2015 at 16:17 UTC

Warning: I'm also very light on the theory of NFA's and DFA's.

Perl was my first language (of then (13 years ago) ~10 languages I'd used in earnest) that provided "regex". I found them really hard to get to grips with.

By about 8 or 9 years ago I'd played with them enough that I could do most things I wanted to do with them; though I often found myself needing to experiment with them to arrive at a "solution" rather than being able to plan that solution.

It was around that time that I read Friedl's book. And, at the risk of incurring even more wrath than normal, I'm going to say that: it bored me to tears. Immense; thorough; an opus of extraordinary accomplishment that it is; the one word I cannot use in praise of it is 'enlightening'.

I tried applying what I thought I had learnt from reading it, to my subsequent attempts at regex and found that I still had to 'suck it and see', when it came to solving real world problems.

SO then I went off to Wikipedia and read the stuff on NFA's and DFA's; and followed the links from those pages; and read (scanned) a bunch of theses, papers and academic journals on the subjects and ...;

I was still none the wiser.

What I did learn was that Perl (compatible) regex do not comply with the academic descriptions or rules for either type of automaton. It shortcuts the rules for DFA's willy-nilly and defies categorisation amongst the (formal) NFA descriptions. In short: Perl's regular expressions are anything but regular; and (IMO) formal regex theory has little or nothing to teach you if you are already accomplished at using them practically.

(My)bottom line is that when to comes to optimising Perl's regex, there is only one way to proceed: Benchmark! If you can conceive of two (or more) ways to achieve a particular goal; time'em and opt for the quicker; because all the reading in the world of formal regular expression theory will not help one iota.

Good luck with your research.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]

Re^4: Simplifying regexes

by ExReg (Priest) on Oct 26, 2015 at 17:24 UTC

Re: Simplifying regexes
by locked_user sundialsvc4 (Abbot) on Oct 26, 2015 at 15:44 UTC

It can be very instructive to look behind the scenes of any regular-expression engine implementation ... or, for that matter, into the lexer/parser engines which drive the initial stages of any compiler or interpreter (including Perl). The implementation is a finite automaton of some sort, driven by a grammar. However, this is used to produce an internal data structure (e.g. an AST = Abstract-Syntax Tree of some kind), which is then manipulated to produce the data structures which drive the actual engine of the thing, which is another (but very different) automaton. You might see any and all of the stages found in a “bigger” interpreter/compiler, including optimization. (I even encountered one voodoo-implementation that spat out a compiled machine-code subroutine for über-fast performance in the days when computers were a lot smaller and slower than they now are.)

The main driver of these systems is a set of precedence-rules and other heuristics which determine exactly how the source-language string will be decomposed to an AST, and thence how the AST will be further consumed to produce the final road-map for the execution engine. These do not particularly correspond to mathematical laws or algebraic identities, even though at a superficial level there is an apparent similarity (especially in the case of regular expressions, which at first glance appear to be equations). Regexes are not equations: they are computer programs expressed in a very, very terse form. They are source code. And so, the actual path between “the source-code as written,” and the eventual behavior of the execution engine, is the total behavior of a multi-step (albeit very small) interpreter having both compile-time and run-time stages. Although you might be able to point to apparent mathematical identities, what you are really doing is looking at the precedence rules ... and comparing apples to oranges. This is a concern of grammars, not mathematical algebra.

In passing, I would also opine that sometimes people ask “a single regex” to do too much. Sometimes what you really need is more-than-one regex, used in the context of a language parser of your very own, which (yay!) you don’t have to write from scratch. For example, Parse::RecDescent, which is an amazing (not so ...) little piece of software that I have used with great success to efficiently do things that just didn’t seem possible ... quickly and efficiently. Many things that would produce “a write-only regex” are easily and efficiently(!) handled by a P::RD grammar, and the work is done at runtime by a Perl subroutine which P::RD builds on-the-fly. Check it out ... it is definitely a Perl package that you should make it your business to know because, once you do, you’ll be using it a lot.

Re^2: Simplifying regexes

by ExReg (Priest) on Oct 26, 2015 at 16:04 UTC

I love what you are saying. I only wish I could fully comprehend all of what you said. If I could, I think it would suffice. As far as asking a regex to do too much, yes I often do. It is such a neat tool to use that I often forget the Swiss Army chainsaw of the rest of Perl. I also have recently installed Parse::RecDescent at home. Wish I could get it at work.

[reply]