Background

I was looking at rt://36098, a bug in Perl::Critic, which has a policy against overly complex regular expressions. Right now, the "complexity" of a regular expression is simply its length after a few massages, which means that a regular expression interpolating a variable with a long name is regarded as "complex". What I'd like is some other way to assign a numeric value to a regular expression to quantify its complexity.

What is complexity?

Ideally, it's comprehensibility. If I could measure it with a program, it would be the time required for a model programmer to understand the expression. Since I can't measure that, I have to come up with something else.

Treat it like a program

One suggestion was cyclomatic complexity. This is normally used for analysing programs, but I think it could be applied to regular expressions too. (Some say regular expressions are little programs in their own little language.) This might turn out to be the best option available, but I think implementing it would be a little more work than warranted by the problem.
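
For a sense of scale, a crude approximation can be had without any real parsing. The sketch below is my own loose adaptation, not McCabe's formal definition: it counts each unescaped alternation bar or quantifier in the flat pattern text as a decision point, and it ignores character classes, {n,m} quantifiers, and multi-character escapes entirely.

    # Rough cyclomatic-style score: one path through the pattern, plus one
    # per unescaped decision point (alternation bar or quantifier).  This
    # is a flat scan, so brackets, braces, and escapes are handled naively.
    sub cyclomatic_ish {
        my ($regex) = @_;
        my $decisions = () = $regex =~ /(?<!\\)[|*+?]/g;
        return 1 + $decisions;
    }

    print cyclomatic_ish('a|b'), "\n";   # 2
    print cyclomatic_ish('a*'),  "\n";   # 2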

Matchable strings heuristic

I thought it might be easy and meaningful to count the number of strings an expression could match. This requires some wild assumptions and other semi-sensible assertions, but it's still the best thing I've thought of.

If I'm literally counting matchable strings, /\d/, /[0-9]/, and /(?:0|1|2|3|4|5|6|7|8|9)/ are all equally complex, but I suspect every reader finds one of them much easier to understand than the others (and probably not every reader agrees on which one). Not only that, by this count /(?:3|7|8|4|1|0|2|6|5)/ is less complex, even though you'd have to stare at it longer to realize it's /[0-8]/ by another name.

There's also the problem of /a*/ or /b+/, which are easy to understand but match an infinite number of strings. I think /a*/ may be simpler than /a*?/, but they'll both match infinitely many strings. In the code I've written, any quantifier counts as two "paths", which means /a*/ scores the same as /a|b/. I think that's better than infinity, but perhaps not much better.
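
To make the arithmetic concrete, here's a toy version of the counting over a hand-built mini-AST. The node layout is invented purely for illustration (it is not Regexp::Parser's structure): character classes count their size, alternation adds, concatenation multiplies, and any quantifier doubles, per the "two paths" rule.

    sub paths {
        my ($op, @arg) = @{ $_[0] };
        return 1                  if $op eq 'lit';     # single literal
        return $arg[0]            if $op eq 'class';   # size of the class
        return 2 * paths($arg[0]) if $op eq 'quant';   # the "two paths" rule
        if ($op eq 'cat') {                            # concatenation multiplies
            my $n = 1;
            $n *= paths($_) for @arg;
            return $n;
        }
        if ($op eq 'alt') {                            # alternation adds
            my $n = 0;
            $n += paths($_) for @arg;
            return $n;
        }
        die "unknown node type: $op";
    }

    print paths([ class => 10 ]), "\n";                          # /[0-9]/      => 10
    print paths([ alt => map { [ lit => $_ ] } 0 .. 9 ]), "\n";  # /(?:0|..|9)/ => 10
    print paths([ quant => [ lit => 'a' ] ]), "\n";              # /a*/         => 2
    print paths([ alt => [ lit => 'a' ], [ lit => 'b' ] ]), "\n"; # /a|b/       => 2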

These are just a few examples of the ways this heuristic falls down. About all it has going for it is that it's fairly easy to explain, understand, and implement.

Nodes in the parse tree

I wouldn't be attempting this at all if I couldn't harness the power of Regexp::Parser. Since it gives me a tree of information about the regular expression, maybe I could assess the complexity of that tree. It would be easy to walk the tree and count the nodes, but I don't have a very good idea of how that would "score" things. It might score some structures as overly complex just because of how it represents them. On the other hand, if it takes more complexity to represent something in a data structure, maybe it really is harder for a programmer to understand.
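
Here's a minimal sketch of the bare node count. It leans on Regexp::Parser's walker iterator as I understand its documentation (each call is supposed to return a ($node, $depth) pair); the exact method names may need checking.

    use Regexp::Parser;

    # Count every node in the parse tree; the per-node score is a flat 1.
    sub node_count {
        my ($regex) = @_;
        my $parser = Regexp::Parser->new;
        $parser->regex($regex) or die "unparsable regex: $regex";
        my $count  = 0;
        my $walker = $parser->walker;
        while (my ($node, $depth) = $walker->()) {
            $count++;
        }
        return $count;
    }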

Perhaps I could walk the tree with my own scoring system. If I can decide that a character class, /[0-9]/, is more complicated than an equivalent /\d/, I can just assign those numbers as I see fit. On the other hand, this would be prone to some personal bias.
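
The same walk with a weight table swapped in for the bare count might look like this. Both the weights and the type names ('anyof' for a bracketed class, 'digit' for \d, following perl's regnode naming, which Regexp::Parser mirrors) are my guesses, and the numbers are pure personal bias, which is rather the point.

    # Hypothetical per-node-type weights; anything unlisted scores 1.
    my %weight = (
        anyof => 3,             # [0-9]-style classes cost more...
        digit => 1,             # ...than the \d shortcut
        star  => 2, plus => 2,  # quantifiers
    );

    sub weighted_score {
        my ($regex) = @_;
        my $parser = Regexp::Parser->new;
        $parser->regex($regex) or die "unparsable regex: $regex";
        my $score  = 0;
        my $walker = $parser->walker;
        while (my ($node, $depth) = $walker->()) {
            my $type = $node->type;   # assuming a type() accessor on nodes
            $score += exists $weight{$type} ? $weight{$type} : 1;
        }
        return $score;
    }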

Another problem with this is that it's very dependent on Regexp::Parser. If it changes its data structures and/or its classifications, a regular expression that hasn't changed could get a new score.

What's so hard to understand?

What do you think about assessing the complexity of regular expressions? What kind of structures do you find easy to understand? What do you like to see and what do you hate to see? If you were writing this, what would be your approach?

Re: Assessing the complexity of regular expressions
by moritz (Cardinal) on Jan 27, 2009 at 18:32 UTC

    A few random thoughts that I had some time ago about regex complexity. They are far from comprehensive and not directly usable as a complexity measure (and also very personally biased), but I hope they provide food for thought.

    Regexes are made of atoms (an atom is something like foobar or \d), groups (which can either capture or not), alternations and quantifiers.

    Regexes are visually rather hard to parse if they have many groups, possibly nested.

    For the mental complexity (i.e., trying to assess what a regex does), you have to note that

    • Most atoms are very easy to understand, independently of whether they are meta-syntactic (like \d or anchors such as ^) or literals (like foobar).
    • Grouping things doesn't make them harder to understand if you do it with simple (...) or (?:...). The complexity of non-backtracking groups (?>...) is debatable: sometimes they make things much more intuitive to understand, sometimes they are counter-intuitive.
    • The complexity of a character class scales roughly linearly with the number of atoms, independently of possible negation.
    • Look-arounds are hard to get right, look-arounds that are within quantified groups are even harder.
    • Code assertions... don't even think about them
    • Back-references are hard, but not as hard as look-arounds.

      Thanks for your thoughts! Your biases are what I was looking for. I was hoping that with enough input there would be a consensus (e.g., "look-around assertions are confusing to everyone"). Perhaps I should have asked what people have the most trouble getting right, what they find themselves fixing most often, or just what they use the most.

      This all seems to favor a "score the tree" approach. As I think about that, I wonder if it makes sense to give the user some input into the scoring. That is, someone could say, "I can't ever understand code assertions, so they're of the highest complexity, but I write look-arounds in my sleep, so they're low complexity". On the other hand, it's supposed to be a maintainability tool, so it makes more sense for things to be scored as the typical programmer sees them.

      Anyway, I appreciate you sharing your thinking on this.

Re: Assessing the complexity of regular expressions
by gwadej (Chaplain) on Jan 27, 2009 at 18:11 UTC

    I think this is an interesting question. Mostly because I have spent a lot of time talking with people about similar complexity issues. That being said, I disagree with the "Matchable strings heuristic".

    I would argue that (?:0|1|2|3|4|5|6|7|8|9) is more complex (less readable) than \d, because it works at a lower level of abstraction. You have to visually scan all of the alternatives to be convinced that it only matches the digits. You even alluded to that with your comparison of [0-8] and an out-of-order alternation.

    \d (match digits) is at a higher level of abstraction. Even [0-9] (match characters between 0 and 9) is less complex than the alternation.

    This is why code that has multiply nested loops is more complex than code with inner loops moved into subroutines. You don't need to deal with the complexity all at once.

    Another counterexample would be .*, which can match infinitely many strings. But most people would consider it much easier to understand than an expression that tried to be more explicit about all of the strings that can be matched. Since . can match any character, should it be the most complex single character in the regex?

    As for my views on regex complexity, you can probably tell that I have a bias against alternation. (I kind of see it as an if-else chain. Not too bad with 2-3 items, but more than that is hard to read.) I think you might consider nesting to be an issue: multiple levels of parens and quantifiers seem to make a regex harder to read. Backreferences also seem to increase complexity to me. It seems like I need to mentally juggle more to keep up with them.

    It will be interesting to see what direction this exploration takes.

    G. Wade
Re: Assessing the complexity of regular expressions
by ikegami (Patriarch) on Jan 27, 2009 at 17:53 UTC
    The depth of the pattern op tree or the number of forks in it would be other options. A long pattern is probably easier to understand than a heavily nested pattern.
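
    For what it's worth, here is a sketch of how that depth might be read off with Regexp::Parser, under the same ($node, $depth) walker assumption as the snippets above:

        use Regexp::Parser;

        # Report the deepest nesting level the walker visits.
        sub tree_depth {
            my ($regex) = @_;
            my $parser = Regexp::Parser->new;
            $parser->regex($regex) or die "unparsable regex: $regex";
            my $max    = 0;
            my $walker = $parser->walker;
            while (my ($node, $depth) = $walker->()) {
                $max = $depth if $depth > $max;
            }
            return $max;
        }
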
Re: Assessing the complexity of regular expressions
by tilly (Archbishop) on Jan 27, 2009 at 18:53 UTC
    Calculating it may be too slow, but one reasonable measure is the length of the compiled version of the regular expression. You can get that printed to STDERR in Perl 5.10 with the re pragma. (I think you want the Debug DUMP option.) Unfortunately capturing that is hard, but in principle you can run it in an external process, capture the output, then look at it.
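
    An untested sketch of that external-process capture: hand the pattern to a child perl on STDIN (avoiding shell quoting), compile it under -Mre=Debug,DUMP, and use the number of dump lines as a rough size. The dump format varies between perl versions, so the number is only comparable within one perl.

        use IPC::Open3 qw(open3);

        sub compiled_size {
            my ($regex) = @_;
            my ($in, $out, $err);   # $err left undef: child STDERR merges into $out
            my $pid = open3($in, $out, $err,
                $^X, '-Mre=Debug,DUMP', '-e',
                'chomp(my $rx = <STDIN>); qr/$rx/');
            print {$in} "$regex\n";   # one-line patterns only; the child reads a single line
            close $in;
            my @dump = <$out>;        # the compiled-program dump, line by line
            waitpid $pid, 0;
            return scalar @dump;
        }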

      I like that idea, but it's not really safe, what with code blocks inside of regular expressions these days.

      use re 'debug';
      my $x = qr/a(??{ BEGIN { die } })/;
      __END__
      Compiling REx "a(??{ BEGIN{ die } })"
      panic: top_env

      Tests with code other than die also show the BEGIN block being executed even though the regular expression is never used.

        It's always good to be on the lookout for security issues. However, just whose code do you plan on using Perl::Critic to critique? Isn't the idea to get people who are already able to execute arbitrary code on your team's systems to think about what they are doing? I'm not sure I'd want to run any random code from out in the wild through any part of Perl::Critic or any other development tool without checking for nasty things like that first. Perhaps in the right context, this safety risk is totally acceptable. In what situation are you using Perl::Critic that this would be a serious problem?
        Good point. However I'll note that the risk is somewhat less when it is run in an external process. Though admittedly, it is not totally gone.
Re: Assessing the complexity of regular expressions
by Jenda (Abbot) on Jan 27, 2009 at 23:07 UTC

    Why?

    I mean it may be a nice exercise, a nice meditation, but what's the use? So Perl::Critic tells you that the regexp on line 1234 of your script is more complex than whatever random threshold someone set. What now? You can add /x, lots of whitespace, and some comments, and hope Perl::Critic notices and actually thinks it's less complex then. Or you can move parts of the regexp into variables and construct the regexp from parts, and again hope Perl::Critic notices and computes the complexity of the stuff that's left on that particular line and not of the completed regexp. Or you can just turn this off.

      I think the people who use Perl::Critic see some value in it. They've already answered this question of why. Among them, if they find that this particular policy stifles their creativity or fails to add value, they can turn it off either globally or for special occasions. For the ones who want it, I want to make it the best that I can.

      (As an aside, I wonder if you would say the same thing of a large subroutine. Say it has too many lines or too many levels of indentation, or what-have-you. Is there any gain in chunking it out into multiple subroutines (as an analog to moving parts of the regexp into variables and constructing the whole from parts)? Is it worth formatting it and commenting it?)

        I'm not asking why Perl::Critic. I'm asking why this policy. If you've got a subroutine that's too long or too deeply indented, you can split it, give some parts their own names, and you end up with something that's better than the original. For some definitions of better.

        Doing the same with a regexp is most often either impossible or complicates the code, and most of the proposed implementations of this policy would not even notice you'd made any change. Perl::Critic should point out places you should probably change; there's no point in pointing out things you can't change.

Re: Assessing the complexity of regular expressions
by mirod (Canon) on Jan 27, 2009 at 19:47 UTC

    How about something simple? Count the metacharacters, assigning a score to each of them if you want to be fancy. \d is 1, [0-9] is 3 ('[', '-' and ']'), (0|1|2|3|4|5|6|7|8|9) is 11. And yes, \\ is 2, and so is \+.

    That would give you a simple-to-compute yet, I think, vaguely adequate number to assign to each regexp.
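
    One reading of that scheme, as a quick tokenizer (the exact character set and the 2-point rule for escapes are my interpretation of the examples):

        # Class shortcuts like \d score 1, other escapes (\\, \+, ...) score 2,
        # bare metacharacters score 1, and anything else is free.
        sub meta_score {
            my ($regex) = @_;
            my $score = 0;
            while ($regex =~ /\G(?:(\\[dswDSW])|(\\.)|([][(){}|*+?^\$.-])|.)/gs) {
                if    (defined $1) { $score += 1 }
                elsif (defined $2) { $score += 2 }
                elsif (defined $3) { $score += 1 }
            }
            return $score;
        }

        print meta_score('\d'), "\n";                     # 1
        print meta_score('[0-9]'), "\n";                  # 3
        print meta_score('(0|1|2|3|4|5|6|7|8|9)'), "\n";  # 11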

Re: Assessing the complexity of regular expressions
by duelafn (Parson) on Jan 28, 2009 at 22:56 UTC

    In all the examples I see, the number of characters that fail to match \w is a good measure of complexity. If you added s/\w+/x/g to the end of your massaging, I think you would get much better results. For example, I consider /[-+=:;"&*'#^`<>]/ quite a bit more complicated than /[qetuoljgdaxvnm]/, even though both are attempting to match a collection of literals. At first glance you fear that the first may do more than it does. Moreover, if you were to add things to that class, you would need to be careful (the position of characters matters, and some need to be escaped). This also solves your long variable name problem.
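
    Roughly, the suggestion bolted onto the existing massaging (my paraphrase in code):

        # Collapse every run of word characters (long variable names included)
        # to a single 'x', then fall back on length, as the policy already does.
        sub massaged_length {
            my ($regex) = @_;
            (my $m = $regex) =~ s/\w+/x/g;
            return length $m;
        }

        # q{[-+=:;"&*'#^`<>]}  keeps all 16 of its characters,
        # q{[qetuoljgdaxvnm]}  collapses to '[x]', length 3.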

    Good Day,
        Dean