Background
I was looking at rt://36098, a bug in Perl::Critic which has a policy against overly complex regular expressions. Right now, the "complexity" of a regular expression is simply its length after a few massages, and this means that an interpolated variable with a long name is regarded as a "complex" regular expression. What I'd like is some other way to assign a numeric value to a regular expression to quantify its complexity.
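To make the problem concrete, here's a minimal sketch of a purely length-based score (the real policy does some massaging first, which this skips), showing how a long variable name inflates the number:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive length-based "complexity": just the pattern's character count.
sub length_score {
    my ($pattern) = @_;
    return length $pattern;
}

# A genuinely fiddly pattern...
print length_score('\d{4}-\d{2}-\d{2}'), "\n";           # 17
# ...scores *lower* than a trivial interpolation of a long name:
print length_score('$date_separator_character'), "\n";   # 25
```

Under this metric, renaming a variable changes how "complex" your regular expression is, which is exactly the bug.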
What is complexity?
Ideally, it's comprehensibility. If I could measure it with a program, it would be the time required for a model programmer to understand the expression. Since I can't measure that, I have to come up with something else.
Treat it like a program
One suggestion was Cyclomatic complexity. This is normally used for analysing programs, but I think it could be applied to regular expressions too. (Some say regular expressions are little programs in their own little language.) This might turn out to be the best option available, but I think implementing it would be a little more work than warranted by the problem.
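A very rough sketch of what a cyclomatic-style count might look like: treat each alternation and each quantifier as a decision point, then add one. This tokenizing is deliberately naive (it miscounts things like `*?` and the `?` in `(?:...)`); a real implementation would work on a parse tree.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Cyclomatic-style estimate: 1 + number of decision points.
sub cyclomatic_estimate {
    my ($pattern) = @_;
    my $decisions = 0;
    $decisions++ while $pattern =~ /(?<!\\)\|/g;                # each alternation
    $decisions++ while $pattern =~ /(?<!\\)[*+?]/g;             # each ?, *, + quantifier
    $decisions++ while $pattern =~ /(?<!\\)\{\d+(?:,\d*)?\}/g;  # each {m}, {m,}, {m,n}
    return 1 + $decisions;
}

print cyclomatic_estimate('abc'), "\n";                  # 1: straight-line "code"
print cyclomatic_estimate('a|b'), "\n";                  # 2: one branch
print cyclomatic_estimate('(\w+)\s*=\s*(\d+)'), "\n";    # 5: four quantifiers
```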
Matchable strings heuristic
I thought it might be easy and meaningful to count the number of strings an expression could match. This leads to having to make some wild assumptions and other semi-sensible assertions, but it's still the best thing I've thought of.
If I'm literally counting matchable strings, /\d/, /[0-9]/, and /(?:0|1|2|3|4|5|6|7|8|9)/ are all equally complex, but I suspect every reader finds one of them much easier to understand than the others (and probably not every reader agrees on which one). Not only that, /(?:3|7|8|4|1|0|2|6|5)/ comes out less complex under that count, because it matches only nine strings, even though you'd have to stare at it longer to realize it's /[0-8]/ by another name.
There's also the problem of /a*/ or /b+/, which are easy to understand but match an infinite number of strings. I think /a*/ may be simpler than /a*?/, but they'll both match infinite strings. In the code I've written, any quantifier counts as two "paths", which means /a*/ is the same as /a|b/. I think that's better than infinity, but perhaps not much better.
These are just a few of the examples of ways this heuristic falls down. About all it has going for it is that it's fairly easy to explain, understand, and implement.
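For flat patterns (no groups, no nesting, no {m,n}), the counting described above can be sketched as follows. Everything here is illustrative: alternation branches add, atoms in sequence multiply, a character class contributes its member count, and any quantifier doubles its atom's paths, per the "two paths" rule. A real version would walk a parse tree instead of scanning text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Path counts for \d-style shortcuts; this table is deliberately incomplete.
my %SHORTCUT_PATHS = ( '\d' => 10 );

# Count "matchable strings" for a flat pattern: top-level alternation only.
sub count_paths {
    my ($pattern) = @_;
    my $total = 0;
    for my $branch (split /\|/, $pattern) {
        my $paths = 1;
        # each atom: character class, escaped char, or literal; optional quantifier
        while ($branch =~ /\G(\[[^\]]+\]|\\.|[^\[\\*+?|()])([*+?]?)/gc) {
            my ($atom, $quant) = ($1, $2);
            my $choices = 1;
            if ($atom =~ /^\[/) {
                # size of the class, expanding simple ranges like 0-9
                my $inside = substr $atom, 1, -1;
                my $n = 0;
                while ($inside =~ /(.)-(.)|./gs) {
                    $n += defined $1 ? ord($2) - ord($1) + 1 : 1;
                }
                $choices = $n;
            }
            elsif ($atom =~ /^\\/) {
                $choices = $SHORTCUT_PATHS{$atom} // 1;
            }
            $choices *= 2 if $quant;   # any quantifier counts as two "paths"
            $paths *= $choices;
        }
        $total += $paths;
    }
    return $total;
}

print count_paths('\d'), "\n";      # 10
print count_paths('[0-9]'), "\n";   # 10
print count_paths('a*'), "\n";      # 2, same as...
print count_paths('a|b'), "\n";     # 2
```

The last two lines show the quantifier rule in action: /a*/ and /a|b/ land on the same score.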
Nodes in the parse tree
I wouldn't be attempting this at all if I couldn't harness the power of Regexp::Parser. Since it gives me a tree of information about the regular expression, maybe I could assess the complexity of that tree. It would be easy to walk the tree and count the nodes, but I don't have a very good idea of how that would "score" things. It might score some structures as overly complex just because of how it represents them. On the other hand, if it requires more complexity to represent something in a data structure, maybe it really is more complicated for a programmer to understand.
Perhaps I could walk the tree with my own scoring system. If I can decide that a character class, /[0-9]/, is more complicated than an equivalent /\d/, I can just use those numbers as I see them. On the other hand, this would be prone to some personal bias.
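One way to make that personal bias explicit is a weight table keyed by node type, summed over a tree walk. The family names below are assumptions loosely modeled on Regexp::Parser's classifications (they may not match the module exactly), and the weights are nothing but opinion:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use List::Util qw(sum0);

# Hypothetical weights per node family; the names are assumptions and the
# numbers encode personal bias, which is the point of making them a table.
my %WEIGHT = (
    exact  => 1,   # literal text
    alnum  => 1,   # \d, \w and friends
    anyof  => 2,   # character classes like [0-9]
    group  => 1,   # grouping parens
    branch => 2,   # alternation
    quant  => 3,   # *, +, ?, {m,n}
    ref    => 4,   # backreferences are hard to follow
);

# Sum the weights for the node families collected from a tree walk;
# unknown families get a middling default.
sub score_families {
    my (@families) = @_;
    return sum0 map { $WEIGHT{$_} // 2 } @families;
}

# e.g. /[0-9]+/ might parse to an 'anyof' node under a 'quant' node:
print score_families(qw(quant anyof)), "\n";   # 5
# while the equivalent /\d+/ is a 'quant' over an 'alnum':
print score_families(qw(quant alnum)), "\n";   # 4
```

Tuning the table is where the bias lives, but at least it's written down in one place instead of scattered through the walking code.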
Another problem with this is that it's very dependent on Regexp::Parser. If it changes its data structures and/or its classifications, a regular expression that hasn't changed could get a new score.
What's so hard to understand?
What do you think about assessing the complexity of regular expressions? What kind of structures do you find easy to understand? What do you like to see and what do you hate to see? If you were writing this, what would be your approach?