in reply to Assessing the complexity of regular expressions
I think this is an interesting question. Mostly because I have spent a lot of time talking with people about similar complexity issues. That being said, I disagree with the "Matchable strings heuristic".
I would argue that (?:0|1|2|3|4|5|6|7|8|9) is more complex (less readable) than \d, because it works at a lower level of abstraction. You have to visually scan all of the alternatives to be convinced that it only matches the digits. You even alluded to that with your comparison of [0-8] and an out of order alternation case.
\d (match digits) is at a higher level of abstraction. Even [0-9] (match characters between 0 and 9) is less complex than the alternation.
This is why code that has multiply nested loops is more complex than code with inner loops moved into subroutines. You don't need to deal with the complexity all at once.
Another counter example would be .* which can match infinitely many strings. But most people would consider it much easier to understand than an expression that tried to be more explicit about all of the strings that can be matched. Since . can match all characters, should it be the most complex single character in the regex?
As for my views on regex complexity, you might can tell that I have a bias against alternation. (I kind of see it as an if-else chain. Not too bad with 2-3 items, but more than that is hard to read.) I think you might consider nesting to be an issue. Multiple levels of parens and quantifiers seem to make a regex harder to read. Backreferences also seem to increase complexity to me. It seems like I need to mentally juggle more to keep up with them.
It will be interesting to see what direction this exploration takes.
|
|---|