Re: Assessing the complexity of regular expressions

A few random thoughts I had some time ago about regex complexity. They are far from comprehensive, not directly usable as a complexity measure, and very personally biased, but I hope they provide food for thought.

Regexes are made of atoms (an atom is something like foobar or \d), groups (which can either capture or not), alternations and quantifiers.

Regexes are visually rather hard to parse if they have many groups, possibly nested.
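To make the "many groups, possibly nested" point a bit more concrete, here is a minimal Python sketch that counts groups and the deepest nesting level in a pattern string. The name group_stats is made up for illustration, and the scanner is deliberately naive: it only knows about backslash escapes and bracketed character classes, and it counts every opening parenthesis (look-arounds included) as a group. Treat the numbers as a rough indicator, not a real parse.

```python
def group_stats(pattern: str) -> tuple[int, int]:
    """Return (number of groups, maximum nesting depth) for a regex string.

    Hypothetical helper: a naive character scanner, not a regex parser.
    """
    groups = depth = max_depth = 0
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if in_class:
            if ch == "]":
                in_class = False  # parentheses inside [...] are literals
        elif ch == "[":
            in_class = True
            # A ']' directly after '[' or '[^' is a literal, so step over it.
            if pattern[i + 1 : i + 2] == "^":
                i += 1
            if pattern[i + 1 : i + 2] == "]":
                i += 1
        elif ch == "(":
            groups += 1
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth = max(depth - 1, 0)
        i += 1
    return groups, max_depth


print(group_stats(r"(a(b(?:c)))(d)"))  # (4, 3)
```

A pattern with a handful of groups at depth 1 usually reads fine; it is the second number climbing that tends to hurt.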

For the mental complexity (i.e. trying to work out what a regex does), note the following:

Most atoms are very easy to understand, regardless of whether they are meta-syntactic (like \d, or anchors such as ^) or literal (like foobar).
Grouping things doesn't make them harder to understand if you do it with a simple (...) or (?:...). The complexity of non-backtracking groups (?>...) is debatable: sometimes they make things much more intuitive, sometimes they are counter-intuitive.
The complexity of a character class scales roughly linearly with the number of atoms in it, whether or not it is negated.
Look-arounds are hard to get right; look-arounds inside quantified groups are even harder.
Code assertions... don't even think about them.
Back-references are hard, but not as hard as look-arounds.
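As a rough illustration of the list above, the following Python sketch takes inventory of the "hard" constructs in a pattern string. Everything here is an assumption for the sake of the example: the names TOKEN and inventory are invented, the pattern text is treated as Perl-style syntax but never compiled, and only the common spellings are recognised (for instance, \g{name} and (?P=name) back-references are not detected). It says nothing about how hard each construct is, only which ones are present.

```python
import re
from collections import Counter

# One alternative per construct; earlier alternatives win, and escapes and
# whole character classes are consumed so that "\(" or "[(?=]" are not
# mistaken for real groups.
TOKEN = re.compile(
    r"""
      (?P<code_assertion>  \(\?\??\{ )                 # (?{...}) or (??{...})
    | (?P<look_around>     \(\?<?[=!] )                # (?= (?! (?<= (?<!
    | (?P<atomic_group>    \(\?> )                     # (?>...)
    | (?P<char_class>      \[\^?\]?(?:\\.|[^\]\\])*\] )
    | (?P<back_reference>  \\[1-9] | \\g\{?-?\d | \\k[<{'] )
    | (?P<escape>          \\. )                       # \d, \(, \$ ...
    | (?P<other>           . )
    """,
    re.VERBOSE | re.DOTALL,
)


def inventory(pattern: str) -> Counter:
    """Count the notable constructs in a regex given as plain text."""
    counts = Counter()
    for match in TOKEN.finditer(pattern):
        if match.lastgroup not in ("escape", "other"):
            counts[match.lastgroup] += 1
    return counts


print(inventory(r"(?<=\$)(\d+)\1(?>a+)[(?=]"))
```

For that last pattern the inventory contains one look-around, one back-reference, one atomic group and one character class; the (?= inside the brackets is correctly ignored.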



Replies are listed 'Best First'.

Re^2: Assessing the complexity of regular expressions
by kyle (Abbot) on Jan 28, 2009 at 17:13 UTC

Thanks for your thoughts! Your biases are what I was looking for. I was hoping that with enough input there would be a consensus (e.g., "look-around assertions are confusing to everyone"). Perhaps I should have asked what people have the most trouble getting right, what they find themselves fixing most often, or just what they use the most.

This all seems to favor a "score the tree" approach. As I think about that, I wonder if it makes sense to give the user some input into the scoring. That is, someone could say, "I can't ever understand code assertions, so they're of the highest complexity, but I write look-arounds in my sleep, so they're low complexity". On the other hand, it's supposed to be a tool of maintainability so it makes more sense for things to be scored as the typical programmer sees them.

Anyway, I appreciate you sharing your thinking on this.
