I think this is an interesting question. Mostly because I have spent a lot of time talking with people about similar complexity issues. That being said, I disagree with the "Matchable strings heuristic".

I would argue that (?:0|1|2|3|4|5|6|7|8|9) is more complex (less readable) than \d, because it works at a lower level of abstraction. You have to visually scan all of the alternatives to be convinced that it only matches the digits. You even alluded to that with your comparison of [0-8] and an out of order alternation case.

\d (match digits) is at a higher level of abstraction. Even [0-9] (match characters between 0 and 9) is less complex than the alternation.

This is why code that has multiply nested loops is more complex than code with inner loops moved into subroutines. You don't need to deal with the complexity all at once.

Another counter example would be .* which can match infinitely many strings. But most people would consider it much easier to understand than an expression that tried to be more explicit about all of the strings that can be matched. Since . can match all characters, should it be the most complex single character in the regex?

As for my views on regex complexity, you might can tell that I have a bias against alternation. (I kind of see it as an if-else chain. Not too bad with 2-3 items, but more than that is hard to read.) I think you might consider nesting to be an issue. Multiple levels of parens and quantifiers seem to make a regex harder to read. Backreferences also seem to increase complexity to me. It seems like I need to mentally juggle more to keep up with them.

It will be interesting to see what direction this exploration takes.

G. Wade

In reply to Re: Assessing the complexity of regular expressions by gwadej
in thread Assessing the complexity of regular expressions by kyle

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.