swiftone has asked for the wisdom of the Perl Monks concerning the following question:

One of the constructs of perl that continues to confuse me is the /x modifier for regexes. I understand what it does and how it works. What I don't understand is how other people find the resulting constructs _easier_ to read. I certainly don't.

I don't think this is a "thinking like a programmer" issue, because the regex is every bit as cryptic. Adding comments helps decode what a section does, but breaking up the regexp makes it harder to see how the different parts relate to one another.

So I'm asking:

  1. Do you find /x regexes easier to read/understand, or harder?
  2. Does this "ease of reading" change one way or the other as you gain experience with regexes?
  3. What kinds of rules should be followed when using /x to ensure the best readability?
I understand that it will always be different from person to person, but I'm looking for general reactions.

Re (tilly) 1: Proper use of //x
by tilly (Archbishop) on Nov 30, 2000 at 20:34 UTC
    Answers.
    1. Do you find /x regexes easier to read/understand, or harder?
      In the real world? Generally harder. But some complex/tricky ones benefit.
    2. Does this "ease of reading" change one way or the other as you gain experience with regexes?
      Yes. When you don't understand them, the comments can help. They are a great learning tool for explaining how an RE works. But they are a crutch, and like real crutches, they bang your legs when time comes to start running.
    3. What kinds of rules should be followed when using /x to ensure the best readability?
      My rule is that if I feel tempted to use /x, then that is a sign I need to refactor my problem. Instead of single massive regexes, I like to have straightforward regexes together with higher-level looping logic. For this, pos() is quite useful to know about.
    Summary: I think /x is a wonderful idea, which I am very glad exists and is excellent for teaching. (Indeed the inspiration came from trying to teach someone.) But using it in production code violates the rule of trying to only maintain one document. If you can read the RE directly, then the comments are superfluous at best, and misleading at worst.

    For anyone who doesn't believe that comments in the wrong place can be misleading, I invite you to read Things are not what they seem like. again. And this isn't just a bizarre obfuscation technique; in the real world, overly verbose commenting is a very common way for well-meaning people to produce unmaintainable messes...
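
    A minimal sketch of the style described in answer 3: several small regexes driven by a loop, with \G, /gc and pos() tracking the current position. The sub name simple_tokens and the token set are assumptions made for illustration; this is not tilly's code.

```perl
use strict;
use warnings;

# Hypothetical helper: tokenize a small expression language with several
# simple regexes and a loop, instead of one large commented regex.
sub simple_tokens {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length $text) {
        # /gc keeps pos() where it was when a match fails, so each
        # alternative is tried at the same offset (anchored by \G).
        if    ($text =~ m{\G\s+}gc)                { next }               # skip whitespace
        elsif ($text =~ m{\G(\d+)}gc)              { push @tokens, $1 }   # integer
        elsif ($text =~ m{\G([A-Za-z]\w*)}gc)      { push @tokens, $1 }   # identifier
        elsif ($text =~ m{\G(:=|\*\*|[-+*/^()=])}gc) { push @tokens, $1 } # operator
        else {
            die "Unexpected character at offset " . pos($text) . "\n";
        }
    }
    return @tokens;
}

print join(' | ', simple_tokens('x := 3 + 42')), "\n";
```

    Each pattern stays short enough to read without /x, and the error branch reports exactly where tokenizing stopped.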

Re: Proper use of //x
by Dominus (Parson) on Dec 01, 2000 at 04:12 UTC
    Sometimes it's useful, sometimes it isn't. In my regex class, I have an example that is a complete tokenizer for a calculator program, in one regex. The calculator accepts integer and floating-point numerals, +, -, *, /, ^, and ** operators, = for equality, := for assignment, parentheses for operator grouping, and variable names. The tokenizer is simple and easy to read:

    sub tokens {
      my @tokens = split m{ (   \*\* | :=                # ** or := operator
                              | [-+*/^()=]               # some other operator
                              | [A-Za-z]\w+              # Identifier
                              | \d*\.\d+(?:[Ee]\d+)?     # Decimal number
                              | \d+                      # Integer
                            )
                          | \s+                          # White space
                          }x, shift();
      return grep /\S/, @tokens;
    }
    (To see what this does, pass it a string like (Foo := 12) + 37^2-42*bar.)

    Now what would this look like without /x? It would be a lot harder to understand:

    sub tokens {
      my @tokens = split m{(\*\*|:=|[-+*/^()=]|[A-Za-z]\w+|\d*\.\d+(?:[Ee]\d+)?|\d+)|\s+},
                         shift();
      return grep /\S/, @tokens;
    }
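
    A small check that the two spellings behave the same, assuming the final alternative in both is meant to be \s+ (whitespace between tokens). The names $with_x, $without_x and tokens_with are hypothetical, introduced only for this comparison.

```perl
use strict;
use warnings;

# The tokenizer pattern written with /x and comments ...
my $with_x = qr{
    (   \*\* | :=                 # ** or := operator
      | [-+*/^()=]                # some other operator
      | [A-Za-z]\w+               # identifier
      | \d*\.\d+(?:[Ee]\d+)?      # decimal number
      | \d+                       # integer
    )
  | \s+                           # whitespace between tokens
}x;

# ... and the same pattern on one line without /x.
my $without_x = qr{(\*\*|:=|[-+*/^()=]|[A-Za-z]\w+|\d*\.\d+(?:[Ee]\d+)?|\d+)|\s+};

sub tokens_with {
    my ($re, $text) = @_;
    # split yields undef for the capture group whenever the \s+ branch
    # matched, so test definedness before looking for non-space characters.
    return grep { defined($_) && /\S/ } split $re, $text;
}

my $input = '(Foo := 12) + 37^2-42*bar';
print join(' ', tokens_with($with_x,    $input)), "\n";
print join(' ', tokens_with($without_x, $input)), "\n";
```

    Both lines print the same 13 tokens: ( Foo := 12 ) + 37 ^ 2 - 42 * bar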
Re: Proper use of //x
by japhy (Canon) on Nov 30, 2000 at 23:19 UTC
    Here's an efficient way to properly match HTML comments in a string of text. I use the "unrolling the loop" technique, and thus the commenting ability is helpful.
    $COMMENT_rex = qr{
      <!--                # opening <!--
      [^-]*               # 0 or more non hyphens
      (?:
        (?! -- \s* > )    # as long as it's NOT a -->
        -                 # -
        (?:
          -               # -
          [^-]*           # 0 or more non -
          (?: - [^-]+ )*  # -, 1 or more non -, 0 or more times
        )?                # optionally...
        [^-]*             # 0 or more non -
      )*                  # 0 or more times
      -- \s* >            # the ending -->
    }x;
    The regex itself is complex, because the element it's matching is complex. Here's the unadulterated regex:
    $COMMENT_rex = qr{<!--[^-]*(?:(?!--\s*>)-(?:-[^-]*(?:-[^-]+)*)?[^-]*)*--\s*>};
    Ick.
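
    For anyone who wants to try the pattern, here is a short usage sketch. The sample string and the variable names $html, @comments and $stripped are made up for illustration.

```perl
use strict;
use warnings;

# The one-line form of the pattern, copied from the post above.
my $COMMENT_rex = qr{<!--[^-]*(?:(?!--\s*>)-(?:-[^-]*(?:-[^-]+)*)?[^-]*)*--\s*>};

my $html = 'x <!-- one --> y <!-- two - with a hyphen -- > z';

# In list context, /g returns every captured comment.
my @comments = $html =~ /($COMMENT_rex)/g;

# Work on a copy so the original string stays intact.
(my $stripped = $html) =~ s/$COMMENT_rex//g;

print "$_\n" for @comments;
print "[$stripped]\n";
```

    The second comment shows two details of the pattern: a lone hyphen inside the comment body does not end the match, and whitespace between the closing -- and > is accepted.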

    japhy -- Perl and Regex Hacker
Re: Proper use of //x
by Fastolfe (Vicar) on Nov 30, 2000 at 20:14 UTC
    It depends on the nature and complexity of the regex. Breaking out every little element makes it harder to read for me. Personally, I can read just about any kind of regex you throw at me without comments (and for this I prefer no /x), but like any programming language, sometimes it helps to keep your thoughts in order for maintainability reasons by adding some comments.

    One suggestion I would make to potential /x users is not to break things apart unnecessarily. Break the regex into logical chunks (repeating segments for example), but don't go into depth explaining what each little \d does. We already know what it does.

Re: Proper use of //x
by mirod (Canon) on Nov 30, 2000 at 23:30 UTC

    Actually I like /x, mostly to comment properly which parts of the regexp I want to capture:

    $exp=~ m{^([\w]+|\*)\s*       # a name or '*' ($1)
             (?:                  # an optional condition in brackets
               \[\s*(\w+)         # name ($2)
               \s*=\s*            # =
               "([^"]*)"          # "string" ($3)
               \s*\])?}x;

    I really find it much easier to maintain later if I know where in the regexp I capture a specific value.
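
    A hedged usage sketch of that pattern, to show how the annotated captures come out. The wrapper name parse_step and the sample inputs are assumptions for illustration, and the comments are reworded slightly.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern; returns the three captures,
# or an empty list when the expression does not match.
sub parse_step {
    my ($exp) = @_;
    return unless $exp =~ m{
        ^ ( \w+ | \* ) \s*      # capture 1: a name or '*'
        (?:                     # an optional condition in brackets
            \[ \s* (\w+)        # capture 2: attribute name
            \s* = \s*           # =
            "( [^"]* )"         # capture 3: quoted string
            \s* \]
        )?
    }x;
    return ($1, $2, $3);
}

my ($name, $attr, $value) = parse_step('para[class="intro"]');
print "$name / $attr / $value\n";
```

    When the bracketed condition is absent, as in parse_step('*'), the second and third captures come back undefined.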

Re: Proper use of //x
by gaspodethewonderdog (Monk) on Nov 30, 2000 at 20:15 UTC
    Well in a lot of cases //x isn't useful... *but*...
    $s =~ /(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|
            BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
            CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)/x;
    I've personally had horrendously huge regexes that either were going to be about 3,000 characters on one line or had to be broken up across multiple lines. /x is the only way you can do this without picking up spurious white space and carriage returns and still make things look nice. Yes, you have to escape any spaces you have, but that's the price you have to pay for readability and maintainability sometimes!
      Personally, I find \s works just as nicely as a space, and is far more consistent with other regex symbols. Is that what you meant by "escape any spaces"?
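
      A quick sketch of the options for matching a space under /x. The sample strings and variable names are made up for illustration. Note that \s is not an exact substitute for a literal space, because it also matches tabs and newlines.

```perl
use strict;
use warnings;

my $text = "hello world";

my $plain   = $text =~ /hello world/    ? 1 : 0;   # literal space, no /x
my $ignored = $text =~ /hello world/x   ? 1 : 0;   # /x drops the space: /helloworld/
my $escaped = $text =~ /hello\ world/x  ? 1 : 0;   # backslash-space survives /x
my $class   = $text =~ /hello[ ]world/x ? 1 : 0;   # space in a character class survives /x

# \s accepts any whitespace, so it matches a tab where an escaped space does not.
my $tab_ws    = "hello\tworld" =~ /hello\sworld/x ? 1 : 0;
my $tab_space = "hello\tworld" =~ /hello\ world/x ? 1 : 0;

print "plain=$plain ignored=$ignored escaped=$escaped class=$class\n";
print "tab_ws=$tab_ws tab_space=$tab_space\n";
```

      This prints plain=1 ignored=0 escaped=1 class=1, then tab_ws=1 tab_space=0.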
(Guildenstern) Re: Proper use of //x
by Guildenstern (Deacon) on Nov 30, 2000 at 20:17 UTC
    I tend to try to use //x whenever I write code that I know someone else is going to have to maintain. At the same time, I also put the entire regex on one line in a comment just before the "real" version, so that someone reading the code can see it in both contexts.
    I wouldn't exactly say this makes regexes easier to read, but for me it helps cement in my mind what each part of the regex does. If I have a comment that says "Here we're looking to extract X,Y, and Z", it's not as helpful as having the comment of "This extracts X" directly after the chunk of the RE that does that operation.

    Guildenstern
    Negaterd character class uber alles!
Re: Proper use of //x
by marvell (Pilgrim) on Nov 30, 2000 at 23:00 UTC

    1. Do you find /x regexes easier to read/understand, or harder?

    Generally harder. Regular expressions are generally a doddle to read, even if you have very little experience (see Lama Glama).

    2. Does this "ease of reading" change one way or the other as you gain experience with regexes?

    I don't know if it's just me, but I found regexps quite easy from the start. I had never seen a /x until I saw one in some code I had to work on. It was no easier to read, and seemed to be there to look flash rather than to improve the code.

    3. What kinds of rules should be followed when using /x to ensure the best readability?

    I tend to find that if I want to use /x, I should really be doing the job in some more code, not trying to do it in the expression. Code with comments is by far easier to read than a regexp with comments.

    --

    Brother Marvell

Re: Proper use of //x
by zigster (Hermit) on Dec 01, 2000 at 17:42 UTC
    1.) Do you find /x regexes easier to read/understand, or harder?

    Easier to understand, harder to read, and harder to mod. I have found that if a regexp is complex enuff to warrant inline commenting, then the inline commenting makes reading the expression itself too damn difficult. The danger IMHO is that you write a regexp, then comment it with /x, and in effect ensure that the code cannot be changed without removing the comments. I must confess I tend not to use /x; I tend to place a comment block above the regexp and paraphrase the complex regexp in that.

    2.) Does this "ease of reading" change one way or the other as you gain experience with regexes?

    I think that regexps can't be read at all like regular code (or at least I can't). When reading code I have long since stopped seeing the specific commands and instead look at the flow and interpolate. Just as when reading English, I don't think I see all the words; my brain just rushes over them, filling in the details later. With regexps there seems to be no easy way to do this. I just sit down and read them statement by statement, backfilling as required. It gets quicker, because I don't need to refer to a manual now *grins*.

    3.) What kinds of rules should be followed when using /x to ensure the best readability?

    Personally I don't use them at all often. The only time I would has already been listed in the thread: to break up regexps that are > 80 chars. I get quite cross if any code line is > 80 chars.

    --

    Zigster