ack has asked for the wisdom of the Perl Monks concerning the following question:

I was reading a Meditations node, by moritz. The subject was The cost of unchecked best practices.

His discussion was related to some issues with respect to best practices and some code strategies of another monk in another posting.

But, as a part of his discussion he noted:

This optimization is not performed for char classes with single entries.

I understand completely what moritz was talking about; but his comment got me to thinking and peaked my curiosity.

Classes in regexes, I am told, are somewhat slow (I believe the tend to cause a lot of backtracking as it tests the various alternatives in the class until it finds one that matches...or until none are found to match). The moritz comment leads me to believe that there are, for some programmers, a value to having a regex class with only a single entry; but I don't see the value.

My question is: Why would one want a class with a single entry in a regex?

ack Albuquerque, NM
  • Comment on Why would one want in a regex a class with only a single entry?

Replies are listed 'Best First'.
Re: Why would one want in a regex a class with only a single entry?
by ysth (Canon) on Mar 25, 2008 at 06:21 UTC
    I can think of two cases: first, a character class might contain a variable with as few as one character in it (/[$foo]/); second, under the /x modifier, it's a common way to indicate a literal space (/ foo [ ] bar /x).
      first, a character class might contain a variable with as few as one character in it (/[$foo]/);

      ... and in that case, it's worthwhile to test the variable first, to make sure it contains something (i.e. length($foo)>0), because an empty string in that position is a runtime error, if I'm not mistaken.

      Thanks, those both make sense...especially the second one. I don't use the /x modifier much (though as I'm learning, it really helps me remember why I constructed regex's a particular way after I've been away from it for a while) so that one I completely missed.

      I'm curious about the first case, though. Given what I've been led to believe are the penalities of using classes, why wouldn't one check the variable to see if it has just one entry and cast it differently...i.e., in a more regex-engine-efficient form?

      Just curious.

      ack Albuquerque, NM
        Given what I've been led to believe are the penalities of using classes, why wouldn't one check the variable to see if it has just one entry and cast it differently...i.e., in a more regex-engine-efficient form?
        Because only in very unusual cases would the penalties amount to anything noticeable. And the extra code to treat one character specially means more opportunities for bugs as well as less readable/maintainable code.

        In any case, the penalty is gone in 5.10.0 and will be gone in 5.8.9.

Re: Why would one want in a regex a class with only a single entry?
by davido (Cardinal) on Mar 25, 2008 at 06:26 UTC

    One might not hand-craft a regexp with single-entry character classes, but how about code-built regexps? When a regexp is built programatically, and there's a situation where one time a character class may contain one entry, and another times "n" entries, the fact that one-entry character classes are legal is simply one less thing to worry about. In that regard, I can see why it may be useful for such a construct to exist and to be legal.

    And actually, there may be some single-entry character classes that are helpful, particularly among negated classes:

    /[^T]ap/ # Match anything that's not 'T', followed by 'ap'

    This is one time where a single-entry character class is actually probably the best way to do it.


    Dave

      If a negated class can qualify as a single character class, then so should the POSIX ones ([::]), property ones (\p{}), "normal" Perl backslash ones (\d, \s, etc.), among others.

        That doesn't bother me. The OP seemed to be discussing the validity and practicality of single-entry character classes. And I believe he was discussing them in the context of the [....] operator in a regexp. The backslash CC's and property CC's are already 'single entry', so probably not really so relevant to the discussion, particularly where they can stand-alone outside of a user-defined character class.

        Certainly the POSIX classes constitute "single entries". And that's definitely another good argument for why single-entry character classes are practical, and should be legal (which, of course, they are).

        The negated character class of a single entry just happened to be the first example that jumped to my mind. Your POSIX thought is great too.


        Dave

      Thanks, davido. You and ysth both hit on similar situations. I especially like the negated class (which ends up, in the regex engine being a many-valued class (I suspect...though I'm by no means that knowledgable of the regex engine).

      In my response to ysth, I wondered why one wouldn't except the single engtry, programmatically built-up case and provide a more regex-engine-efficient cast in that instance. Your comment that:

      ...is simply one less thing to worry about.

      makes good sense to me and is in the spirit of what a lot of Perl culture is all about.

      I think I'd probably take the extra effort to check for the single entry and try to cast it as a special case; but that's just my compulsive nature.

      Thanks, davido (and everyone) for your insights. You're all helping me to be a better monk.

      ack Albuquerque, NM
Re: Why would one want in a regex a class with only a single entry?
by Skeeve (Parson) on Mar 25, 2008 at 07:51 UTC

    Additionally to the comments above and not entirely perl related: I often use 1-character classes in Unix when grep-ing for processes:

    ps -afe | grep '[m]yprocess'

    People who don't know this trick usualy type:

    ps -afe | grep 'myprocess' | grep -v grep

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

      Is that a timing thing? Because, if it is, that little trick is pretty diabolical. :)

Re: Why would one want in a regex a class with only a single entry?
by hipowls (Curate) on Mar 25, 2008 at 07:53 UTC

    Character classes don't involve backtracking as this example shows.

    my $string = 'bet'; use re qw(debug); my $rx1 = qr/b(?:a|e)t/; $string =~ /$rx1/; my $rx2 = qr/b[ae]t/; $string =~ /$rx2/;
    output of $rx1 which shows the branch
    Compiling REx `b(?:a|e)t' size 12 Got 100 bytes for offset annotations. first at 1 1: EXACT <b>(3) 3: BRANCH(6) 4: EXACT <a>(10) 6: BRANCH(9) 7: EXACT <e>(10) 9: TAIL(10) 10: EXACT <t>(12) 12: END(0) anchored "b" at 0 (checking anchored) minlen 3 Offsets: [12] 1[1] 0[0] 4[1] 5[1] 0[0] 6[1] 7[1] 0[0] 7[0] 9[1] 0[0] 10[0] Guessing start of match, REx "b(?:a|e)t" against "bet"... Found anchored substr "b" at offset 0... Guessed: match at offset 0 Matching REx "b(?:a|e)t" against "bet" Setting an EVAL scope, savestack=7 0 <> <bet> | 1: EXACT <b> 1 <b> <et> | 3: BRANCH Setting an EVAL scope, savestack=13 1 <b> <et> | 4: EXACT <a> failed... 1 <b> <et> | 7: EXACT <e> 2 <be> <t> | 10: EXACT <t> 3 <bet> <> | 12: END Match successful!
    output of $rx2 where there is no branching.
    Compiling REx `b[ae]t' size 16 Got 132 bytes for offset annotations. first at 1 1: EXACT <b>(3) 3: ANYOF[ae](14) 14: EXACT <t>(16) 16: END(0) anchored "b" at 0 (checking anchored) minlen 3 Offsets: [16] 1[1] 0[0] 2[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[ +0] 6[1] 0[0] 7[0] Guessing start of match, REx "b[ae]t" against "bet"... Found anchored substr "b" at offset 0... Guessed: match at offset 0 Matching REx "b[ae]t" against "bet" Setting an EVAL scope, savestack=9 0 <> <bet> | 1: EXACT <b> 1 <b> <et> | 3: ANYOF[ae] 2 <be> <t> | 14: EXACT <t> 3 <bet> <> | 16: END Match successful! Freeing REx: `"b(?:a|e)t"' Freeing REx: `"b[ae]t"'
    These examples were run under perl 5.8.8. 5.9.5 introduced trie optimization which gets rid of the branch. Output of $rx1 under 5.10.0
    Compiling REx "b(?:a|e)t" Final program: 1: EXACT <b> (3) 3: TRIE-EXACT[ae] (10) <a> <e> 10: EXACT <t> (12) 12: END (0) anchored "b" at 0 (checking anchored) minlen 3 Guessing start of match in sv for REx "b(?:a|e)t" against "bet" Found anchored substr "b" at offset 0... Guessed: match at offset 0 Matching REx "b(?:a|e)t" against "bet" 0 <> <bet> | 1:EXACT <b>(3) 1 <b> <et> | 3:TRIE-EXACT[ae](10) 1 <b> <et> | State: 1 Accepted: 0 Charid: +2 CP: 65 After State: 3 2 <be> <t> | State: 3 Accepted: 1 Charid: +2 CP: 0 After State: 0 got 1 possible matches only one match left: #2 <e> 2 <be> <t> | 10:EXACT <t>(12) 3 <bet> <> | 12:END(0) Match successful! Freeing REx: "b(?:a|e)t"

      Ah, of course! I mistakenly was thinking of the alternating classes rather than classes themselves. My mistake...and a novice one at that.

      Thanks, hipowls.

      ack Albuquerque, NM
Re: Why would one want in a regex a class with only a single entry?
by Your Mother (Archbishop) on Mar 25, 2008 at 06:32 UTC

    Negated single character classes are very useful. Often a quick path for ridding oneself of .*. Poor (and lucky; use a parser for real work) man's HTML stripper: s{</?[a-z][^>]*>}{}g;

    (sidenote: "pique" ne "peak")

Re: Why would one want in a regex a class with only a single entry?
by grinder (Bishop) on Mar 25, 2008 at 12:17 UTC

    One justification I have heard people make (not that I subscribe to the idea myself) is that one doesn't have to worry so much about meta-characters, in that nearly every single character loses its meta function in a character class.

    Instead of having to write /a\+b/, one can write /a[+]b/. Thus, instead of having to worry about whether a punctuation character needs to be escaped or not, one can stick it in a character class and be done with it.

    I think this is a false economy, but, if it makes you happy...

    • another intruder with the mooring in the heart of the Perl

      It's not a false economy if, for legibility reasons:

      m/a[ ]b/ # You wish to make it obvious that a space is needed. m/a[ ]{2,5}b/ # You wish to define a quantifier on a whitespace m/a[ ]+b/ # character that would otherwise be difficult to read. m/a[ ]b/x # You're using the /x modifier where whitespace # is ignored by default.

      Dave

      Thanks, grinder. I believe, that this is very similar to tsf's use for adding in white space when using the /x modifier.

      I would note that although, as you noted it may be:

      ...a false economy...

      it does seem to improve readability since, to me, /a[+]b/ seems more readable than /a\+b/.

      ack Albuquerque, NM
Re: Why would one want in a regex a class with only a single entry?
by dragonchild (Archbishop) on Mar 25, 2008 at 11:44 UTC
    I already gave an answer in that thread in Re: The cost of unchecked best practices. It is to more safely deal with changing requirements when using a regex as a parsing tool.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      Thanks, dragonchild. When I saw your response, I got sidetracked thinking about the single entry class.

      I just went back and looked much more closely at your response and realized that I missed that point.

      I have not ever tackled what, for me, would be a formidable task of constructing a parsing tool. I tend to use regex's for much simpler constructs.

      My question was not meant to disrespect your information in that thread; your post had so much depth and good info.

      UPDATE: Sorry, I just re-read my comment here and realize that I may have inadvertantly mis-communicated. The response I was referring to in the first sentence was the response that dragonchild had posted in the other thread...not this thread. My comment in this reply would seem coarse if it were referring to the response to this thread. I absolutely would never want to respond so poorly.

      ack Albuquerque, NM