Why would one want in a regex a class with only a single entry?

ack has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Why would one want in a regex a class with only a single entry? by ysth (Canon) on Mar 25, 2008 at 06:21 UTC
I can think of two cases: first, a character class might contain a variable with as few as one character in it (`/[$foo]/`); second, under the /x modifier, it's a common way to indicate a literal space (`/ foo [ ] bar /x`). -- Online Fortune Cookie Search	[reply] [d/l] [select]
Re^2: Why would one want in a regex a class with only a single entry? by graff (Chancellor) on Mar 25, 2008 at 22:29 UTC
first, a character class might contain a variable with as few as one character in it `(/[$foo]/);` ... and in that case, it's worthwhile to test the variable first, to make sure it contains something (i.e. `length($foo)>0`), because an empty string in that position is a runtime error, if I'm not mistaken.	[reply] [d/l] [select]
Re^2: Why would one want in a regex a class with only a single entry? by ack (Deacon) on Mar 26, 2008 at 03:41 UTC
Thanks, those both make sense...especially the second one. I don't use the /x modifier much (though as I'm learning, it really helps me remember why I constructed regex's a particular way after I've been away from it for a while) so that one I completely missed. I'm curious about the first case, though. Given what I've been led to believe are the penalities of using classes, why wouldn't one check the variable to see if it has just one entry and cast it differently...i.e., in a more regex-engine-efficient form? Just curious. ack Albuquerque, NM	[reply]
Re^3: Why would one want in a regex a class with only a single entry? by ysth (Canon) on Mar 26, 2008 at 03:49 UTC
Given what I've been led to believe are the penalities of using classes, why wouldn't one check the variable to see if it has just one entry and cast it differently...i.e., in a more regex-engine-efficient form? Because only in very unusual cases would the penalties amount to anything noticeable. And the extra code to treat one character specially means more opportunities for bugs as well as less readable/maintainable code. In any case, the penalty is gone in 5.10.0 and will be gone in 5.8.9. -- Online Fortune Cookie Search	[reply]
Re^4: Why would one want in a regex a class with only a single entry? by ack (Deacon) on Mar 27, 2008 at 05:53 UTC
Re: Why would one want in a regex a class with only a single entry? by davido (Cardinal) on Mar 25, 2008 at 06:26 UTC
One might not hand-craft a regexp with single-entry character classes, but how about code-built regexps? When a regexp is built programatically, and there's a situation where one time a character class may contain one entry, and another times "n" entries, the fact that one-entry character classes are legal is simply one less thing to worry about. In that regard, I can see why it may be useful for such a construct to exist and to be legal. And actually, there may be some single-entry character classes that are helpful, particularly among negated classes: `/[^T]ap/ # Match anything that's not 'T', followed by 'ap'` [download] This is one time where a single-entry character class is actually probably the best way to do it. Dave	[reply] [d/l]
Re^2: Why would one want in a regex a class with only a single entry? by parv (Parson) on Mar 25, 2008 at 06:31 UTC
If a negated class can qualify as a single character class, then so should the POSIX ones (`[::]`), property ones (`\p{}`), "normal" Perl backslash ones (`\d`, `\s`, etc.), among others.	[reply] [d/l] [select]
Re^3: Why would one want in a regex a class with only a single entry? by davido (Cardinal) on Mar 25, 2008 at 06:39 UTC
That doesn't bother me. The OP seemed to be discussing the validity and practicality of single-entry character classes. And I believe he was discussing them in the context of the `[....]` operator in a regexp. The backslash CC's and property CC's are already 'single entry', so probably not really so relevant to the discussion, particularly where they can stand-alone outside of a user-defined character class. Certainly the POSIX classes constitute "single entries". And that's definitely another good argument for why single-entry character classes are practical, and should be legal (which, of course, they are). The negated character class of a single entry just happened to be the first example that jumped to my mind. Your POSIX thought is great too. Dave	[reply] [d/l]
Re^4: Why would one want in a regex a class with only a single entry? by parv (Parson) on Mar 25, 2008 at 08:59 UTC
Re^4: Why would one want in a regex a class with only a single entry? by ack (Deacon) on Mar 26, 2008 at 03:54 UTC
Re^2: Why would one want in a regex a class with only a single entry? by ack (Deacon) on Mar 26, 2008 at 03:49 UTC
Thanks, davido. You and ysth both hit on similar situations. I especially like the negated class (which ends up, in the regex engine being a many-valued class (I suspect...though I'm by no means that knowledgable of the regex engine). In my response to ysth, I wondered why one wouldn't except the single engtry, programmatically built-up case and provide a more regex-engine-efficient cast in that instance. Your comment that: ...is simply one less thing to worry about. makes good sense to me and is in the spirit of what a lot of Perl culture is all about. I think I'd probably take the extra effort to check for the single entry and try to cast it as a special case; but that's just my compulsive nature. Thanks, davido (and everyone) for your insights. You're all helping me to be a better monk. ack Albuquerque, NM	[reply]
Re: Why would one want in a regex a class with only a single entry? by Skeeve (Parson) on Mar 25, 2008 at 07:51 UTC
Additionally to the comments above and not entirely perl related: I often use 1-character classes in Unix when grep-ing for processes: `ps -afe \| grep '[m]yprocess'` [download] People who don't know this trick usualy type: `ps -afe \| grep 'myprocess' \| grep -v grep` [download] `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e	[reply] [d/l] [select]
Re^2: Why would one want in a regex a class with only a single entry? by rudder (Scribe) on Mar 27, 2008 at 00:41 UTC
Is that a timing thing? Because, if it is, that little trick is pretty diabolical. :)	[reply]
Re: Why would one want in a regex a class with only a single entry? by hipowls (Curate) on Mar 25, 2008 at 07:53 UTC
Character classes don't involve backtracking as this example shows. `my $string = 'bet'; use re qw(debug); my $rx1 = qr/b(?:a\|e)t/; $string =~ /$rx1/; my $rx2 = qr/b[ae]t/; $string =~ /$rx2/;` [download] output of `$rx1` which shows the branch Compiling REx `b(?:a\|e)t' size 12 Got 100 bytes for offset annotations. first at 1 1: EXACT <b>(3) 3: BRANCH(6) 4: EXACT <a>(10) 6: BRANCH(9) 7: EXACT <e>(10) 9: TAIL(10) 10: EXACT <t>(12) 12: END(0) anchored "b" at 0 (checking anchored) minlen 3 Offsets: [12] 1[1] 0[0] 4[1] 5[1] 0[0] 6[1] 7[1] 0[0] 7[0] 9[1] 0[0] 10[0] Guessing start of match, REx "b(?:a\|e)t" against "bet"... Found anchored substr "b" at offset 0... Guessed: match at offset 0 Matching REx "b(?:a\|e)t" against "bet" Setting an EVAL scope, savestack=7 0 <> <bet> \| 1: EXACT <b> 1 <b> <et> \| 3: BRANCH Setting an EVAL scope, savestack=13 1 <b> <et> \| 4: EXACT <a> failed... 1 <b> <et> \| 7: EXACT <e> 2 <be> <t> \| 10: EXACT <t> 3 <bet> <> \| 12: END Match successful! [download] output of `$rx2` where there is no branching. Compiling REx `b[ae]t' size 16 Got 132 bytes for offset annotations. first at 1 1: EXACT <b>(3) 3: ANYOF[ae](14) 14: EXACT <t>(16) 16: END(0) anchored "b" at 0 (checking anchored) minlen 3 Offsets: [16] 1[1] 0[0] 2[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[ +0] 6[1] 0[0] 7[0] Guessing start of match, REx "b[ae]t" against "bet"... Found anchored substr "b" at offset 0... Guessed: match at offset 0 Matching REx "b[ae]t" against "bet" Setting an EVAL scope, savestack=9 0 <> <bet> \| 1: EXACT <b> 1 <b> <et> \| 3: ANYOF[ae] 2 <be> <t> \| 14: EXACT <t> 3 <bet> <> \| 16: END Match successful! Freeing REx: `"b(?:a\|e)t"' Freeing REx: `"b[ae]t"' [download] These examples were run under perl 5.8.8. 5.9.5 introduced trie optimization which gets rid of the branch. Output of `$rx1` under 5.10.0 Compiling REx "b(?:a\|e)t" Final program: 1: EXACT <b> (3) 3: TRIE-EXACT[ae] (10) <a> <e> 10: EXACT <t> (12) 12: END (0) anchored "b" at 0 (checking anchored) minlen 3 Guessing start of match in sv for REx "b(?:a\|e)t" against "bet" Found anchored substr "b" at offset 0... Guessed: match at offset 0 Matching REx "b(?:a\|e)t" against "bet" 0 <> <bet> \| 1:EXACT <b>(3) 1 <b> <et> \| 3:TRIE-EXACT[ae](10) 1 <b> <et> \| State: 1 Accepted: 0 Charid: +2 CP: 65 After State: 3 2 <be> <t> \| State: 3 Accepted: 1 Charid: +2 CP: 0 After State: 0 got 1 possible matches only one match left: #2 <e> 2 <be> <t> \| 10:EXACT <t>(12) 3 <bet> <> \| 12:END(0) Match successful! Freeing REx: "b(?:a\|e)t" [download]	[reply] [d/l] [select]
Re^2: Why would one want in a regex a class with only a single entry? by ack (Deacon) on Mar 26, 2008 at 03:57 UTC
Ah, of course! I mistakenly was thinking of the alternating classes rather than classes themselves. My mistake...and a novice one at that. Thanks, hipowls. ack Albuquerque, NM	[reply]
Re: Why would one want in a regex a class with only a single entry? by Your Mother (Archbishop) on Mar 25, 2008 at 06:32 UTC
Negated single character classes are very useful. Often a quick path for ridding oneself of `.`. Poor (and lucky; use a parser for real work) man's HTML stripper: `s{</?[a-z][^>]>}{}g;` (sidenote: "pique" ne "peak")	[reply] [d/l] [select]
Re: Why would one want in a regex a class with only a single entry? by grinder (Bishop) on Mar 25, 2008 at 12:17 UTC
One justification I have heard people make (not that I subscribe to the idea myself) is that one doesn't have to worry so much about meta-characters, in that nearly every single character loses its meta function in a character class. Instead of having to write `/a\+b/`, one can write `/a[+]b/`. Thus, instead of having to worry about whether a punctuation character needs to be escaped or not, one can stick it in a character class and be done with it. I think this is a false economy, but, if it makes you happy... • another intruder with the mooring in the heart of the Perl	[reply]
Re^2: Why would one want in a regex a class with only a single entry? by davido (Cardinal) on Mar 26, 2008 at 05:32 UTC
It's not a false economy if, for legibility reasons: `m/a[ ]b/ # You wish to make it obvious that a space is needed. m/a[ ]{2,5}b/ # You wish to define a quantifier on a whitespace m/a[ ]+b/ # character that would otherwise be difficult to read. m/a[ ]b/x # You're using the /x modifier where whitespace # is ignored by default.` [download] Dave	[reply] [d/l]
Re^2: Why would one want in a regex a class with only a single entry? by ack (Deacon) on Mar 26, 2008 at 04:27 UTC
Thanks, grinder. I believe, that this is very similar to tsf's use for adding in white space when using the /x modifier. I would note that although, as you noted it may be: ...a false economy... it does seem to improve readability since, to me, /a[+]b/ seems more readable than /a\+b/. ack Albuquerque, NM	[reply]
Re: Why would one want in a regex a class with only a single entry? by dragonchild (Archbishop) on Mar 25, 2008 at 11:44 UTC
I already gave an answer in that thread in Re: The cost of unchecked best practices. It is to more safely deal with changing requirements when using a regex as a parsing tool. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply]
Re^2: Why would one want in a regex a class with only a single entry? by ack (Deacon) on Mar 26, 2008 at 04:05 UTC
Thanks, dragonchild. When I saw your response, I got sidetracked thinking about the single entry class. I just went back and looked much more closely at your response and realized that I missed that point. I have not ever tackled what, for me, would be a formidable task of constructing a parsing tool. I tend to use regex's for much simpler constructs. My question was not meant to disrespect your information in that thread; your post had so much depth and good info. UPDATE: Sorry, I just re-read my comment here and realize that I may have inadvertantly mis-communicated. The response I was referring to in the first sentence was the response that dragonchild had posted in the other thread...not this thread. My comment in this reply would seem coarse if it were referring to the response to this thread. I absolutely would never want to respond so poorly. ack Albuquerque, NM	[reply]