Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks,
I have a regex below that is used to catch input (a sentence) containing illegal characters. I'm puzzled why I needed to escape the $ sign so that the digit "0" is not matched, i.e. with $ unescaped, a string such as "hello there0" is matched (so not okay), while with $ escaped, the same string is not matched (so okay).
if ($input =~ /[<>@#\$%^&*()_+=|\{\}\[\]\/\\]/) { print "not okay"; } else { print "okay"; }
Please enlighten me and many thanks in advance.
|
Re: Regex help
by everybody (Scribe) on Jul 25, 2009 at 17:52 UTC | |
by ww (Archbishop) on Jul 25, 2009 at 19:47 UTC | |
I don't think so. Did you test that notion? See characters in a character-class in "Mastering Regular Expressions" and Friedl's explanation that character-class-metacharacters are NOT the same as metacharacters outside a character class (pp 9-10). He also alludes to the differences elsewhere in the text. Very little is interpolated inside a character class (with the obvious exceptions of the likes of "-" or an initial "^" negation).
However, returning to OP's conundrum: it does NOT appear to be quite as stated. Rather, it appears to me that "Hello there0" matches his posted regex, while removing the preceding "\" before the "$" causes it to fail. Now, the distraction of a very nice summer day (one of very few, thus far) discourages me from feeling that the following is an adequate answer, but /me thinks "hello there0" passes his regex because his regex does not include decimal digits (see my line 14). However, I must admit, I don't understand why, when using my line 11 rather than 14, the output is as it is: "hello there0" is not okay, but "hello there$ now" is. Brighter minds, wiser heads, pray edify!
Update: graff has. See his correct analysis below and everybody's above. Apologies to everybody and to OP. Balance of this post allowed to stand for whatever value (as an object lesson in incomplete testing/analysis) it may have.
by graff (Chancellor) on Jul 25, 2009 at 21:17 UTC | |
The OP figured out that putting a backslash in front of "$" would make the regex work as intended, but did not understand why, and everybody gave the correct explanation: without the backslash, you get an interpolation of the variable "$%", and its value turns out to be zero. Try this (NB: this uses bash shell style quoting): You'll see a zero in both lines of output.
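The interpolation described above can be sketched as a standalone script rather than a shell one-liner. This is only an illustration, not the code from the original post: the shortened character class and the sample strings are assumptions chosen to keep it small. "$%" is Perl's built-in page-number variable for the currently selected output handle, and it is 0 unless formats have changed it.

```perl
use strict;
use warnings;

# $% is a built-in variable: the current page number of the selected
# output handle. It is 0 unless formats have changed it.
my $page = $%;

my $unescaped = qr/[#$%^&]/;    # $% is interpolated, so the class is [#0^&]
my $escaped   = qr/[#\$%^&]/;   # \$ keeps a literal dollar sign in the class

print "value of the page-number variable: $page\n";

for my $input ("hello there0", "hello there", 'costs $5') {
    my $u = $input =~ $unescaped ? "not okay" : "okay";
    my $e = $input =~ $escaped   ? "not okay" : "okay";
    printf "%-13s unescaped => %s, escaped => %s\n", $input, $u, $e;
}
```

With the unescaped class, the digit 0 is flagged and a literal dollar sign is not, which is the reverse of what the OP wanted.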
|
Re: Regex help
by graff (Chancellor) on Jul 25, 2009 at 21:34 UTC | |
I'm assuming those are the "legal" characters because those are the things that we "normally" expect in sentences and that weren't in the OP regex (and because you didn't say anything about dealing with stuff outside the ASCII range of 0-127). In this sort of exercise, saying exactly what's legal is often better than trying to list everything that's illegal, because you might find that your data contains "illegal" things that you forgot to include (or didn't know you had to, like unexpected control characters or things outside the ASCII range).
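A rough sketch of the allow-list idea, for readers following along. The exact set of legal characters here (ASCII letters, digits, whitespace and a few common punctuation marks) and the helper name check_input are assumptions for illustration, not the regex from the original post.

```perl
use strict;
use warnings;

# Hypothetical allow-list: ASCII letters, digits, whitespace and common
# sentence punctuation. Anything outside this set counts as illegal.
my $illegal = qr/[^A-Za-z0-9\s.,;:!?'"-]/;

# check_input is a made-up helper name used only for this example.
sub check_input {
    my ($input) = @_;
    return $input =~ $illegal ? "not okay" : "okay";
}

for my $input ("hello there0", 'price is $5', "bell\x07 char") {
    print check_input($input), "\n";
}
```

Because the class is negated, control characters and anything outside the ASCII range are rejected without having to be listed, and there is no "$" or "@" in the pattern to interpolate by accident.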
by Anonymous Monk on Jul 26, 2009 at 06:49 UTC | |
Thanks graff! I've replaced my regex with yours - it makes so much sense :) And thanks to everyone for commenting!
|
Re: Regex help
by pKai (Priest) on Jul 25, 2009 at 22:58 UTC | |
Seems like any punctuation variables are interpolated inside regex character classes. This is a surprise (for me at least, and for ww above too, it seems). Is that documented behaviour? Where?
by graff (Chancellor) on Jul 26, 2009 at 02:50 UTC | |
Yes, in perlre, as follows: An unescaped "$" or "@" interpolates the corresponding variable, while escaping will cause the literal string "\$" to be matched. (Though in the version of the perlre man page I have installed, for perl 5.8.8, this sentence comes second in a paragraph that begins with: 'You cannot include a literal "$" or "@" within a "\Q" sequence.' I can understand that some might consider this obscure.)
by pKai (Priest) on Jul 26, 2009 at 09:48 UTC | |
Those remarks in perlre are not specific to character classes, and one regularly thinks these character classes are more special. Explicit mentioning of $ being special in character classes is found in perlretut#Using-character-classes: …The special characters for a character class are -]\^$ (and the pattern delimiter, whatever it is). ] is special because it denotes the end of a character class. $ is special because it denotes a scalar variable.… So indeed not only punctuation variables are being expanded:
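A small sketch of that point with an ordinary named scalar. The variable name $vowels and the test strings are made up for this example and are not from the original post.

```perl
use strict;
use warnings;

my $vowels = "aeiou";

my $expanded = qr/[$vowels]/;    # $vowels is interpolated: the class is [aeiou]
my $literal  = qr/[\$vowels]/;   # escaped: matches one of $ v o w e l s

for my $word ("cat", "xyz", '$') {
    my $e = $word =~ $expanded ? "match" : "no match";
    my $l = $word =~ $literal  ? "match" : "no match";
    printf "%-4s expanded => %s, literal => %s\n", $word, $e, $l;
}
```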
by Anonymous Monk on Jul 26, 2009 at 03:05 UTC | |
So for no interpolation, you can use qr'', m'', s'''
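A minimal sketch of the single-quote delimiter form, assuming a shortened character class and a made-up helper name has_symbol. When the delimiter is a single quote, Perl does no variable interpolation on the pattern, so "$" inside the class stays a literal dollar sign.

```perl
use strict;
use warnings;

my $literal      = qr'[#$%^&]';   # single-quote delimiters: no interpolation
my $interpolated = qr/[#$%^&]/;   # $% is replaced by its value (0 by default)

# has_symbol is a hypothetical helper showing the m'' form in a condition.
sub has_symbol {
    my ($input) = @_;
    return $input =~ m'[#$%^&]' ? 1 : 0;
}

for my $input ("hello there0", 'costs $5') {
    my $l = $input =~ $literal      ? "match" : "no match";
    my $i = $input =~ $interpolated ? "match" : "no match";
    printf "%-13s single-quoted => %s, slash-delimited => %s\n", $input, $l, $i;
}
```

The trade-off is that no variable can be interpolated into such a pattern at all, so this form only suits patterns that are entirely fixed.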