Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I have the regex below, which is used to catch input (a sentence) that contains illegal characters. I'm puzzled why I need to escape the $ sign to keep the digit "0" from matching: with $ unescaped, a string such as "hello there0" is matched (so not okay); with $ escaped, the same string is not matched (so okay).

if ($input =~ /[<>@#\$%^&*()_+=|\{\}\[\]\/\\]/) { print "not okay"; } else { print "okay"; }

Please enlighten me and many thanks in advance.

Replies are listed 'Best First'.
Re: Regex help
by everybody (Scribe) on Jul 25, 2009 at 17:52 UTC
    If you don't escape the $ sign, the '$%' will be interpolated as the current page number of the currently selected output filehandle, which in your case happens to be 0.
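    For anyone who wants to see the interpolation directly, here is a minimal sketch (the variable names are mine, not the OP's). It compiles the character class both ways and tries the two strings discussed in this thread. It assumes no formats are in use, so $% still holds its default value of 0.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Unescaped: "$%" is interpolated (as 0) before the regex is compiled,
# so the class contains "0" but neither "$" nor "%".
my $unescaped = qr/[<>@#$%^&*()_+=|\{\}\[\]\/\\]/;

# Escaped: the class contains a literal "$" and a literal "%".
my $escaped   = qr/[<>@#\$%^&*()_+=|\{\}\[\]\/\\]/;

for my $input ('hello there0', 'hello there$ now') {
    printf "%-18s unescaped: %-8s escaped: %s\n",
        $input,
        ( $input =~ $unescaped ? 'not okay' : 'okay' ),
        ( $input =~ $escaped   ? 'not okay' : 'okay' );
}
```

    With the unescaped class "hello there0" is rejected and "hello there$ now" is accepted; with the escaped class it is the other way round.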

      I don't think so. Did you test that notion? Suggestion: try moving the "\$" to the last position in the class.

      See characters in a character-class in "Mastering Regular Expressions" and Friedl's explanation that character-class-metacharacters are NOT the same as metacharacters outside a character class (pp 9-10). He also alludes to the differences elsewhere in the text. Very little is interpolated inside a character class (with the obvious exceptions of the likes of "-" or an initial "^" negation).

      However, returning to OP's conundrum: it does NOT appear to be quite as stated. Rather, it appears to me that "Hello there0" matches his posted regex, while removing the preceding "\" before the "$" causes it to fail.

      Now, the distraction of a very nice summer day (one of very few, thus far) discourages me from feeling that the following is an adequate answer, but /me thinks "hello there0" passes his regex because his regex does not include decimal digits (compare the last, active test in my script below, which adds 0-9 to the class).

      However, I must admit, I don't understand why, when using the test with the unbackslashed $ instead of that last one, the output is as it is: "hello there0" is not okay, but "hello there$ now" is. Brighter minds; wiser heads, pray edify!

      Update: graff has. See his correct analysis below and everybody's above. Apologies to everybody and to OP. Balance of this post allowed to stand for whatever value (as an object lesson in incomplete testing/analysis) it may have.

      #!/usr/bin/perl
      use strict;
      use warnings;
      # 783199

      my @data = <DATA>;

      for my $line (@data) {
          chomp $line;
      #    if ( $line =~ /[<>@#\$%^&*()_+=|\{\}\[\]\/\\]/ ) {   # OP's char class, i.e. "escaped the $ sign"
      #    if ( $line =~ /[<>@#%^&*()_+=|\{\}\[\]\/\\\$]/ ) {   # OP's char class with \$ moved to end of class
      #    if ( $line =~ /[<>@#$%^&*()_+=|\{\}\[\]\/\\]/ ) {    # $ sign not backslashed
      #    if ( $line =~ /[<>@#\$%^&*()_+=|{}\]\/\\]/ ) {       # unnecessary backslashes removed
          if ( $line =~ /[0-9<>@#\$%^&*()_+=|{}[\]\/\\]/ ) {    # rule out decimal digits
              print "\$line: $line \t is not okay \n";
          } else {
              print "\$line: $line \t is okay \n";
          }
      }

      =head

      $ sign in char class preceded by a backslash

      $line: hello there0                     is okay
      $line: hello there                      is okay
      $line: hello there$ now                 is not okay
      $line: hello there $ now                is not okay
      $line: hello there [                    is not okay
      $line: by [rights] this should fail.    is not okay

      $ sign in char class NOT preceded by a backslash

      $line: hello there0                     is not okay
      $line: hello there                      is okay
      $line: hello there$ now                 is okay
      $line: hello there $ now                is okay
      $line: hello there [                    is not okay
      $line: by [rights] this should fail.    is not okay

      Unnecessary backslashes removed:

      $line: hello there0                     is okay
      $line: hello there                      is okay
      $line: hello there$ now                 is not okay
      $line: hello there $ now                is not okay
      $line: hello there [                    is okay
      $line: by [rights] this should fail.    is not okay

      and with Decimal Digits rejected:

      $line: hello there0                     is not okay
      $line: hello there                      is okay
      $line: hello there$ now                 is not okay
      $line: hello there $ now                is not okay
      $line: hello there [                    is not okay
      $line: by [rights] this should fail.    is not okay

      =cut

      __DATA__
      hello there0
      hello there
      hello there$ now
      hello there $ now
      hello there [
      by [rights] this should fail.
        Sorry, but it looks to me like the first reply above got it right. The OP said that if the value of $input was "hello there0", and the regex did not have a backslash in front of the "$" in the character class, the regex would match and yield "not okay" as the output. But this goes against the intent of the regex, which is to match only on the particular set of non-alphanumeric characters -- including "$" and "%".

        The OP figured out that putting backslash in front of "$" would make the regex work as intended, but did not understand why, and everybody gave the correct explanation: without the backslash, you get an interpolation of the variable "$%", and its value turns out to be zero. Try this (NB: this uses bash shell style quoting):

        perl -le '$regex = qr:[<>@#$%^&*()_+=|\{\}\[\]\/\\]:; print "$%"; print "$regex";'
        You'll see a zero in both lines of output.
Re: Regex help
by graff (Chancellor) on Jul 25, 2009 at 21:34 UTC
    The first reply above answers the mystery for you, but on a side note, you should consider whether it would be better/easier if the regex specifies just the acceptable characters -- something like:
    my $decision = ( $input =~ /[^\s\w.,:;'"\/?!~`-]/ ) ? "not okay" : "okay";
    print $decision;
    I'm assuming those are the "legal" characters because those are the things that we "normally" expect in sentences and that weren't in the OP regex (and because you didn't say anything about dealing with stuff outside the ASCII range of 0-127).

    In this sort of exercise, saying exactly what's legal is often better than trying to list everything that's illegal, because you might find that your data contains "illegal" things that you forgot to include (or didn't know you had to, like unexpected control characters or things outside the ASCII range).
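    A hedged sketch of that allow-list approach. The helper name is_clean is hypothetical, and the set of legal characters is only the one guessed at in this reply, so adjust it to your data. Note that \w also admits the underscore, which the OP's original list treated as illegal.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true when every character of $input is whitespace,
# a word character, or one of the listed punctuation marks.
# m{...} delimiters are used so that "/" needs no backslash in the class.
sub is_clean {
    my ($input) = @_;
    return $input !~ m{[^\s\w.,:;'"/?!~`-]};
}

for my $input ( "Hello there, how's it going?", 'hello there$ now', "bell\x07" ) {
    print is_clean($input) ? "okay\n" : "not okay\n";
}
```

    The third string contains a control character (BEL) that nobody thought to put on a deny-list, and it is rejected all the same.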

      Thanks graff!

      I've replaced my regex with yours - it makes so much sense :)

      And thanks to everyone for commenting!

Re: Regex help
by pKai (Priest) on Jul 25, 2009 at 22:58 UTC
    E:\Temp>perl -Mstrict -we "$_=7;die qq(matched '$&'\n) if '1234567_.'=~/[$_]/"
    matched '7'

    Seems like any punctuation variable is interpolated inside regex character classes.

    This is a surprise (for me at least; and to ww above too it seems)

    Is that documented behaviour? Where?


    Update: Fixed attribution of surprise, naming the wrong person (graff) when citing a post of ww.

      Is that documented behaviour? Where?

      Yes, in perlre, as follows:

      An unescaped "$" or "@" interpolates the corresponding variable, while escaping will cause the literal string "\$" to be matched.

      (Though in the version of the perlre man page I have installed, for perl 5.8.8, this sentence comes second in a paragraph that begins with:

      You cannot include a literal "$" or "@" within a "\Q" sequence.

      I can understand that some might consider this obscure.)

        Those remarks in perlre are not specific to character classes, and one tends to think of character classes as a special case with rules of their own.

        Explicit mentioning of $ being special in character classes is found in perlretut#Using-character-classes:

        …The special characters for a character class are -]\^$ (and the pattern delimiter, whatever it is). ] is special because it denotes the end of a character class. $ is special because it denotes a scalar variable.…

        So indeed, it is not only punctuation variables that are expanded:

        E:\Temp>perl -Mstrict -we "my $foo=7;die qq(matched '$&'\n) if '1234567rab_.'=~/[${foo}bar]+/"
        matched '7rab'
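        The same interpolation can be put to deliberate use. A small sketch, assuming the illegal characters live in a variable ($illegal_chars is a name made up for this example): \Q...\E backslashes every non-word character of the interpolated value, so none of them can act as a metacharacter inside the class.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Single quotes: no interpolation here, and '\\' yields one backslash.
my $illegal_chars = '<>@#$%^&*()_+=|{}[]/\\';

# \Q...\E applies quotemeta to the interpolated value, so "]", "^", "\"
# and friends all end up as literal members of the class.
my $illegal = qr/[\Q$illegal_chars\E]/;

for my $input ( 'hello there0', 'hello there$ now', 'by [rights]' ) {
    print "$input: ", ( $input =~ $illegal ? 'not okay' : 'okay' ), "\n";
}
```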
      perlre also says: "Because patterns are processed as double quoted strings, the following also work:"

      \t          tab                   (HT, TAB)
      \n          newline               (LF, NL)
      \r          return                (CR)
      \f          form feed             (FF)
      \a          alarm (bell)          (BEL)
      \e          escape (think troff)  (ESC)
      \033        octal char            (example: ESC)
      \x1B        hex char              (example: ESC)
      \x{263a}    long hex char         (example: Unicode SMILEY)
      \cK         control char          (example: VT)
      \N{name}    named Unicode character
      \l          lowercase next char (think vi)
      \u          uppercase next char (think vi)
      \L          lowercase till \E (think vi)
      \U          uppercase till \E (think vi)
      \E          end case modification (think vi)
      \Q          quote (disable) pattern metacharacters till \E
      So for no interpolation, you can use qr'', m'', s'''
      my $f = 2;
      print qr/$f/,"\n";   # (?-xism:2)
      print qr'$f',"\n";   # (?-xism:$f)
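      Applied to the OP's problem, a sketch assuming the character class from the original question (the variable name $illegal is mine): with single-quote delimiters nothing is interpolated, so the $ needs no backslash. The regex engine still sees the backslashes, so \] and \\ remain necessary inside the class.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Single-quote delimiters: "$%" is NOT interpolated, so "$" and "%" are
# ordinary members of the class. "/" needs no backslash either, because
# it is no longer the delimiter.
my $illegal = qr'[<>@#$%^&*()_+=|{}\[\]/\\]';

for my $input ( 'hello there0', 'hello there$ now', '50% off' ) {
    print "$input: ", ( $input =~ $illegal ? 'not okay' : 'okay' ), "\n";
}
```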