Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I'm currently trying to write some perl to:

read input
validate that the input is alpha-numeric, a hyphen or a space
strip out "malicious characters"
return the sanitised input

Whilst I seem to be able to match any input that is not "safe", I'm having problems with the substitution. After playing around with it on and off for the last 2 days, I'm not managing to nail it. Can anyone tell me what I'm missing please? (code is part of a larger script, hence the sub, I'm about 3 days into perl so apologies if my code/layout sucks!)

Thanks in advance.
#!/usr/bin/perl

#supply test strings on the command line for ease
$test1 = shift(@ARGV);

sub validate_text {
    if ( $test1 =~ m/^[A-Za-z0-9\-\ ]+$/ ) {
        print "Text clean\n";
        return ($test1);
    }
    else {
        $errorstring = "Funny business with variables occuring, have attempted to fix";
        print "$errorstring\n";
        print "dollartest before fixing = $test1\n";
        $test1 =~ s/^[^A-Za-z\-\ ]+$//g;
        print "dollartest after fixing = $test1\n";
        return ($test1);
    }
}

$validated = &validate_text;
print "the validated text is: $validated\n";

The output from the script follows:
[lowprivuser@localhost testing]$ perl regexp-check2.pl abcd\"efg
Funny business with variables occuring, have attempted to fix
dollartest before fixing = abcd"efg
dollartest after fixing = abcd"efg
the validated text is: abcd"efg
[lowprivuser@localhost testing]$ perl regexp-check2.pl abcdefg1
Text clean
the validated text is: abcdefg1
[lowprivuser@localhost testing]$

Replies are listed 'Best First'.
Re: Inverse regexes for input validation
by sgifford (Prior) on Mar 28, 2007 at 16:45 UTC
    The problem is the ^ "beginning-of-string" anchor at the start of your regex and the $ "end-of-string" anchor at the end. Together they make the substitution mean "if the entire string consists of invalid characters, delete the entire string". What I think you mean is "if there are any invalid characters in the string, replace them with nothing", which you get by simply removing the anchors. Note that the class also needs 0-9, as in your validation regex, or digits will be stripped as well:
    $test1 =~ s/[^A-Za-z0-9\-\ ]+//g;
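To make the effect of the anchors concrete, here is a minimal sketch comparing the anchored and unanchored substitutions. The sample strings and variable names are made up for illustration, and the character class is assumed to include 0-9 to match the validation regex in the question.

```perl
use strict;
use warnings;

# Hypothetical sample input containing one disallowed character (a double quote).
my $input = 'abcd"efg';

# Anchored: the negated class must cover the whole string, so nothing is removed here.
( my $anchored = $input ) =~ s/^[^A-Za-z0-9 -]+$//g;

# Unanchored: every run of disallowed characters is removed wherever it occurs.
( my $unanchored = $input ) =~ s/[^A-Za-z0-9 -]+//g;

# A string made up only of disallowed characters is the one case the anchored form deletes.
( my $all_bad = '"!"' ) =~ s/^[^A-Za-z0-9 -]+$//g;

print "anchored:   $anchored\n";
print "unanchored: $unanchored\n";
print "all bad:    '$all_bad'\n";
```

Only the unanchored form changes the mixed string, which matches the "before fixing" and "after fixing" lines being identical in the output posted above.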

    By the way, if you're dealing with untrusted input and trying to avoid doing unsafe things with it, you should read about "taint mode" in perlsec.
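As a rough sketch of the pattern perlsec describes: under taint mode, external data is normally untainted by extracting it through a regex capture. The helper name untaint_text is hypothetical, and the script is assumed to be run as "perl -T script.pl" (the code also runs without -T, in which case nothing is tainted to begin with).

```perl
use strict;
use warnings;

# untaint_text is a hypothetical helper name, not part of any module.
# Returns the validated text, or undef if the input contains anything unexpected.
sub untaint_text {
    my ($raw) = @_;
    return unless defined $raw;

    # Under -T, data extracted via a capture group is treated as untainted,
    # so the pattern should accept only what is genuinely safe.
    if ( $raw =~ /\A([A-Za-z0-9 -]+)\z/ ) {
        return $1;
    }
    return;
}

my $clean = untaint_text('abc 123-x');
print defined $clean ? "accepted: $clean\n" : "rejected\n";
```

This version rejects bad input instead of repairing it; whether to reject or to strip is a design choice that depends on what the larger script does with the value.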

Re: Inverse regexes for input validation
by ikegami (Patriarch) on Mar 28, 2007 at 16:43 UTC

    /^[^A-Za-z\-\ ]+$/ will only match if the string contains only bad characters.

    sub validate_text {
        my ($text) = @_;
        if ($text =~ s/[^A-Za-z0-9 -]+//g) {
            print "Funny business fixed\n";
        }
        else {
            print "Text was already clean\n";
        }
        return $text;
    }

    my $validated = validate_text($test1);
      thanks for that, seems I overcomplicated it far too much and wasn't using local vars properly

      turns out all I needed was
      my ($text) = @_;
      $text =~ s/[^A-Za-z0-9 -]+//g;
      return $text;
      thanks very much for the help.
        Not that it makes a whole lot of difference in your case, but the "tr///" operator is especially good for this sort of thing -- it's demonstrably faster than "s///" (utf8 wide characters don't even seem to slow it down), and the syntax is a little easier:
        my ( $text ) = shift;
        $text =~ tr/ 0-9A-Za-z-//cd;
        return $text;
        The "c" modifier says the match should apply to the complement of the characters cited, and with no characters in the replacement side, "d" says delete all matched characters.
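A small sketch of the "c" and "d" modifiers in action. The helper name strip_unsafe and the sample string are invented for illustration. It also uses the fact that tr returns the number of characters it matched, which gives a cheap way to tell whether anything was removed.

```perl
use strict;
use warnings;

# strip_unsafe is a hypothetical name for the helper described above.
sub strip_unsafe {
    my ($text) = @_;

    # c: complement the listed set; d: delete matched characters that have no replacement.
    # tr returns the count of characters matched, which here is the number deleted.
    my $removed = ( $text =~ tr/ 0-9A-Za-z-//cd );

    return ( $text, $removed );
}

my ( $clean, $removed ) = strip_unsafe('abcd"efg 1-2;');
print "clean: $clean\n";
print "removed: $removed\n";
```

Here the double quote and the semicolon are deleted, so the count is 2 and the cleaned text is "abcdefg 1-2".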
Re: Inverse regexes for input validation
by GrandFather (Saint) on Mar 28, 2007 at 21:34 UTC

    It's strongly recommended btw that you use strictures (use strict; use warnings). Wouldn't have caught this problem, but I can guarantee it will save you hours at some point in the future.
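To illustrate the kind of mistake strictures catch, here is a minimal sketch with a deliberately misspelled variable name ($txet is made up for the demo). The snippet is compiled through a string eval only so the demo can print the error instead of dying.

```perl
use strict;
use warnings;

# In a real script the typo below would stop compilation outright;
# the string eval just lets this demo capture and report the message.
my $snippet = q{
    use strict;
    my $text = 'abcd"efg';
    $txet =~ s/[^A-Za-z0-9 -]+//g;   # typo: $txet was never declared
    1;
};

my $compiled = eval $snippet;
my $error    = $@;

print $compiled ? "compiled cleanly\n" : "strict caught it: $error";
```

Without "use strict", the misspelled name would quietly become a new global and the substitution would run against an empty value, leaving $text untouched.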


    DWIM is Perl's answer to Gödel