Inverse regexes for input validation

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I'm currently trying to write some perl to:

read input
validate that the input is alpha-numeric, a hyphen or a space
strip out "malicious characters"
return the sanitised input

Whilst I seem to be able to match any input that is not "safe", I'm having problems with the substitution. After playing around with it on and off for the last 2 days, I still can't nail it. Can anyone tell me what I'm missing, please? (The code is part of a larger script, hence the sub. I'm about 3 days into Perl, so apologies if my code/layout sucks!)

Thanks in advance.

#!/usr/bin/perl

#supply test strings on the command line for ease
$test1 = shift(@ARGV);

sub validate_text
{
        if ( $test1 =~ m/^[A-Za-z0-9\-\ ]+$/ )
        {
                print "Text clean\n";
                return ($test1);
        } else {
                $errorstring = "Funny business with variables occuring, have attempted to fix";

                print "$errorstring\n";
                print "dollartest before fixing = $test1\n";

                $test1 =~ s/^[^A-Za-z\-\ ]+$//g;
                print "dollartest after fixing = $test1\n";

                return ($test1);
        }
}
$validated = &validate_text;

print "the validated text is: $validated\n";

The output from the script follows:

[lowprivuser@localhost testing]$ perl regexp-check2.pl abcd\"efg

 Funny business with variables occuring, have attempted to fix
dollartest before fixing = abcd"efg
dollartest after fixing = abcd"efg
the validated text is: abcd"efg

[lowprivuser@localhost testing]$ perl regexp-check2.pl abcdefg1

Text clean

the validated text is: abcdefg1

[lowprivuser@localhost testing]$


Replies are listed 'Best First'.
Re: Inverse regexes for input validation by sgifford (Prior) on Mar 28, 2007 at 16:45 UTC
The problem is the `^` "beginning-of-string" anchor at the beginning of your regex, and the `$` "end-of-string" anchor at the end. It ends up meaning "if the entire string consists of invalid characters, delete the entire string". What I think you mean is "if there are any invalid characters in the string, replace them with nothing", which you get by simply removing the anchors: `$test1 =~ s/[^A-Za-z\-\ ]+//g;`

By the way, if you're dealing with untrusted input and trying to avoid doing unsafe things with it, you should read about "taint mode" in perlsec.
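
A minimal sketch of the difference the anchors make, using a test string like the one in the question plus one made up for illustration (`$all_bad`). Note that the negated class as posted omits `0-9`, so the unanchored form also deletes digits; the last substitution adds `0-9` on the assumption that digits should survive, since the validation regex allows them.

```perl
use strict;
use warnings;

my $input   = 'abcd"efg1';   # mixed string: one bad character in the middle
my $all_bad = '"!"';         # hypothetical string made up only of bad characters

# Anchored: matches only when the WHOLE string consists of bad characters.
(my $anchored_mixed = $input)   =~ s/^[^A-Za-z\-\ ]+$//g;
(my $anchored_bad   = $all_bad) =~ s/^[^A-Za-z\-\ ]+$//g;

# Unanchored: every run of bad characters is removed wherever it occurs.
# Digits are not in the class, so they count as "bad" here too.
(my $unanchored = $input) =~ s/[^A-Za-z\-\ ]+//g;

# Same, with 0-9 added to the class so digits are kept.
(my $with_digits = $input) =~ s/[^A-Za-z0-9\-\ ]+//g;

printf "anchored, mixed input:   '%s'\n", $anchored_mixed;
printf "anchored, all-bad input: '%s'\n", $anchored_bad;
printf "unanchored:              '%s'\n", $unanchored;
printf "unanchored with 0-9:     '%s'\n", $with_digits;
```

The anchored form leaves the mixed string untouched, which matches the "after fixing" output shown in the question.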
Re: Inverse regexes for input validation by ikegami (Patriarch) on Mar 28, 2007 at 16:43 UTC
`/^[^A-Za-z\-\ ]+$/` will only match if the string contains only bad characters.

sub validate_text {
    my ($text) = @_;
    if ($text =~ s/[^A-Za-z0-9 -]+//g) {
        print "Funny business fixed\n";
    } else {
        print "Text was already clean\n";
    }
    return $text;
}
my $validated = validate_text($test1);
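
The `if` in this reply relies on `s///g` returning the number of substitutions made, and a false value when nothing matched, so one statement both cleans the string and reports whether cleaning was needed. A minimal sketch of that behaviour; the sub name `strip_unsafe` and the sample strings are made up for illustration:

```perl
use strict;
use warnings;

# Returns the cleaned text plus how many runs of disallowed characters were removed.
sub strip_unsafe {
    my ($text) = @_;
    # s///g returns the substitution count, or a false value if nothing matched;
    # "|| 0" turns that false value into a plain 0 for printing.
    my $removed = ($text =~ s/[^A-Za-z0-9 -]+//g) || 0;
    return ($text, $removed);
}

for my $sample ('abcdefg1', 'abcd"efg', 'a;b|c') {
    my ($clean, $removed) = strip_unsafe($sample);
    printf "%-10s -> %-10s (%d run(s) removed)\n", $sample, $clean, $removed;
}
```

Because the pattern ends in `+`, the count is the number of runs of bad characters, not the number of individual characters.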
Re^2: Inverse regexes for input validation by chris-lon (Initiate) on Mar 28, 2007 at 17:00 UTC
Thanks for that. It seems I overcomplicated it far too much and wasn't using local vars properly. It turns out all I needed was `my ($text) = @_; $text =~ s/[^A-Za-z0-9 -]+//g; return $text;`. Thanks very much for the help.
Re^3: Inverse regexes for input validation by graff (Chancellor) on Mar 29, 2007 at 01:15 UTC
Not that it makes a whole lot of difference in your case, but the "tr///" operator is especially good for this sort of thing -- it's demonstrably faster than "s///" (utf8 wide characters don't even seem to slow it down), and the syntax is a little easier: `my ( $text ) = shift; $text =~ tr/ 0-9A-Za-z-//cd; return $text;`

The "c" modifier says the match should apply to the complement of the characters cited, and with no characters in the replacement side, "d" says delete all matched characters.
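
A small sketch comparing the two forms on the same input. `tr///cd` also returns the number of characters it matched (here, deleted), which can double as a "was anything wrong?" flag. The sample string is made up for illustration, and no timing is measured here.

```perl
use strict;
use warnings;

my $input = 'abc"d; e-f1';   # hypothetical input with two bad characters

# tr with c (complement) and d (delete): remove everything NOT in the list.
(my $via_tr = $input) =~ tr/ 0-9A-Za-z-//cd;

# The equivalent substitution from the earlier replies.
(my $via_s = $input) =~ s/[^A-Za-z0-9 -]+//g;

# tr returns the number of characters it matched, i.e. how many it deleted.
my $copy    = $input;
my $deleted = ($copy =~ tr/ 0-9A-Za-z-//cd);

printf "tr result: '%s'\n", $via_tr;
printf "s  result: '%s'\n", $via_s;
printf "characters deleted by tr: %d\n", $deleted;
```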
Re: Inverse regexes for input validation by GrandFather (Saint) on Mar 28, 2007 at 21:34 UTC
It's strongly recommended btw that you use strictures (`use strict; use warnings`). Wouldn't have caught this problem, but I can guarantee it will save you hours at some point in the future.

DWIM is Perl's answer to Gödel
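
As an illustration of the kind of mistake strictures do catch, the sketch below compiles the same misspelled assignment twice with string `eval`, once with `strict 'vars'` switched off and once with it on. The variable names and the typo are made up for the example.

```perl
use strict;
use warnings;

# Without strict 'vars', the misspelled $txet silently becomes a new global,
# and $text keeps its unsanitised value.
my $without_strict = eval q{
    no strict 'vars';
    my $text = 'abcd"efg';
    $txet = 'abcdefg';      # typo for $text
    $text;
};

# With strict on, the same typo is a compile-time error reported in $@,
# and the eval returns undef.
my $with_strict = eval q{
    use strict;
    my $text = 'abcd"efg';
    $txet = 'abcdefg';      # typo for $text
    $text;
};
my $strict_error = $@;

printf "without strict, text is still: '%s'\n", $without_strict;
printf "with strict, compilation fails: %s", $strict_error;
```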