kiat has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I've re-implemented the check user input sub Check input sub - comments please... to make it more flexible. Instead of passing the sanitize sub a list of scalars, I'm now passing it a hash to improve readability and to allow for some flexibility.

$username = sanitize( data => param('username'), # call from CGI form field => 'Username', # field name obligatory => 1, # field is obligatory min => 8, max => 8, regex => '%;&()#\w ', # customised regex ); sub sanitize { # date, field, obligatory, min, max, numeric, regex my %checks = @_; $checks{'data'} =~ s/\s+/ /g; # remove leading and trailing blanks $checks{'data'} =~ s/^\s+//; $checks{'data'} =~ s/\s+$//; # Default regex my $default = '-\@\w. '; $checks{'regex'} = $checks{'regex'} ? $checks{'regex'} : $default; # This part changed by graff my $length = length($checks{'data'}); if (!$length) { # empty string if ($checks{'obligatory'}) { bail_out("$checks{'field'} is obligatory"); } else { return; } } if ($checks{'data'} =~ /([^$checks{'regex'}])/) { bail_out("Bad input."); } if ($checks{'numeric'} and $checks{'data'} =~ /[^\d]/) { bail_out("Non-numeric character(s) in numeric field"); } if ($checks{'min'} and $length < $checks{'min'}) { bail_out("Too long."); } if ($checks{'max'} and $length > $checks{'max'}) { bail_out("Too short"); } return $checks{'data'}; }

All comments are welcomed :)

ps: I know great modules exist to do that sort of stuff. I just want to try and create something simple for my own use. I see it as part of my perl learning journey.

update 16 Dec 2003

Many thanks to all, especially to William G. Davis. My little user input checking code is now in a much better shape.

Replies are listed 'Best First'.
Re: Check user input - reimplementation
by hanenkamp (Pilgrim) on Dec 15, 2003 at 13:39 UTC
Re: Check user input - reimplementation
by Roy Johnson (Monsignor) on Dec 15, 2003 at 18:08 UTC
    $checks{'data'} =~ s/\s+/ /g; # remove leading and trailing blanks $checks{'data'} =~ s/^\s+//; $checks{'data'} =~ s/\s+$//;
    I'd do this in reverse order, so that you aren't substituting in a space that you're later going to remove. That's just a tiny amount of extra work. If the spaces are really spaces and not tabs, you can squeeze them using tr/ //s instead of s/\s+/ /g.
    $checks{'regex'} = $checks{'regex'} ? $checks{'regex'} : $default;
    The common idiom for this is
    $checks{'regex'} ||= $default;
    but be sure you mean that all false values should be replaced with default. You may have meant, instead
    exists($checks{'regex'}) or $checks{'regex'} = $default;
    I note that regex is really expected to be a character class. It would be simple to take a real regex (qr// style) instead.

    The PerlMonk tr/// Advocate
      Thanks, Roy!

      I'd do this in reverse order, so that you aren't substituting in a space that you're later going to remove. That's just a tiny amount of extra work.
      You mean like this?
      $checks{'data'} =~ s/\s+$//; $checks{'data'} =~ s/^\s+//; $checks{'data'} =~ s/\s+/ /g;
      Okay so
      $checks{'regex'} ||= $default;
      is the same as
      $checks{'regex'} = $checks{'regex'} ? $checks{'regex'} : $default;
      But what does the following do?
      exists($checks{'regex'}) or $checks{'regex'} = $default;
        You mean like this?
        Yes.

        A ||= D is, in general, the same as A = A ?  A : D without all the repetition. Each assigns to A itself if A is true, and D if it is not true.

        The exists check is the same as

        $check{'regex'} = $default unless exists($checks{'regex'})
        which is to say, it determines whether there a value has been assigned to $checks{'regex'}, and if not, it assigns the default value. Even if the value is false or undefined, exists returns true.

        The PerlMonk tr/// Advocate
Re: Check user input - reimplementation
by ysth (Canon) on Dec 15, 2003 at 18:08 UTC
    I wouldn't call the parameter regex, since it isn't. Perhaps charclass? Update: It might be worthwhile to provide both a regex and charclass option.
Re: Check user input - reimplementation
by YuckFoo (Abbot) on Dec 15, 2003 at 18:48 UTC
    kiat,

    Your numeric check is pretty weak. I think it should at least allow for a leading '-'. For some ideas for numeric regex checks see perlman: perlfaq4 or do perldoc -q 'is a number' from your command line.

    YuckFoo

      Thanks YuckFoo!

      In the context I'm using it, it's really just checking for a number - like an id thing that's associated with a name. I'll look up the 'is a number' check to see how I can modify that part.

Re: Check user input - reimplementation
by William G. Davis (Friar) on Dec 15, 2003 at 19:03 UTC

    "obligatory"--how about calling it "required" instead? "min", "max,"--minimum and maximum what? How about "minlength" and "maxlength" instead? Another thing I'd recommend is, Instead of having a "regex" parameter you just have a catch-all "value" parameter that can contain a string, a number, or a compiled regex that you compare the actual parameter value with. To distinguish between them, remember that any string containing a regex compiled with the qr// operator has a ref type of "Regexp," e.g., this:

    my $re = qr/some(thing)?/; print ref $re;

    Prints out "Regexp". No, this behavior is NOT documented (at least that I can find), but it's quite useful. So you can do something like this:

    my $matches; if (ref $value eq 'Regexp') { $matches++ if ($param_value =~ $value); } else { $matches++ if ($param_value eq $value); }

    Which compares the parameter value in $param_value against the regex in $value if $value if it contains a compiled regex, or it just compares the string or number in $value against the parameter value in $param_value otherwise.

      Thanks Williams!

      Yes, 'required', 'minlength' and 'maxlength' are all more meaningful than the ones I'm using. Will use yours definitely.

      I'm not so certain about how to change the catching bad characters part from your input on regex. That is,

      if ($checks{'data'} =~ /([^$checks{'regex'}])/) { bail_out("Bad input."); }
      Would appreciate a little more advice :)

      update

      You mean liks this?

      my $default = '-\@\w. '; $checks{'charclass'} ||= $default; my $re = qr/$checks{'charclass'}/is; if ($checks{'data'} =~ /[^$re]/) { bail_out("Bad data"); }

        I wasn't too clear in my original witeup.

        In addition to the "data," "field," "obligatory," "min," and "max" parameters, you have a "regex" parameter that contains a string that contains characters which you put in a character class and then compare against the actual parameter value. This is very limiting, though. Instead of only being able to check if a parameter value contains certain characters, why not extend the sub so you can compare the actual parameter value against entire, complicated regular expressions?

        First, assuming you change the name of the option from "regex" to "value," change this:

        if ($checks{'data'} =~ /([^$checks{'regex'}])/) { bail_out("Bad input."); }

        To this:

        if (defined $checks{'value'}) { if (ref $checks{'value'} eq 'Regexp') { bail_out("Bad input.") unless ($checks{'data'} =~ $checks{'value'}); } else { bail_out("Bad input.") unless ($checks{'data'} eq $checks{'value'}); } }

        Then you'd use the sub as follows:

        $email_address = sanitize( data => param('email'), field => 'Email address', obligatory => 1, min => 9, value => qr/^[a-zA-Z0-9\.]+\@[a-zA-Z0-9\.]+$/, ); $color = sanitize( data => param('color'), field => 'Color', obligatory => 1, min => 3, max => 4, value => qr/^(?:red|green|blue)$/, );

        Notice, now the parameter value is compared against an entire regex as opposed to just a character class. The original example would be rewritten as:

        $username = sanitize( data => param('username'), field => 'Username', obligatory => 1, min => 8, max => 8, value => qr/^[%;&()#\w ]+$/, );

        Yeah, it's a little more typing, but being able to compare parameter values against custom regular expressions is an absolute must. If you know a parameter value can only contain one value, like "on," because of the ref() checking you can just use a string and sidestep the regex engine and its overhead entirely:

        $autosave = sanitize( data => param('autosave'), field => 'Autosave', obligatory => 0, max => 2, value => 'on', # not a regular expression; $checks{'data'} mus +t eq "on" );


        All to often, programmers pick poor names (I'm especially guilty of this), either because they don't put enough thought into them, because they mix their metaphors, or because they come up with creative, long-thought out names that make perfect sense to them but are completely unintuitive to everyone else. To be perfectly honest with you, I think you could have come up with a better name for the sub itself. Your sub checks a value to make sure it's "clean," so you called it sanitize. But sanitize is verb derived from sanitizing, meaning "to clean something" (which makes me wonder why you didn't use the shorter "clean" instead). But your sub doesn't actually "clean" anything. Cleaning would involve converting or changing something, but all your sub does is examine something. The name "sanitize" would be far more appropriate for a sub that, for example, converted potentially dangerous HTML characters like <, >, and & to their HTML entity equivalents and all newline characters to <br>; but even for that, "escape" would probably be more appropriate.

        Maybe you should call the sub "validate_param" or "check_param" (variants on actual names I've used) or, if you really like sanitize, "is_sanitized" instead?