CGI::Taintless - a request for comments

The problem

In Is force_untaint in HTML::Template overkill? I asked for the help of the monastery as to whether a particular form of detainting was worthwhile. Having established that it is, it seems to be that there is a gap in current tools here. I now have a proof of concept, and I wish the further help of the monastery in weighing its worth. I am starting a new thread to improve clarity. To recap the problem, it is a natural paradigm with HTML::Template to have a template like:

    <html>
        <head><title>test.tmpl</title></head>
        <body>
            <TMPL_VAR NAME="form">
            <TMPL_VAR NAME="hidden">
            <submit/>
            </form>
        </body>
    </html>
[download]

We will then use the CGI object to fill in the template variables. The issue is that the values are derived from the CGI input parameters and so are tainted. I think the use of HTML::Template here makes the issues slightly harder to see so instead I will work off this test script.

#!/usr/bin/perl -wT
# test3a.pl
use CGI;
use strict;
use Test::Simple tests => 11;
use Scalar::Util qw(tainted);
my $c = CGI->new();

pr($c->header());
pr($c->start_html());
pr("print world");
pr($c->param());
pr($c->param("blah"));
pr($c->hidden(-name=>"blah"));
pr($c->start_form());
pr($c->self_url());
pr($c->submit());
pr($c->end_form());
pr($c->end_html());

sub pr {
    my @thingy = shift;
    foreach my $t (@thingy) {
        ok(!tainted($t), $t);
    }
}
[download]

If you now run perl -T test3a.pl it runs without error. But if you run perl -T test3a.pl blah=hello you get several errors as follows:

C:\Users\SilasTheMonk\Downloads\Documents\paranoia>perl -T test3a.pl b
+lah=hello
1..11
ok 1 - Content-Type: text/html; charset=ISO-8859-1
#
#
ok 2 - <!DOCTYPE html
#       PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
#        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
# <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en
+-US">
# <head>
# <title>Untitled Document</title>
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859
+-1" />
# </head>
# <body>
#
ok 3 - print world
not ok 4 - blah
#   Failed test 'blah'
#   at test3a.pl line 23.
not ok 5 - hello
#   Failed test 'hello'
#   at test3a.pl line 23.
not ok 6 - <input type="hidden" name="blah" value="hello"  />
#   Failed test '<input type="hidden" name="blah" value="hello"  />'
#   at test3a.pl line 23.
not ok 7 - <form method="post" action="http://localhost?blah=hello" en
+ctype=
tipart/form-data">
#
#   Failed test '<form method="post" action="http://localhost?blah=hel
+lo" en
e="multipart/form-data">
# '
#   at test3a.pl line 23.
not ok 8 - http://localhost?blah=hello
#   Failed test 'http://localhost?blah=hello'
#   at test3a.pl line 23.
ok 9 - <input type="submit" name=".submit" />
ok 10 - </form>
ok 11 -
# </body>
# </html>
# Looks like you failed 5 tests of 11.
[download]

What I think we should have is an easy way of detainting the outputs from the CGI module. The detainting needs to be easy and reliable so that it will actually be used. Often it will be best if it is customized to the particular web-site. The CGI::Untaint module is some help but fails in the following respects:

Quite a few moving parts are required to customize a regular expression using CGI::Untaint. You need to create a module inheriting from, say, CGI::Untaint::object and configure the INCLUDE_PATH to your local repository of such modules.
My problem with the templating example is that it is all very well, getting the paramters, detainting them and using them. When you call a function like start_form, the CGI module gets the paramters BUT DOES NOT DETAINT them. The output from a function like start_form or self_url is much harder to parse. CGI::Untaint does not obviously help with this.

Where the CGI::Untaint does help is that it provides a library of detainting recipes, and some of the targets are tricky so it is better to have it done once correctly.

My proposal

package CGI::Taintless;

use vars qw(@ISA);

sub new {
    my $class = shift;

    # inherit from a CGI hash-based object, to which we default many C
+GI operations
    my $self = shift; # must be something conforming to CGI interface
    my $uclass = ref($self);
    @ISA = ($uclass);

    die "I am going in circles" if $uclass eq $class;

    $self->{__Taintless_taint_handlers} = shift || {}; # must be a par
+am => taint handler mapping
    my $max_param_len = shift || 10;
    $self->{__Taintless_param_check} = "^\(\[\\w\\\_\]\{1\,$max_param_
+len\}\)\$";

    bless $self, $class;
    return $self;
}

sub param {
    my $self = shift;
    if (scalar(@_) == 0) {
        # must only allow alphanumeric parameters for which we have ta
+int handlers
        my @params = $self->SUPER::param();
        my @filtered = ();
        foreach my $p (@params) {
            if ($self->get_re($p)) {
                push @filtered, $1 if $p =~ /$self->{__Taintless_param
+_check}/;
            }
        }
        return @filtered;
    }
    elsif (scalar(@_) == 1) { # will be tainted
        my $p = shift;
        my $v = $self->SUPER::param($p);
        return undef unless defined($v); # need this line to deal with
+ the .cgifields parameter
        my $re = $self->get_re($p) || die "Cannot find taint handler f
+or $p";
        return $1 if $v =~ /$re/;
        die "$p does not pass taint check"
    }
    else {
        $self->SUPER::param(@_);
    }

}

sub get_re {
    my $self = shift;
    my $p = shift;
    if (exists $self->{__Taintless_taint_handlers}->{$p}) {
        return $self->{__Taintless_taint_handlers}->{$p};
    }
    elsif (exists $self->{__Taintless_taint_handlers}->{-DEFAULT_HANDL
+ER}) {
        return $self->{__Taintless_taint_handlers}->{-DEFAULT_HANDLER}
+;
    }
    return undef;
}

1
[download]

I tried this out with the following script:

#!/usr/bin/perl -wT
# test3.pl
use CGI;
use Test::Simple tests => 11;
use Scalar::Util qw(tainted);
use lib qw(...........); # set this as appropriate.
use CGI::Taintless;
my $q = CGI->new();
my $c = CGI::Taintless->new($q, {blah=>'^([helo]+)$'});

pr($c->header());
pr($c->start_html());
pr("print world");
pr($c->param());
pr($c->param("blah"));
pr($c->hidden(-name=>"blah"));
pr($c->start_form());
pr($c->self_url());
pr($c->submit());
pr($c->end_form());
pr($c->end_html());

sub pr {
    my @thingy = shift;
    foreach my $t (@thingy) {
        ok(!tainted($t), $t);
    }
}
[download]

Then all tests are passed.

My questions

Do Monks agree that this is a good approach and if not why not?
Can anyone think of any case or situation that will not be covered?
What would be the best way of supporting common detainting requirements?

Comment on RFC: CGI::Taintless Select or Download Code

Replies are listed 'Best First'.
Re: RFC: CGI::Taintless by moritz (Cardinal) on Sep 23, 2008 at 07:35 UTC
`$self->{__Taintless_param_check} = "^$\[\\w\\\_\]\{1\,$max_param_len\}$\$";` I think it's hard to get the number of backslashes right. Maybe you want to use `qr{...}` instead? Also `$` in a regex allows a trailing newline before the end of the string - is this what you want? If not, use `\z` instead. Somehow I think that `CGI::Taintless` isn't a very good name at all. Your focus should be on validating CGI params, with the side effect of untainting them. Maybe something along the lines of `CGI::RegexValidate` might be more appropriate as a name? Note that passing unvalidated CGI params to the output is a security risk, because it allows cross-site-scripting - unless you escape it. So the tests in your sample program fail for a good reason. Usually I take the approach of trying to validate as little as possible, because I don't know how to validate. If there's a field for the real name of a user, how do you validate that? And if you do it, are you sure that all possible names in all human languages (of which there are quite many) are actually accepted? How do you test that? Instead I use placeholders for my DB queries, and set the `default_escape` option of HTML::Template (or HTML::Template::Compiled) to `html`, and care as little as possible about the contents of the fields. So for me this module makes only very little sense, but of course I can't speak for everyone, and I can well imagine that other perl hackers might find it useful.	[reply] [d/l] [select]
Re^2: RFC: CGI::Taintless by blazar (Canon) on Sep 24, 2008 at 13:32 UTC
Somehow I think that `CGI::Taintless` isn't a very good name at all. Your focus should be on validating CGI params, with the side effect of untainting them. Maybe something along the lines of `CGI::RegexValidate` might be more appropriate as a name? I personally believe that validating with regexen is fine and all, but the OP may want to extend his module to a more generic validating one allowing e.g. subs, arrayrefs (possibly to be matched against `~~`) and whatever. Then I do know that actual untainting is performed by means of regexen, but then indeed it could be done so anyway, if those other "objects" match. Then he could name his module `CGI::Validate::Untaint`. PS: [OT] I lost track of the `~~`'s return value in list context issue, which you brought up to p5p: how did it end? `--` If you can't understand the incipit, then please check the IPB Campaign.	[reply] [d/l] [select]
Re^3: RFC: CGI::Taintless (~~ in scalar context) by moritz (Cardinal) on Sep 24, 2008 at 14:31 UTC
I lost track of the ~~'s return value in list context issue, which you brought up to p5p: how did it end? The answer in short was "we don't know if it's a bug or a feature because we don't know what ~~ should really do". (Not literally, though). The fine p5p hackers long wanted to review smartmatch semantics, and this question encouraged them to actually do it. aristotle is now investigating how the Perl 6 smartmatch semantics changed since the perl 5 version was implemented, and plans to give p5p some feedback. Then they (or we) will try to decide what the best semantics for Perl 5 are.	[reply]
Re: RFC: CGI::Taintless by Anonymous Monk on Sep 23, 2008 at 09:18 UTC
Using your approach I would prefer this Read more... (4 kB) Now param will return untainted values, as will start_form ... But I'd rather stick with CGI::FormBuilder and/or Data::FormValidator. Since your module seems to be heading the same direction you should see proposal for HTML::FormValidator upgrade to get an idea of how FormValidator came about, you might reconsider starting from scratch :)	[reply] [d/l]
Re^2: RFC: CGI::Taintless by SilasTheMonk (Chaplain) on Sep 23, 2008 at 20:30 UTC
Lots to think about there. Thanks guys. Update 12 Oct 2008 I have been busy with other things recently - and those busy things are making this issue ever more pressing. I can find no module which allows me to choose cheaply between leveraging the wisdom of others and customizing something to my special circumstances, and also functions as a drop in replacement for CGI.pm. As such I think I am very likely to take this further.	[reply]

Replies are listed 'Best First'.
Re: RFC: CGI::Taintless by moritz (Cardinal) on Sep 23, 2008 at 07:35 UTC
`$self->{__Taintless_param_check} = "^\(\[\\w\\\_\]\{1\,$max_param_len\}\)\$";` I think it's hard to get the number of backslashes right. Maybe you want to use `qr{...}` instead? Also `$` in a regex allows a trailing newline before the end of the string - is this what you want? If not, use `\z` instead. Somehow I think that `CGI::Taintless` isn't a very good name at all. Your focus should be on validating CGI params, with the side effect of untainting them. Maybe something along the lines of `CGI::RegexValidate` might be more appropriate as a name? Note that passing unvalidated CGI params to the output is a security risk, because it allows cross-site-scripting - unless you escape it. So the tests in your sample program fail for a good reason. Usually I take the approach of trying to validate as little as possible, because I don't know how to validate. If there's a field for the real name of a user, how do you validate that? And if you do it, are you sure that all possible names in all human languages (of which there are quite many) are actually accepted? How do you test that? Instead I use placeholders for my DB queries, and set the `default_escape` option of HTML::Template (or HTML::Template::Compiled) to `html`, and care as little as possible about the contents of the fields. So for me this module makes only very little sense, but of course I can't speak for everyone, and I can well imagine that other perl hackers might find it useful.	[reply] [d/l] [select]
Re^2: RFC: CGI::Taintless by blazar (Canon) on Sep 24, 2008 at 13:32 UTC
Somehow I think that `CGI::Taintless` isn't a very good name at all. Your focus should be on validating CGI params, with the side effect of untainting them. Maybe something along the lines of `CGI::RegexValidate` might be more appropriate as a name? I personally believe that validating with regexen is fine and all, but the OP may want to extend his module to a more generic validating one allowing e.g. subs, arrayrefs (possibly to be matched against `~~`) and whatever. Then I do know that actual untainting is performed by means of regexen, but then indeed it could be done so anyway, if those other "objects" match. Then he could name his module `CGI::Validate::Untaint`. PS: [OT] I lost track of the `~~`'s return value in list context issue, which you brought up to p5p: how did it end? `--` If you can't understand the incipit, then please check the IPB Campaign.	[reply] [d/l] [select]
Re^3: RFC: CGI::Taintless (~~ in scalar context) by moritz (Cardinal) on Sep 24, 2008 at 14:31 UTC
I lost track of the ~~'s return value in list context issue, which you brought up to p5p: how did it end? The answer in short was "we don't know if it's a bug or a feature because we don't know what ~~ should really do". (Not literally, though). The fine p5p hackers long wanted to review smartmatch semantics, and this question encouraged them to actually do it. aristotle is now investigating how the Perl 6 smartmatch semantics changed since the perl 5 version was implemented, and plans to give p5p some feedback. Then they (or we) will try to decide what the best semantics for Perl 5 are.	[reply]
Re: RFC: CGI::Taintless by Anonymous Monk on Sep 23, 2008 at 09:18 UTC
Using your approach I would prefer this Read more... (4 kB) Now param will return untainted values, as will start_form ... But I'd rather stick with CGI::FormBuilder and/or Data::FormValidator. Since your module seems to be heading the same direction you should see proposal for HTML::FormValidator upgrade to get an idea of how FormValidator came about, you might reconsider starting from scratch :)	[reply] [d/l]
Re^2: RFC: CGI::Taintless by SilasTheMonk (Chaplain) on Sep 23, 2008 at 20:30 UTC
Lots to think about there. Thanks guys. Update 12 Oct 2008 I have been busy with other things recently - and those busy things are making this issue ever more pressing. I can find no module which allows me to choose cheaply between leveraging the wisdom of others and customizing something to my special circumstances, and also functions as a drop in replacement for CGI.pm. As such I think I am very likely to take this further.	[reply]

CGI::Taintless - a request for comments

The problem

My proposal

My questions

Update 12 Oct 2008