Common untainting methods?

Wally Hartshorn has asked for the wisdom of the Perl Monks concerning the following question:

Recently I wrote a small Perl program to allow people to submit information that would be stored in a MySQL database. Not long after it was put into production, I was told that it wasn't accepting one person's input. I got a copy of the input, tried it out, and quickly saw what the problem was.

The input form has a text box where the submitter is allowed to enter free-form comments. In this particular case, the user was entering multiple paragraphs. The little untainting routine I had written for the program neglected to allow \r and \n in the input. So I added those to the regex as allowed characters and tested it out.

That allowed the multiple paragraphs, but the input, because it had been copied and pasted from a Microsoft Word document, contained some special characters -- single open quotes, single close quotes, emdashes, etc. My regex hadn't taken those into account. So I decided to add those to the regex as allowed characters.

As I was doing this, I figured I ought to add the other likely special characters -- copyright symbols, trademark symbols, ellipses, etc. But as I was doing this, I began to wonder:

What characters am I really supposed to be allowing/excluding?

I'm not taking their input and passing it to a system() command for execution. I'm just taking it and passing it to MySQL (actually Class::DBI) for entering into the database. I know that there are ways to exploit this type of situation to do other database commands, but I don't know how those work, so I don't know what I need to prevent.

When Perl gurus are asked "how do I untaint stuff", they generally answer with "it depends". I understand that, but it seems like there ought to be some common ways of untainting input data in common situations -- e.g. "do this before sending something to a MySQL database" and "do this before using something as an email address".

Are there any such standard methods or do I really have to reinvent the wheel (after spending time researching each particular road) every single time? Failing that, could someone at least tell me whether the following is exceptionally stupid for data that will go into a MySQL text field (doing all database access via Class::DBI)?

    # Keep only the following characters
    #
    # [:print:]   printable characters
    # \n          end-of-line characters
    # \x85        MS Word ellipsis
    # \x91        MS Word single opening quote
    # \x92        MS Word single closing quote
    # \x93        MS Word double opening quote
    # \x94        MS Word double closing quote
    # \x96        MS Word endash
    # \x97        MS Word emdash
    # \x99        MS Word trademark symbol
    # \xA7        MS Word section symbol
    # \xA9        MS Word copyright symbol
    # \xAE        MS Word registered symbol

    $freeformtext =~ s/
        [^[:print:]\n\x85\x91\x92\x93\x94\x96\x97\x99\xA7\xA9\xAE]
    //gx;

Thanks in advance (as I prepare to be told "it depends"). :-)

Update: Replaced code snippet with something that's actually valid (although not necessarily correct). :-)

Wally Hartshorn

(Plug: Visit JavaJunkies, PerlMonks for Java)

Replies are listed 'Best First'.
Re: Common untainting methods? by Abigail-II (Bishop) on Nov 25, 2003 at 23:09 UTC
> When Perl gurus are asked "how do I untaint stuff", they generally answer with "it depends". I understand that, but it seems like there ought to be some common ways of untainting input data in common situations -- e.g. "do this before sending something to a MySQL database" and "do this before using something as an email address".

Well, the answer is: it depends. For inserting it in MySQL, it depends on two things. First, how are you inputting the data into the database? If you are using placeholders, you shouldn't have any problems with the insert itself, so you can allow anything. Second, what are you going to do with the data afterwards? If the data in the database is supposed to be trustworthy, you may, or may not, have to filter out characters, or substrings, depending on what you are going to do with it.

It's the same for email addresses. Email addresses themselves aren't dangerous, not even incorrect ones. But they may become dangerous depending on how you use them, even legal email addresses.

To decide how to properly untaint data, what the data is (or isn't) is not relevant; what matters is how you are going to use the data.

Abigail
Re: Common untainting methods? by tachyon (Chancellor) on Nov 25, 2003 at 23:58 UTC
You can generally store any arbitrary binary data you want in a DB (but that of course depends ;-). Anyway, to ensure the insert does not fail you need to quote some characters. With Perl and DBI, all you actually need to do is use placeholders, i.e.:

    my $data = "some 'arbitrary' data.....";
    my $more_data = "\007\000\007";
    $sth = $dbh->prepare( 'INSERT INTO table (col1, col2) VALUES (?,?)' );
    $sth->execute( $data, $more_data );

The next "it depends" comes from what you plan on doing with the data when you RETRIEVE it from the DB. If you are going to do `open $data; eval $data; system $data; etc` then you do need to untaint it. However, whether you untaint it on DB insertion or not, I would personally still redo the untaint prior to use; that way, if someone corrupts your DB data, it will not cause you undue grief.

cheers

tachyon
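As a rough sketch of "redo the untaint prior to use", suppose a stored value is about to be used as a file name. The helper `safe_filename` and its allowed-character policy are assumptions made up for this illustration, not anything from the original post; the list of strings stands in for rows read back from the database.

```perl
use strict;
use warnings;

# Hypothetical check for a value that is about to be used as a file name:
# accept a plain base name only, with no directory parts or shell characters.
sub safe_filename {
    my ($candidate) = @_;
    return undef unless defined $candidate;
    my ($name) = $candidate =~ /\A([A-Za-z0-9_][A-Za-z0-9_.-]{0,63})\z/;
    return $name;    # undef when the value does not match the policy
}

# Stand-ins for values read back from the database.
for my $stored ('report.txt', '../../etc/passwd', 'x; rm -rf /') {
    my $name = safe_filename($stored);
    print defined($name) ? "ok: $name\n" : "refused: $stored\n";
}
```

The check runs at the point of use, so it still protects the `open` even if the row was altered after insertion.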
Re: Common untainting methods? by sgifford (Prior) on Nov 26, 2003 at 06:58 UTC
An in-between step would be to allow completely safe characters, disallow or escape completely unsafe characters, and for all others just delete them from the input. That's pretty close to the right thing to do for a lot of weird input characters.

You said in your post you don't understand what you're protecting against. Here's a little bit of the flavor of SQL injection attacks. At the most fundamental level, you're trying to prevent a statement like this:

    my $sql = "SELECT * FROM table WHERE NAME=$name";

from becoming nasty if the user enters something like `Scott; DELETE FROM table` for their name. So, you change that to:

    my $sql = "SELECT * FROM table WHERE NAME='$name'";

That works for our simple case, but now the user can enter their name as `Tom'; DELETE FROM table; SELECT * FROM table WHERE NAME='Bob`, which will result in the SQL statement:

    SELECT * FROM table WHERE NAME='Tom'; DELETE FROM table; SELECT * FROM table WHERE NAME='Bob'

So now you have to escape quotes, which you can do with the `$dbh->quote` function, or by using placeholders as others have described. Other characters that are dangerous to your database will depend on the database, but unless your DBD driver is really crappy, `$dbh->quote` and placeholders should both be safe.

The remaining dangers, then, depend on what you do with the data. If you're displaying it on a Web page, you want to make sure it doesn't contain HTML tags, particularly JavaScript code. If you're using it to send an email, you want to make sure it doesn't have any characters special to the mailing program (for example sending a `~` to `/bin/mail`, the source of the security bug in setuidperl IIRC).

One school of thought says all data in the DB should be trustworthy, so you should make sure it doesn't have anything dangerous for any application.

Another school of thought says put whatever you want in the DB, and the application using it is responsible for making sure it untaints it on the way out. You need to make sure that you treat data from the DB as tainted in this case.

The most paranoid school of thought says you should do both: stop characters that are likely to be dangerous from getting into the DB, and have applications using the data check to make sure the data really is safe. That last one is what I usually try to do.
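To make the quoting problem concrete without a database attached, here is a small sketch that only builds the SQL strings. `naive_quote` is a hypothetical helper written for this illustration: it doubles single quotes, which is the basic idea behind quoting a string literal. It is not a substitute for `$dbh->quote` or placeholders, since a driver may need to escape more than that for its database.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: double each single quote and
# wrap the result in quotes. Real code should use $dbh->quote or placeholders.
sub naive_quote {
    my ($value) = @_;
    (my $escaped = $value) =~ s/'/''/g;
    return "'$escaped'";
}

my $name = "Tom'; DELETE FROM table; SELECT * FROM table WHERE NAME='Bob";

# Plain interpolation: the user's quote ends the literal early.
my $unsafe = "SELECT * FROM table WHERE NAME='$name'";

# Quoted: the whole input stays inside one string literal.
my $safer = "SELECT * FROM table WHERE NAME=" . naive_quote($name);

print "$unsafe\n";
print "$safer\n";
```

In the second statement every quote typed by the user is doubled, so the database reads the entire input as one odd-looking name instead of three statements.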
Re: Common untainting methods? by Anonymous Monk on Nov 26, 2003 at 04:13 UTC
> When Perl gurus are asked "how do I untaint stuff", they generally answer with "it depends".

Well, it doesn't "depend". There is only one (correct) way to untaint data, and that is by matching it:

    my $tainted = $ENV{PATH};
    my ($untainted) = $tainted =~ /^(.*)$/;

What you allow to match is the part that "depends" (it's called data validation). In your case, it doesn't look like you need to validate data at all (you may need to escape it if you're gonna display it via html, but you should allow the user to enter everything).

PS - on a sidenote, you can untaint values like

    my $tainted = $ENV{PATH};
    my ($untainted) = ( keys %{ { $ENV{PATH} => undef } } );

but you cannot rely on that behaviour.
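A slightly fuller sketch of untainting by matching, where the pattern actually validates something. The helper name `untaint_path_like` and its allowed-character set are assumptions chosen for this example. The script runs with or without `-T`; without it nothing is tainted, so `tainted()` from the core `Scalar::Util` module simply reports false throughout.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical policy: word characters, slashes, colons, dots and hyphens only.
# A value captured by a successful match is untainted; a failed match gives undef.
sub untaint_path_like {
    my ($value) = @_;
    my ($clean) = $value =~ m{\A([\w/:.\-]*)\z};
    return $clean;
}

# Run as `perl -T script.pl` to see real tainting; ${^TAINT} reports the mode.
print "taint mode is ", (${^TAINT} ? "on" : "off"), "\n";

for my $input ('/usr/bin:/bin', '/bin; rm -rf /') {
    my $clean = untaint_path_like($input);
    if (defined $clean) {
        print "accepted: $clean (tainted? ", (tainted($clean) ? "yes" : "no"), ")\n";
    }
    else {
        print "rejected: $input\n";
    }
}
```

The mechanism is the same as matching `(.*)`; the difference is that this pattern can fail, which is where any security benefit comes from.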
Re: Re: Common untainting methods? by sgifford (Prior) on Nov 26, 2003 at 06:39 UTC
That's the correct mechanism for untainting data, but if you actually want to get any of the security benefits of taint checking, you'd need to do better than that.
Re: Re: Re: Common untainting methods? by Anonymous Monk on Nov 26, 2003 at 17:06 UTC
Right, but in this case, he doesn't need to untaint data at all (escape ne untaint).
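To illustrate the escape-versus-untaint distinction: escaping leaves the stored text alone and only transforms it at the point of output. The sketch below uses a hand-rolled `escape_html`, a hypothetical helper written for this example; the CPAN module HTML::Entities offers `encode_entities` for the same job.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the characters that are special in HTML text
# and attribute values. A single pass avoids double-escaping the ampersands.
sub escape_html {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# The comment is stored exactly as the user typed it and escaped only on display.
my $comment = q{<script>alert("hi")</script> & more};
print escape_html($comment), "\n";
```

Nothing is rejected or deleted here, so Word's quotes and dashes from the original question pass through untouched.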