fireartist has asked for the wisdom of the Perl Monks concerning the following question:

I use taint mode in all of my CGI programs and am starting to wonder if I'm being too restrictive in some cases.
Generally when 'free text' input is required, I use a regex to ensure it matches \w and a small number of punctuation characters, and substitute line-breaks with <br>'s.
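
A minimal sketch of the whitelist approach described above, for ASCII-only input. The helper name `untaint_free_text` and the exact punctuation set are assumptions made for illustration, not something taken from a real application:

```perl
use strict;
use warnings;

# Hypothetical helper: returns an untainted copy of $text if it contains
# only ASCII word characters, whitespace, and a few punctuation marks.
# Returns undef otherwise. The punctuation set is an assumption.
sub untaint_free_text {
    my ($text) = @_;
    return undef unless defined $text;

    # Under -T, only data extracted through a regex capture is untainted,
    # so the whole string is anchored and captured in one group.
    my ($clean) = $text =~ /\A([A-Za-z0-9_\s.,;:!?'"()-]+)\z/
        or return undef;

    return $clean;
}
```

Anything outside the listed characters, including the null byte and angle brackets, makes the helper return undef, so the caller can reject the submission instead of silently stripping characters.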

The data I'm talking about here is stuff that will be getting stuffed into a database (using placeholders) and getting displayed again as HTML (going through CGI's escapeHTML method); it will not be used as a filename, sent to system calls, etc.

I'm now in the position of wanting to allow similarly 'free text' UNICODE input and I don't know realistically what to allow.
I'm quite tempted to allow anything other than the null byte, which is the only thing I can think of that might mess up either the database insertion or the HTML display.
However, I've always practiced making sure the data contains only what I do want to allow, not what I don't.

I've super searched for "taint unicode" and haven't found anything that really helps.
I've read the core perl unicode docs and understand how to untaint using unicode character classes.
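
For reference, a rough sketch of what untainting with Unicode character classes could look like. The helper name `untaint_unicode_text`, the assumption that the form data arrives as UTF-8 octets, and the particular set of allowed properties are all illustrative choices, not an established recipe:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: $octets is assumed to hold raw UTF-8 bytes from a form.
# Returns an untainted character string, or undef if the input is malformed
# or contains characters outside the allowed classes.
sub untaint_unicode_text {
    my ($octets) = @_;
    return undef unless defined $octets;

    # Decode first so that \p{...} matches characters rather than bytes.
    # FB_CROAK makes decode() die on malformed input; eval traps that.
    my $chars = eval {
        decode( 'UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return undef unless defined $chars;

    # Allow letters, combining marks, numbers, punctuation, symbols,
    # space separators, plus tab, CR and LF. Control characters (including
    # the null byte), private-use and unassigned code points do not match.
    my ($clean) = $chars =~ /\A([\p{L}\p{M}\p{N}\p{P}\p{S}\p{Zs}\t\r\n]+)\z/
        or return undef;

    return $clean;
}
```

Note that this set still lets through characters such as `<`, `>` and `&`, so it relies on the placeholders and escapeHTML steps mentioned earlier for safe storage and display.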

Can anyone give me some advice or real-world examples?
Does perlmonks.org use taint mode and how does it untaint the Seekers of Perl Wisdom "Your question" input?

Re: Untainting text / unicode text
by graff (Chancellor) on Jun 02, 2004 at 04:11 UTC
    You need to be a little more clear (at least when you give instructions to your web clients) about which form of unicode you intend to support. Overall, utf8 will be best, and probably the easiest, surest way to validate it would be to use the Encode module (in Perl 5.8.x) -- something like this:
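        # assume that "$octets" is the string that has been received
        # from a form, and is purported to be utf8 text:
        ...
        use Encode;
        ...
        my $utf8str;
        # decode() (not encode()) is what checks incoming octets; with
        # FB_CROAK it dies on malformed input, and LEAVE_SRC keeps
        # $octets from being modified in place
        eval { $utf8str = decode( 'utf8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC ) };
        if ( $@ ) {
            # $octets was not really a valid utf8 string
        }
        ...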
    Of course, if you'd rather accept some other form of unicode, such as UTF-16LE or UTF-16BE, just put one of those names in place of 'utf8' above. (Note that the UTF-16 encodings do contain null bytes when conveying characters in the normal ASCII/Latin1 range, U+0000 - U+00FF.) But just stick with utf8 -- fewer traps.
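
    The point about null bytes can be checked with a small script. This is only an illustration using the core Encode module; the sample string "Perl" is an arbitrary choice:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "Perl";

# Encode the same four ASCII characters two ways.
my $utf8    = encode( 'UTF-8',    $text );
my $utf16le = encode( 'UTF-16LE', $text );

# Count the null bytes in each encoded form.
my $nulls_utf8    = () = $utf8    =~ /\0/g;
my $nulls_utf16le = () = $utf16le =~ /\0/g;

printf "UTF-8:    %d bytes, %d null\n", length($utf8),    $nulls_utf8;
printf "UTF-16LE: %d bytes, %d null\n", length($utf16le), $nulls_utf16le;
```

    Each ASCII character takes two bytes in UTF-16LE, the second of which is a null, so a "reject anything with a null byte" rule would throw out every ordinary UTF-16 submission.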

    Since you're not really doing anything "risky" with the text, just the utf8 validation should be a sufficient safeguard -- and it is important to do this, if you want people to post their content in a consistent, meaningful, usable form.

      Yes, I've still to test whether setting the HTML page charset to utf8 causes typed input to be utf8.
      I really hope it does...
        Well, depending on the language involved, keyboarding could be a serious and unavoidable problem... I don't have a clue what browsers do in terms of supporting i18n at the keyboard.

        Even doing a copy/paste of text that has been composed/displayed in some other app/window could be dicey -- I haven't been there or done that enough times, using anything besides English, to be confident about it.