tinita has asked for the wisdom of the Perl Monks concerning the following question:

hello monks,

today i realized that calling utf8::upgrade() on tainted variables untaints them. is that intended behaviour? should i use Taint to taint the variables again?
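One way to check this on a given perl is a small sketch like the one below. It is only an illustration: `taint_report` is a hypothetical helper name, and the result depends on the perl version, so no particular output is claimed here. It assumes the script is started with `perl -T`, since without taint mode nothing is ever reported as tainted.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical helper: reports whether a string is tainted before and
# after utf8::upgrade() is applied to a private copy of it.
sub taint_report {
    my ($string) = @_;                       # copy; a copy keeps the taint flag
    my $before = tainted($string) ? 1 : 0;
    utf8::upgrade($string);                  # core function, no "use utf8" needed
    my $after  = tainted($string) ? 1 : 0;
    return ( $before, $after );
}

# Environment values are tainted when perl runs with -T.
my $input = defined $ENV{PATH} ? $ENV{PATH} : '';
my ( $before, $after ) = taint_report($input);

printf "taint mode:     %s\n", ${^TAINT} ? 'on' : 'off';
printf "before upgrade: %s\n", $before ? 'tainted' : 'not tainted';
printf "after upgrade:  %s\n", $after  ? 'tainted' : 'not tainted';
```

Run it as `perl -T script.pl`. If "before" says tainted and "after" says not tainted, that perl shows the behaviour described above.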

Replies are listed 'Best First'.
Re: utf8::upgrade is untainting
by RMGir (Prior) on Aug 15, 2007 at 09:30 UTC
    It's certainly not documented behaviour. You might want to perlbug it - they can either document it or fix it.

    It's not _surprising_ behaviour, though - it's hard to see how utf8::upgrade could do its job without a regex or 2...


    Mike
Re: utf8::upgrade is untainting
by graff (Chancellor) on Aug 16, 2007 at 05:48 UTC
    I'm wondering why you would decide to use utf8::upgrade() on tainted strings -- as opposed to using Encode::decode(), which offers much better controls for handling malformed character data. (If the data is tainted, how can you assume that it can always be treated as valid unicode text?)

    If you are doing taint checking at all, and you need to convert a tainted string to utf8 (or validate and flag it as a utf8 string), it would seem much more sensible to handle it like this:

        use Encode;

        my $rawstring = ...; # coming from a cgi param or whatever
        my $utfstring;
        eval { $utfstring = decode( "utf8", $rawstring, Encode::FB_CROAK ) };
        if ( $@ ) {
            # do something sensible given that $rawstring is invalid
            # (i.e. cannot be converted successfully to utf8)
        }

    If your expected input (i.e. the tainted data that should pose no difficulty for proper untainting) is not a utf8 octet stream, then all the more reason to use Encode, because as perldoc utf8 says:
    Note that this function does not handle arbitrary encodings. Therefore Encode.pm is recommended for the general purposes.

    (emphasis in the original) But even if well-behaved input is expected to be utf8 octets, the fact that it's tainted means "don't count on that!"

    In other words, don't use utf8::upgrade() on tainted strings. Period.
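As a rough, self-contained illustration of the Encode approach: the helper name `decode_utf8_or_undef` and the two sample byte strings are made up for this sketch, and it uses the strict "UTF-8" encoding name rather than the lax "utf8" from the snippet above. Whether the decoded result stays tainted is not something this sketch checks.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded character string, or undef
# if the octets are not well-formed UTF-8.
sub decode_utf8_or_undef {
    my ($octets) = @_;    # private copy; decode() may modify its input
                          # when a CHECK value such as FB_CROAK is given
    my $text = eval { Encode::decode( 'UTF-8', $octets, Encode::FB_CROAK ) };
    return $text;         # undef if decode() died on malformed input
}

# First sample is valid UTF-8; the second contains 0xFF, which can never
# appear in well-formed UTF-8.
for my $octets ( "caf\xC3\xA9", "caf\xFF" ) {
    my $text = decode_utf8_or_undef($octets);
    if ( defined $text ) {
        print "valid UTF-8, ", length($text), " characters\n";
    }
    else {
        print "invalid UTF-8, rejecting input\n";
    }
}
```

Returning undef keeps the "do something sensible" decision with the caller, who can reject the request, log it, or fall back to another encoding.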

    (update: added the "perldoc utf8" link, to clarify the source of the quotation)

      I'm wondering why you would decide to use utf8::upgrade() on tainted strings
      it's not my code; it's in a framework i'm using. i wanted to find out what happens if i add -T to my scripts, and nothing happened. searching for the reason, i stumbled over this upgrade().
      so thanks for your comments, i'll check if Encode could be used instead of utf8::upgrade.