I'm wondering why you would decide to use utf8::upgrade() on tainted strings -- as opposed to using Encode::decode(), which gives you much better control over how malformed character data is handled. (If the data is tainted, how can you assume that it can always be treated as valid Unicode text?)
If you are doing taint checking at all, and you need to convert a tainted string to utf8 (or validate and flag it as a utf8 string), it would seem much more sensible to handle it like this:
use Encode; # exports decode(); also provides Encode::FB_CROAK
my $rawstring = ...; # coming from a cgi param or whatever
my $utfstring;
eval { $utfstring = decode( "utf8", $rawstring, Encode::FB_CROAK ) };
if ( $@ ) {
# do something sensible given that $rawstring is invalid
# (i.e. cannot be converted successfully to utf8)
}
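For what it's worth, here is a self-contained sketch of the same idea that you can run as-is. The sub name decode_or_undef and the sample strings are just my inventions for illustration. I'm also assuming you want the strict "UTF-8" encoding name rather than the laxer "utf8", and I've added Encode::LEAVE_SRC so that decode() does not modify the input string in place.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): returns the decoded
# character string, or undef if the octets are not well-formed UTF-8.
sub decode_or_undef {
    my ($octets) = @_;
    my $chars = eval {
        decode( "UTF-8", $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $chars;    # undef when decode() died on malformed input
}

my $good = decode_or_undef("caf\xC3\xA9");       # well-formed UTF-8 octets
my $bad  = decode_or_undef("caf\xE9 au lait");   # bare 0xE9 is malformed UTF-8

print defined $good ? "good: accepted\n" : "good: rejected\n";
print defined $bad  ? "bad: accepted\n"  : "bad: rejected\n";
```

Note that this only validates and decodes; untainting the result is still a separate step that depends on what your application considers acceptable input.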
If your expected input (i.e. the tainted data that should pose no difficulty for proper untainting) is not a utf8 octet stream, then all the more reason to use Encode, because as perldoc utf8 says:
Note that this function does not handle arbitrary encodings. Therefore Encode.pm is recommended for the general purposes.
(emphasis in the original) But even if well-behaved input is expected to be utf8 octets, the fact that it's tainted means "don't count on that!"
In other words, don't use utf8::upgrade() on tainted strings. Period.
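To make the difference concrete, here is a small sketch (the sample octets, and the choice of ISO-8859-1 as the "other" encoding, are my assumptions for the sake of the example). utf8::upgrade() only changes perl's internal representation of the string and never interprets the octets as UTF-8, whereas decode() actually interprets them according to the encoding you name.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $check  = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $octets = "\xC3\xA9";    # the two UTF-8 octets that encode U+00E9

# utf8::upgrade() leaves the logical string alone: it is still the
# two characters U+00C3 and U+00A9, just stored differently.
my $upgraded = $octets;
utf8::upgrade($upgraded);

# decode() reads the octets as UTF-8 and yields one character, U+00E9.
my $decoded = decode( "UTF-8", $octets, $check );

printf "upgraded: %d chars, decoded: %d chars\n",
    length($upgraded), length($decoded);

# If the input is not UTF-8 at all, name its real encoding instead
# (assumed here to be ISO-8859-1).
my $latin1 = decode( "iso-8859-1", "caf\xE9", $check );
```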
(update: added the "perldoc utf8" link, to clarify the source of the quotation)