
I'm wondering why you would decide to use utf8::upgrade() on tainted strings -- as opposed to using Encode::decode(), which offers much better control over how malformed character data is handled. (If the data is tainted, how can you assume that it can always be treated as valid Unicode text?)

If you are doing taint checking at all, and you need to convert a tainted string to utf8 (or validate and flag it as a utf8 string), it would seem much more sensible to handle it like this:

use Encode qw(decode);

my $rawstring = ...;  # coming from a cgi param or whatever
my $utfstring;

# with FB_CROAK set, decode() dies on malformed input instead of substituting
eval { $utfstring = decode( "utf8", $rawstring, Encode::FB_CROAK ) };

if ( $@ ) {
    # do something sensible given that $rawstring is invalid
    # (i.e. cannot be converted successfully to utf8)
}
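To see how that pattern behaves on good and bad input, here is a minimal sketch. The helper name decode_utf8_strict and the sample strings are made up for illustration. It also makes two choices that go beyond the snippet above: it uses the strict "UTF-8" encoding name rather than the laxer "utf8", and it adds LEAVE_SRC so decode() does not modify the input string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): returns the decoded character
# string, or undef if the octets are not well-formed UTF-8.
sub decode_utf8_strict {
    my ($octets) = @_;
    my $chars = eval {
        decode( 'UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $chars;    # undef when decode() died
}

# Well-formed UTF-8 octets for "café": 0xC3 0xA9 encodes U+00E9.
my $good = decode_utf8_strict("caf\xC3\xA9");

# A bare 0xE9 (Latin-1 e-acute) followed by a space is not valid UTF-8.
my $bad = decode_utf8_strict("caf\xE9 au lait");

print defined $good
    ? "good: " . length($good) . " characters\n"
    : "good: rejected\n";
print defined $bad ? "bad: accepted\n" : "bad: rejected\n";
```

Returning undef keeps the caller's check to a single defined() test; a real application would probably also want to log or report $@.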

If your expected input (i.e. the tainted data that should pose no difficulty for proper untainting) is not a utf8 octet stream, then all the more reason to use Encode, because as perldoc utf8 says:

Note that this function does not handle arbitrary encodings. Therefore Encode.pm is recommended for the general purposes.

(emphasis in the original) But even if well-behaved input is expected to be utf8 octets, the fact that it's tainted means "don't count on that!"

In other words, don't use utf8::upgrade() on tainted strings. Period.
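A short sketch of why, using made-up sample octets: utf8::upgrade() only changes how Perl stores the string internally, so it neither decodes nor validates anything, whereas decode() actually interprets the octets.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "caf\xC3\xA9";    # UTF-8 encoded octets for "café", 5 bytes

# utf8::upgrade() changes only the internal storage; the string still
# holds the same five characters, so nothing was decoded or checked.
my $upgraded = $octets;
utf8::upgrade($upgraded);
printf "upgrade: %d characters, same as input: %s\n",
    length($upgraded), ( $upgraded eq $octets ? "yes" : "no" );

# decode() interprets the octets as UTF-8 and yields four characters.
my $decoded = decode( 'UTF-8', $octets,
    Encode::FB_CROAK() | Encode::LEAVE_SRC() );
printf "decode: %d characters\n", length($decoded);

# Octets that are not valid UTF-8 pass through utf8::upgrade() silently.
my $junk = "\xFF\xFE junk";
my $ok = eval { utf8::upgrade($junk); 1 };
print "upgrade on malformed octets: ", ( $ok ? "no error" : "died" ), "\n";
```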

(update: added the "perldoc utf8" link, to clarify the source of the quotation)

In reply to Re: utf8::upgrade is untainting by graff
in thread utf8::upgrade is untainting by tinita
