Re: Untainting text / unicode text

You need to be a little clearer (at least when you give instructions to your web clients) about which Unicode encoding you intend to support. Overall, utf8 will be best. Probably the easiest, surest way to validate it is to try decoding the input with the Encode module (in Perl 5.8.x) -- something like this:

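# assume that "$octets" is the string that has been received
# from a form, and is purported to be utf8 text:
...
use Encode;
...

my $utf8str;
# decode() (not encode()) is what checks the incoming bytes; the block
# eval traps the die from FB_CROAK, and LEAVE_SRC keeps $octets intact
eval { $utf8str = decode( 'utf8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC ) };
if ( $@ ) {
   # $octets was not really a valid utf8 string
}
...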

Of course, if you'd rather accept some other form of Unicode, such as UTF-16LE or UTF-16BE, just put one of those names in place of 'utf8' above. (Note that the 16-bit UTF-16 encodings do contain null bytes when conveying characters in the normal ASCII/Latin1 range, U+0000 - U+00FF.) But just stick with utf8 -- fewer traps.
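
To make the null-byte point concrete, here is a minimal sketch that dumps the bytes Encode produces for a short ASCII string in each encoding. The hex_bytes helper is a made-up name for illustration only, not part of Encode, and 'Hi' is just sample text:

use strict;
use warnings;
use Encode qw(encode);

# hex_bytes is a hypothetical helper: encode a character string and
# return the resulting bytes as space-separated two-digit hex values.
sub hex_bytes {
    my ( $encoding, $text ) = @_;
    my $octets = encode( $encoding, $text );
    return join ' ', map { sprintf '%02x', ord } split //, $octets;
}

print hex_bytes( 'UTF-16LE', 'Hi' ), "\n";   # 48 00 69 00
print hex_bytes( 'UTF-16BE', 'Hi' ), "\n";   # 00 48 00 69
print hex_bytes( 'utf8',     'Hi' ), "\n";   # 48 69

Every ASCII character picks up a 00 byte in either UTF-16 byte order, which is the sort of thing that can trip up code treating the data as a C-style string; the utf8 bytes for the same text are identical to plain ASCII.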

Since you're not really doing anything "risky" with the text, just the utf8 validation should be a sufficient safeguard -- and it is important to do this, if you want people to post their content in a consistent, meaningful, usable form.

Re: Re: Untainting text / unicode text by fireartist (Chaplain) on Jun 02, 2004 at 08:28 UTC
Yes, I've still to test whether setting the HTML page charset to utf8 causes typed input to be utf8. I really hope it does...
Re^3: Untainting text / unicode text by graff (Chancellor) on Jun 02, 2004 at 21:15 UTC
Well, depending on the language involved, keyboarding could be a serious and unavoidable problem... I don't have a clue what browsers do in terms of supporting i18n at the keyboard. Even doing a copy/paste of text that has been composed/displayed in some other app/window could be dicey -- I haven't been there or done that enough times, using anything besides English, to be confident about it.