untainting and locales and internationalisation

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

The perlsec and perllocale documentation pages make it very clear that if a program does:

use locale;

then untainting will not work if the regular expression used to untaint contains a locale-dependent character class such as \w, because under taint mode perl does not trust the locale information on the host system (if I understand correctly).

This makes sense, but what I can't find are any examples or documentation on the correct way to proceed when one wants to untaint text which may contain characters such as accented letters.

It sounds as if I somehow need to:
a) stop doing use locale
b) (re)define the common character classes such as \w myself to include things like accented characters

But I don't know how to do this, and I'm not sure whether this is the way to go.

Someone must have come across this problem before. I would really appreciate your advice/guidance. It feels like I'm missing something fundamental.

Surely many Perl web-based applications have to untaint data that contains non-English characters.

What is the correct secure way to untaint this data?

Replies are listed 'Best First'.
Re: untainting and locales and internationalisation by dtr (Scribe) on Aug 24, 2005 at 14:00 UTC
I believe the answer is to use the extended character classes (e.g. "[:alnum:]"), and to also say "use utf8" within your script.
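
A minimal sketch of that idea, under several assumptions that are not in the thread: the input arrives as UTF-8 encoded bytes, "use locale" is not in effect, and perl is 5.14 or later (for the /u modifier). Note that "use utf8" only tells perl that the source file itself is UTF-8; it does not decode incoming data, so the sketch decodes explicitly with Encode. The name untaint_word is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 bytes, then untaint via a regex capture.
# No "use locale" here, so the capture in $1 is not re-tainted.
sub untaint_word {
    my ($octets) = @_;
    return undef unless defined $octets;

    # Strict decoding: malformed UTF-8 croaks, and we treat that as a rejection.
    my $text = eval { decode('UTF-8', $octets, Encode::FB_CROAK()) };
    return undef unless defined $text;

    # /u forces Unicode rules, so [[:alnum:]] also matches accented letters.
    return $text =~ /\A([[:alnum:]]+)\z/u ? $1 : undef;
}
```

Called as untaint_word($value), it returns the decoded, untainted string when every character is alphanumeric, and undef otherwise. Whether "any Unicode letter or digit" is an acceptable whitelist depends on what the data is later used for.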
Re: untainting and locales and internationalisation by Anonymous Monk on Aug 26, 2005 at 19:39 UTC
You must define your regex to contain the characters you want EXPLICITLY, which is better from a security point of view anyway. If you are not sure whether a char is in a character class or not (which means it changes according to locale) then do something like `$re=join("",grep(/^\w$/,split(/(\w)/,"All chars you are prepared to deal with")));`
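
As a rough sketch of the explicit approach, assuming the input has already been decoded to characters and that "use locale" is not in effect: list every acceptable character in a bracketed class and untaint through the capture. The particular set of accented letters, the extra punctuation and the name untaint_text are illustrative assumptions only.

```perl
use strict;
use warnings;
use utf8;    # the pattern below contains literal UTF-8 characters

# Hypothetical whitelist: ASCII letters and digits, a handful of accented
# letters, space, apostrophe and hyphen. Adjust to what you accept.
my $allowed_re = qr/\A([A-Za-z0-9àâäçéèêëîïôöùûü '-]+)\z/;

# Returns the untainted copy of $input, or undef if any character
# falls outside the whitelist.
sub untaint_text {
    my ($input) = @_;
    return undef unless defined $input;
    return $input =~ $allowed_re ? $1 : undef;
}
```

Because the class names each character literally, its meaning does not change with the host's locale settings, which is the property the reply is asking for.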