You need to be a little clearer (at least when you give instructions to your web clients) about which form of Unicode you intend to support. Overall, utf8 will be best, and probably the easiest, surest way to validate it would be to use the Encode module (in Perl 5.8.x) -- something like this:
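# assume that "$octets" is the string that has been received
# from a form, and is purported to be utf8 text:
...
use Encode;
...
my $utf8str;
# decode() (not encode()) is what checks incoming octets; LEAVE_SRC
# keeps decode() from modifying $octets in place
eval { $utf8str = decode( 'utf8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC ) };
if ( $@ ) {
    # $octets was not really a valid utf8 string
}
...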
Of course, if you'd rather accept some other form of Unicode, such as UTF-16LE or UTF-16BE, just put one of those names in place of 'utf8' above. (Note that the UTF-16 encodings do contain null bytes when conveying characters in the normal ASCII/Latin1 range, U+0000 - U+00FF.) But just stick with utf8 -- fewer traps.
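
To make the encoding-name swap and the null-byte point concrete, here is a minimal sketch, not a definitive implementation. The helper name decode_or_undef and the sample strings are made up for illustration; only decode(), encode(), Encode::FB_CROAK and Encode::LEAVE_SRC come from the Encode module itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the original post): returns the decoded
# character string if $octets is valid in $encoding, or undef otherwise.
sub decode_or_undef {
    my ( $encoding, $octets ) = @_;
    my $chars = eval {
        decode( $encoding, $octets, Encode::FB_CROAK | Encode::LEAVE_SRC );
    };
    return $chars;    # undef if decode() died on malformed input
}

# Sample inputs, assumed for illustration only.
my $good = "caf\xC3\xA9 au lait";    # e-acute as two utf8 octets
my $bad  = "caf\xE9 au lait";        # e-acute as one Latin-1 octet

for my $octets ( $good, $bad ) {
    print defined decode_or_undef( 'utf8', $octets )
        ? "valid utf8\n"
        : "not valid utf8\n";
}

# The same ASCII text encoded as UTF-16LE carries one null byte per character.
my $utf16 = encode( 'UTF-16LE', 'abc' );
printf "UTF-16LE 'abc': %d bytes, contains NUL: %s\n",
    length($utf16), ( $utf16 =~ /\0/ ? 'yes' : 'no' );
```

Passing 'UTF-16LE' instead of 'utf8' as the first argument checks that encoding in the same way, which is all the swap described above amounts to.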

Since you're not really doing anything "risky" with the text, just the utf8 validation should be a sufficient safeguard -- and it is important to do this, if you want people to post their content in a consistent, meaningful, usable form.


In reply to Re: Untainting text / unicode text by graff
in thread Untainting text / unicode text by fireartist
