I'd be interested to know the risks involved.

The most obvious risk involved is that your program can halt if you have malformed internal data. The error "Malformed UTF-8 character" is fatal. Less obvious risks include security bugs because things may be interpreted differently at different levels: something may pass an untainting regex, but still be unsafe in a library call. This is because there is no single standard way of dealing with malformed byte sequences. With naive (yet common) C code it can even lead to data corruption.

The following change is in current blead:

--- perl-current/pod/perldiag.pod       2007-01-02 19:17:01.000000000 +0100
+++ mijn/pod/perldiag.pod       2007-03-03 18:12:23.000000000 +0100
@@ -2263,12 +2263,19 @@

 =item Malformed UTF-8 character (%s)

-(S utf8) (F) Perl detected something that didn't comply with UTF-8
-encoding rules.
+(S utf8) (F) Perl detected a string that didn't comply with UTF-8
+encoding rules, even though it had the UTF8 flag on.

-One possible cause is that you read in data that you thought to be in
-UTF-8 but it wasn't (it was for example legacy 8-bit data).  Another
-possibility is careless use of utf8::upgrade().
+One possible cause is that you set the UTF8 flag yourself for data that
+you thought to be in UTF-8 but it wasn't (it was for example legacy
+8-bit data). To guard against this, you can use Encode::decode_utf8.
+
+If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
+sequences are handled gracefully, but if you use C<:utf8>, the flag is
+set without validating the data, possibly resulting in this error
+message.
+
+See also L<Encode/"Handling Malformed Data">.
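
To make the guard from the patched text concrete, here is a minimal sketch of decoding strictly before the data gets anywhere near the UTF8 flag. The helper name strict_decode is hypothetical (it is mine, not from perldiag or Encode), and I use decode('UTF-8', ...) with FB_CROAK rather than decode_utf8 on the assumption that you want the strict variant that refuses malformed input outright.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode a byte string as strict UTF-8.
# Returns the decoded text, or undef if the bytes are not valid UTF-8,
# so malformed data never ends up in a string with the UTF8 flag on.
sub strict_decode {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $bytes in place.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text;
}

my $good = "caf\xC3\xA9";    # valid UTF-8 bytes
my $bad  = "caf\xE9";        # legacy 8-bit (Latin-1) byte, not valid UTF-8

for my $sample ($good, $bad) {
    print defined(strict_decode($sample)) ? "decoded\n" : "rejected\n";
}
```

Returning undef is just one way to handle the failure; depending on the application you may prefer to let the exception propagate, or to fall back to decoding as a legacy 8-bit encoding.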

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

In reply to Re^4: A UTF8 round trip with MySQL by Juerd
in thread A UTF8 round trip with MySQL by clinton
