
Since you are retrieving the strings from a database, Perl probably starts out treating each string as just a set of octets (bytes, binary data), not "characters" in the Unicode sense -- until you pop it into a regex.

You don't say what version of Perl you have (5.6.1? 5.8.0?); see whether you have the Encode module, and if you have it, try something like this:

use Encode;
...
my ( $stringFromDB, $utf8string );
#
# do whatever it is that queries the database and
# assigns a string to $stringFromDB...
#
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# decode() from modifying $stringFromDB in place.
eval {
    $utf8string = decode( 'utf8', $stringFromDB,
                          Encode::FB_CROAK | Encode::LEAVE_SRC );
};

if ( $@ ) {
    warn "DB value $stringFromDB is malformed UTF-8\n";
}
...

This tries to decode the octets in $stringFromDB as utf8 and return a Perl-internal character string. If the data is already valid utf8, the bytes themselves do not change, but the variable being assigned to will have its "utf8 flag" set (whereas this flag is probably not set on the octet string). When the data is malformed, the FB_CROAK argument tells decode to die on failure, so you can trap that with eval.

(As shown above, the "warn" usage might cause some other sort of warning as well, about "wide characters in print statement" or some such, but I haven't tested this specifically.)
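
To see the difference between the octet string and the decoded string, here is a small self-contained sketch. The helper name decode_or_undef and the sample strings are made up for illustration, and it uses the strict 'UTF-8' encoding name rather than Perl's more lenient 'utf8'; treat it as one possible way to wrap the check, not the only one.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original post): returns the decoded
# character string, or undef if the octets are not valid UTF-8.
sub decode_or_undef {
    my ($octets) = @_;
    my $chars = eval {
        # FB_CROAK: die on malformed input; LEAVE_SRC: do not modify $octets.
        decode( 'UTF-8', $octets,
                Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $chars;    # undef when decode() died inside the eval
}

my $good = "caf\xC3\xA9";    # "cafe" with e-acute, as UTF-8 octets
my $bad  = "\xE9t\xE9";      # Latin-1 bytes, not valid UTF-8

my $decoded = decode_or_undef($good);
printf "octets: %d, characters: %d, utf8 flag: %s\n",
    length($good), length($decoded),
    utf8::is_utf8($decoded) ? 'on' : 'off';

print defined decode_or_undef($bad)
    ? "bad string decoded\n"
    : "bad string rejected\n";
```

The octet string has length 5 and the decoded string has length 4, because the two-byte sequence for the accented letter becomes a single character once Perl knows the data is UTF-8.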

In reply to Re: Malformed UTF-8 characters in Regular Expressions by graff
in thread Malformed UTF-8 characters in Regular Expressions by Wonko El Sano
