Re: Malformed UTF-8 characters in Regular Expressions

Since you are retrieving the strings from a database, Perl probably starts out treating each string as just a sequence of octets (bytes, binary data), not "characters" in the Unicode sense -- until you pop it into a regex.

You don't say what version of Perl you have (5.6.1? 5.8.0?); see whether you have the Encode module, and if you have it, try something like this:

use Encode;
...
my ( $stringFromDB, $utf8string );
#
# do whatever it is that queries the database and
# assigns a string to $stringFromDB...
#
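# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# decode() from modifying $stringFromDB in place.
eval {
    $utf8string = decode( 'utf8', $stringFromDB,
                          Encode::FB_CROAK | Encode::LEAVE_SRC );
};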

if ( $@ ) {
    warn "DB value $stringFromDB is Malformed UTF8\n";
}
...

This tries to "convert" the octets in $stringFromDB from utf8 into a Perl-internal character string. In effect, if the data is already valid utf8, the bytes themselves don't change, but the variable being assigned to will have its "utf8 flag" set (whereas this flag is probably not set in the "octet" string). When the data is malformed, the FB_CROAK arg tells decode to die, so you can trap that with eval.
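To make the octets-versus-characters difference visible, here is a minimal sketch. The helper name decode_or_undef and the sample strings are made up for illustration, and it uses the strict 'UTF-8' encoding name rather than the 'utf8' shown earlier, so treat it as one possible way to check, not the definitive one:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): returns the decoded
# character string, or undef if the octets are not valid UTF-8.
sub decode_or_undef {
    my ($octets) = @_;
    my $chars = eval {
        # FB_CROAK: die on malformed data; LEAVE_SRC: don't touch $octets.
        decode( 'UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC );
    };
    return $chars;    # undef when decode() died inside the eval
}

my $octets = "caf\xC3\xA9";    # "cafe" with e-acute, as raw UTF-8 bytes
my $chars  = decode_or_undef($octets);

# The byte string counts bytes; the decoded string counts characters.
printf "octets: length %d, utf8 flag %s\n",
    length($octets), ( utf8::is_utf8($octets) ? 'on' : 'off' );
printf "chars:  length %d, utf8 flag %s\n",
    length($chars), ( utf8::is_utf8($chars) ? 'on' : 'off' );

# A lone \xE9 is a Latin-1 e-acute, which is not valid UTF-8.
my $bad = decode_or_undef("caf\xE9");
print "malformed input rejected\n" unless defined $bad;
```

The length goes from 5 to 4 because the two-byte sequence becomes a single character once decoded; that is the same distinction the regex engine cares about.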

(As shown above, the "warn" usage might cause some other sort of warning as well, along the lines of "Wide character in ..." or some such, but I haven't tested this specifically.)
