Malformed UTF-8 characters in Regular Expressions

Wonko El Sano has asked for the wisdom of the Perl Monks concerning the following question:

Re: Malformed UTF-8 characters in Regular Expressions by strat (Canon) on Feb 20, 2003 at 11:57 UTC
Some time ago I got a similar error message (with perl 5.6, I think). IIRC the problem was `$anything =~ /$string/;` Since my strings were not Unicode strings, I was able to solve it by quoting the string, e.g. `$anything =~ /\Q$string\E/;` But I don't know whether you really have the same problem I had. Best regards, perl -e "s>>F>e=>y)\martinF)stronat)=>print,print v8.8.8.32.11.32"
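To illustrate the quoting fix, here is a minimal sketch. The sample strings are made up for the example and are not from the original question. `\Q...\E` escapes regex metacharacters in the interpolated variable, so the string is matched literally instead of being parsed as a pattern.

```perl
use strict;
use warnings;

# Hypothetical sample data: a search string containing regex metacharacters.
my $string   = 'price (USD) [net]';
my $anything = 'Total price (USD) [net]: 42';

# Without quoting, "(USD)" is a capture group and "[net]" a character class,
# so the pattern does not match the literal text.
my $unquoted = $anything =~ /$string/ ? 1 : 0;

# With \Q...\E every metacharacter is escaped and the match is literal.
my $quoted = $anything =~ /\Q$string\E/ ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted\n";   # prints "unquoted: 0, quoted: 1"
```

Quoting only helps when the string is meant as literal text. If the string is supposed to be a pattern, `\Q` would change its meaning.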
Re: Malformed UTF-8 characters in Regular Expressions by graff (Chancellor) on Feb 20, 2003 at 02:51 UTC
Since you are retrieving the strings from a database, Perl probably starts out treating each string as a plain sequence of octets (bytes, binary data), not as "characters" in the Unicode sense -- until you pop it into a regex. You don't say which version of Perl you have (5.6.1? 5.8.0?); check whether you have the Encode module, and if you do, try something like this:

    use Encode;
    ...
    my ( $stringFromDB, $utf8string );
    #
    # do whatever it is that queries the database and
    # assigns a string to $stringFromDB...
    #
    eval {
        # LEAVE_SRC keeps decode() from modifying $stringFromDB in place
        $utf8string = decode( 'utf8', $stringFromDB,
                              Encode::FB_CROAK | Encode::LEAVE_SRC );
    };
    if ( $@ ) {
        warn "DB value $stringFromDB is malformed UTF-8\n";
    }
    ...

This tries to "convert" the octets in $stringFromDB from UTF-8 into a Perl-internal character string. In effect, if the data is already valid UTF-8, the content does not change, but the variable being assigned to will have its "utf8 flag" set (whereas this flag is probably not set on the octet string). When the data is malformed, the FB_CROAK argument tells decode to die on failure, so you can trap that with eval. (As shown above, the "warn" call might trigger some other sort of warning as well, about "Wide character" in the output or some such, but I haven't tested this specifically.)
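The decode-and-trap approach can be sketched as a small self-contained script. The helper name `decode_utf8_or_undef` and the sample byte strings are assumptions for illustration; only `decode`, `Encode::FB_CROAK` and `Encode::LEAVE_SRC` come from the Encode module. The sketch uses the encoding name 'UTF-8', which current Encode releases treat as the strict variant.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded character string,
# or undef if the octets are not well-formed UTF-8.
sub decode_utf8_or_undef {
    my ($octets) = @_;
    my $chars = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps it from modifying $octets in place.
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $chars;    # undef when decode() died inside the eval
}

my $good = "caf\xC3\xA9";    # e-acute encoded as two UTF-8 octets
my $bad  = "caf\xE9";        # a lone Latin-1 byte, not valid UTF-8

for my $octets ($good, $bad) {
    my $chars = decode_utf8_or_undef($octets);
    print defined $chars
        ? "valid, " . length($chars) . " characters\n"
        : "malformed\n";
}
```

Returning undef rather than dying lets the caller decide what to do with a bad database value, for example skipping the row or falling back to a different encoding.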
Re: Malformed UTF-8 characters in Regular Expressions by John M. Dlugosz (Monsignor) on Feb 19, 2003 at 23:50 UTC
Force the string into byte mode, then check whether it fails to match a regex that accepts only valid UTF-8 sequences. I wrote one a while back but don't know whether I could find it. Basically, look at the spec: a byte in 0x00-0x7F is OK on its own, as is binary 110xxxxx followed by one continuation byte. 110xxxxx is just 11000000 through 11011111 inclusive, so you can write that lead byte as `[\xC0-\xDF]`. A continuation byte is 10xxxxxx, i.e. `[\x80-\xBF]`. Repeat for the 3- and 4-byte forms: 1110xxxx followed by 2 continuation bytes, and 11110xxx followed by 3 continuation bytes. -- John
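A regex along these lines might look like the sketch below. It is not the regex mentioned in the post. It is slightly stricter than the lead-byte ranges described there, following the Unicode standard's table of well-formed byte sequences, so it also rejects overlong forms (lead bytes C0 and C1), surrogates, and code points above U+10FFFF. The helper name `is_valid_utf8` is hypothetical, and the input is assumed to be a byte string already.

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence, written as byte ranges.
my $utf8_char = qr/
      [\x00-\x7F]                                # 1 byte: ASCII
    | [\xC2-\xDF]         [\x80-\xBF]            # 2 bytes
    | \xE0                [\xA0-\xBF][\x80-\xBF] # 3 bytes, no overlong forms
    | [\xE1-\xEC\xEE\xEF] [\x80-\xBF]{2}         # 3 bytes
    | \xED                [\x80-\x9F][\x80-\xBF] # 3 bytes, no surrogates
    | \xF0                [\x90-\xBF][\x80-\xBF]{2}  # 4 bytes, no overlong forms
    | [\xF1-\xF3]         [\x80-\xBF]{3}         # 4 bytes
    | \xF4                [\x80-\x8F][\x80-\xBF]{2}  # 4 bytes, up to U+10FFFF
/x;

# Hypothetical helper: returns 1 if the whole byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    return $octets =~ /\A (?: $utf8_char )* \z/x ? 1 : 0;
}

print is_valid_utf8("caf\xC3\xA9"), "\n";    # two-byte sequence, well-formed
print is_valid_utf8("caf\xE9"), "\n";        # lead byte without continuations
```

Because each alternative starts with a different lead-byte range, at most one branch can match at any position, so the check does not backtrack between alternatives.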
Re: Malformed UTF-8 characters in Regular Expressions by John M. Dlugosz (Monsignor) on Feb 19, 2003 at 23:51 UTC
You could see whether the warning in question can be disabled with `no warnings 'something';` scoped to just the regex of interest. Use `use warnings;` instead of `-w`. -- John
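A scoped version of this idea might look like the sketch below. It assumes the message belongs to the `utf8` warnings category, which is the category perldiag lists for "Malformed UTF-8 character"; the post itself leaves the category open. The sample data is made up.

```perl
use strict;
use warnings;    # lexical warnings instead of the global -w switch

# Hypothetical sample data standing in for a value read from the database.
my $octets = "caf\xE9 au lait";

my $matched;
{
    # Assumption: the warning is in the 'utf8' category.
    # The pragma is lexical, so it ends with this block.
    no warnings 'utf8';
    $matched = $octets =~ /\bau\b/ ? 1 : 0;
}

# All warnings are enabled again from here on.
print "matched: $matched\n";
```

Silencing the warning hides the symptom only; the data is still malformed, so this is best combined with one of the validation approaches from the other replies.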