Wonko El Sano has asked for the wisdom of the Perl Monks concerning the following question:

I am parsing a file created by an unknown database and served up by a Windows server. Just a couple of lines contain malformed UTF-8 characters, but feeding one of them into a regex produces a warning message. I could simply drop the -w option, but I don't want that as a permanent solution because it would hide other problems that may turn up in the script later. Does anyone know how I might detect a malformed UTF-8 character before the regex works on it? Thanks, Wonko El Sano

Replies are listed 'Best First'.
Re: Malformed UTF-8 characters in Regular Expressions
by strat (Canon) on Feb 20, 2003 at 11:57 UTC
    Some time ago I got a similar error message (I think it was with perl5.6). IIRC the problem was $anything =~ /$string/;

    Since my strings were not Unicode strings, I was able to solve it by quoting the string, e.g. $anything =~ /\Q$string\E/;
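
    As a minimal sketch of what \Q...\E changes (the variable contents here are hypothetical, and this only illustrates metacharacter quoting, not the UTF-8 warning itself): without quoting, the interpolated string is compiled as a pattern; with quoting, it is matched literally.

```perl
use strict;
use warnings;

# Hypothetical values, chosen so that the string contains a regex metacharacter.
my $string   = 'a+b';
my $anything = 'sum: a+b = c';

# Unquoted: "a+b" is read as "one or more 'a' followed by 'b'".
my $unquoted = ( $anything =~ /$string/ ) ? 1 : 0;

# Quoted: \Q...\E escapes the "+" so the text is matched literally.
my $quoted = ( $anything =~ /\Q$string\E/ ) ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted\n";
```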

    But I don't know if you have really got the same problem as me.

    Best regards,
    perl -e "s>>*F>e=>y)\*martinF)stronat)=>print,print v8.8.8.32.11.32"

Re: Malformed UTF-8 characters in Regular Expressions
by graff (Chancellor) on Feb 20, 2003 at 02:51 UTC
    Since you are retrieving the strings from a database, it is probably true that perl starts out assuming that the string is just a set of octets (bytes, binary data), not "characters" in the unicode sense -- until you pop it into a regex.

    You don't say what version of Perl you have (5.6.1? 5.8.0?); see whether you have the Encode module, and if you have it, try something like this:

    use Encode;
    ...
    my ( $stringFromDB, $utf8string );
    #
    # do whatever it is that queries the database and
    # assigns a string to $stringFromDB...
    #
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $stringFromDB in place.
    eval {
        $utf8string = decode( 'utf8', $stringFromDB,
                              Encode::FB_CROAK | Encode::LEAVE_SRC );
    };
    if ( $@ ) {
        warn "DB value $stringFromDB is Malformed UTF8\n";
    }
    ...

    This tries to "convert" the "octets" in $stringFromDB from utf8 into an "official" utf8 (Perl-internal) string -- in effect, if the data is already valid utf8, nothing changes, but the variable being assigned to will have its "utf8 flag" set (whereas this flag is probably not set in the "octet" string). When the data is malformed, setting the FB_CROAK arg tells decode to die on failure, so you can trap that with eval.
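
    The effect on the "utf8 flag" can be seen in a small sketch like the following. The sample byte strings are made up for illustration, and the sketch assumes the strict 'UTF-8' encoding name rather than the 'utf8' name used above.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical sample data: "cafe" with an e-acute encoded as two UTF-8 bytes.
my $octets = "caf\xC3\xA9";

# LEAVE_SRC keeps decode() from altering $octets while it checks the input.
my $chars = decode( 'UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC );

printf "octets: length %d, flag %s\n", length($octets), is_utf8($octets) ? 'on' : 'off';
printf "chars:  length %d, flag %s\n", length($chars),  is_utf8($chars)  ? 'on' : 'off';

# A lone \xE9 announces a three-byte sequence that never arrives, so decode() dies.
my $bad = "caf\xE9";
my $ok  = eval { decode( 'UTF-8', $bad, Encode::FB_CROAK | Encode::LEAVE_SRC ); 1 };
print "malformed input detected\n" unless $ok;
```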

    (As shown above, the "warn" usage might cause some other sort of warning as well, about "wide characters in print statement" or some such, but I haven't tested this specifically.)

Re: Malformed UTF-8 characters in Regular Expressions
by John M. Dlugosz (Monsignor) on Feb 19, 2003 at 23:51 UTC
    You could see whether the warning in question can be disabled with no warnings 'something'; scoped to just the regex of interest. Use the warnings pragma (use warnings;) instead of -w.
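
    A sketch of that scoping, assuming the warning belongs to the 'utf8' warnings category (the category perl uses for its "Malformed UTF-8 character" diagnostics). The sample line and pattern are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input line containing a byte that is not valid UTF-8.
my $line = "caf\xE9 au lait";

my $matched;
{
    # Only the 'utf8' category is switched off, and only inside this block;
    # every other warning stays enabled for the rest of the script.
    no warnings 'utf8';
    $matched = ( $line =~ /au lait/ ) ? 1 : 0;
}

print "matched: $matched\n";
```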

    —John

Re: Malformed UTF-8 characters in Regular Expressions
by John M. Dlugosz (Monsignor) on Feb 19, 2003 at 23:50 UTC
    Force the string into byte mode, then see if it fails to match a regex that finds valid UTF-8 characters. I wrote one a while back but don't know if I could find it. Basically look at the spec: a byte in the range 0-0x7f is OK on its own, or binary 110xxxxx followed by one continuation byte. Now 110xxxxx is just 11000000 through 11011111 inclusive, so you can write that as \xC0-\xDF. The continuation is 10xxxxxx, i.e. \x80-\xBF. Repeat for the 3 and 4 byte forms: 1110xxxx followed by 2 continuation bytes, and 11110xxx followed by 3 continuation bytes.
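
    This is not the regex mentioned above, only a minimal sketch built from the bit patterns just described. The names $utf8_char and looks_like_utf8 are hypothetical. It checks structure only, so it does not reject overlong encodings or surrogates, and it assumes the input is a byte string.

```perl
use strict;
use warnings;

# One structurally well-formed UTF-8 character, following the bit patterns above.
my $utf8_char = qr/
      [\x00-\x7F]                  # 0xxxxxxx: single byte
    | [\xC0-\xDF][\x80-\xBF]       # 110xxxxx + 1 continuation byte
    | [\xE0-\xEF][\x80-\xBF]{2}    # 1110xxxx + 2 continuation bytes
    | [\xF0-\xF7][\x80-\xBF]{3}    # 11110xxx + 3 continuation bytes
/x;

# Returns 1 if every byte of the string belongs to such a sequence, else 0.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:$utf8_char)*\z/ ? 1 : 0;
}

# Hypothetical sample lines: the first is well-formed, the second is not.
for my $line ( "caf\xC3\xA9", "caf\xE9" ) {
    print looks_like_utf8($line) ? "ok\n" : "malformed\n";
}
```

    A line that fails the check can then be skipped or repaired before it reaches the regex that triggers the warning.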

    —John