in reply to How to know if a string has UTF-8?
```perl
if ( /[^\x00-\x7f]/ )  # true if $_ contains any non-ascii character
```

The second case is trickier: given that a string contains non-ascii data, how would you know whether it's utf8 or something else? Here, the Encode module in Perl 5.8 would provide the best means for solving this -- though I do not agree with chromatic's suggestion (I'll reply to that separately).
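Run in isolation, that one-line test sorts strings into the two camps like this (a quick sketch; the sample strings are mine):

```perl
use strict;
use warnings;

# Classify a few sample strings with the non-ascii test.
for my $s ( "hello", "na\xc3\xafve" ) {
    if ( $s =~ /[^\x00-\x7f]/ ) {
        print "non-ascii: ", unpack( "H*", $s ), "\n";  # show raw bytes as hex
    }
    else {
        print "pure ascii: $s\n";
    }
}
```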
If you are using 5.8, and want to test whether an arbitrary string value contains valid utf8 data, do it like this:

```perl
use Encode;
...  # load the string into $_, then:
my $test;
eval { $test = decode( 'utf8', $_, Encode::FB_CROAK ) };
if ( $@ ) {
    # Encode failed/died: $_ was not a valid utf8 string
}
```

But in this case, bear in mind that every pure-ascii string constitutes a valid utf8 string -- so the first test mentioned above (testing for non-ascii characters) would still be needed, if you have to know the answer to both questions (1 and 2 above).
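If you want that test as a reusable predicate, the eval can be wrapped in a small function (a sketch; `is_valid_utf8` is my name for it, not part of Encode). Passing a copy to `decode()` matters, because `decode()` may modify its source argument:

```perl
use strict;
use warnings;
use Encode;

# Sketch of a reusable wrapper around the Encode-based test above.
# The name is_valid_utf8 is mine, not part of Encode.
sub is_valid_utf8 {
    my ($octets) = @_;   # a copy, so decode() is free to clobber it
    return eval { Encode::decode( 'utf8', $octets, Encode::FB_CROAK ); 1 } ? 1 : 0;
}

print is_valid_utf8("plain ascii") ? "valid\n" : "invalid\n";
print is_valid_utf8("caf\xc3\xa9") ? "valid\n" : "invalid\n";
print is_valid_utf8("\x80 stray")  ? "valid\n" : "invalid\n";  # stray continuation byte
```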
If you don't have Perl 5.8 (hence no Encode module), there may be some means to check for valid utf8 content using tools in 5.6, but I'm not personally familiar with the unicode support in that version.
The definition of "valid utf8" is of course quite specific, and it wouldn't be hard to roll your own test for it, even with pre-unicode versions of Perl. Basically, in order to qualify as utf8, a string must either be completely ascii (nothing has high-bit set), or else the bytes with high-bit set come in pairs or triplets, and can be checked as follows:
```perl
my @bytes = unpack( "C*", $_ );  # break string into bytes
my $widec = "";  # accumulate valid utf8 bytes here
my $width = 0;   # keep track of how many bytes to accumulate
for my $b ( @bytes ) {
    if (( $b & 0xf0 ) == 0xe0 or  # high 4 bits are 1110
        ( $b & 0xe0 ) == 0xc0 )   # high 3 bits are 110
    {   # either condition represents the start of a multibyte-char
        die "Bad byte sequence\n" if ( $width );
        $width = (( $b & 0xe0 ) == 0xe0 ) ? 3 : 2;
        $widec .= chr( $b );
    }
    elsif (( $b & 0xc0 ) == 0x80 )  # high 2 bits are 10
    {   # this should be a continuation of a multibyte-char
        die "Bad byte sequence\n" unless ( $width );
        $widec .= chr( $b );
    }
    elsif (( $b & 0x80 ) == 0 )  # this is an ascii byte
    {   # cannot occur while assembling a multibyte-char
        die "Bad byte sequence\n" if ( $width );
        $width = 1;
        $widec = chr( $b );
    }
    else {
        die "Bad byte value\n";  # all four high-bits set
    }
    if ( length( $widec ) == $width ) {
        $width = 0;
        $widec = "";
    }
}
die "Incomplete multibyte char\n" if ( $width );
# get here if the string was valid utf8
```

There are probably more elegant or concise ways to lay out that logic, but that's the basic utf8 rule set in a nutshell. (The official Unicode Consortium spec actually covers 32-bit code points rendered in utf8, as well as the common 16-bit code points, and the above logic probably doesn't get that part right -- but your chances of encountering a 32-bit code point in utf8 are pretty much nil, I think.)
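To use that check without killing the program on bad input, the die-based logic can be wrapped in an eval so it returns true or false instead (a sketch; the name `looks_like_utf8` is mine):

```perl
use strict;
use warnings;

# Sketch: the hand-rolled check above, wrapped in an eval so a bad
# byte sequence yields a false return rather than a fatal error.
# The name looks_like_utf8 is mine, not from the original post.
sub looks_like_utf8 {
    my @bytes = unpack( "C*", shift );
    my ( $widec, $width ) = ( "", 0 );
    my $ok = eval {
        for my $b ( @bytes ) {
            if (( $b & 0xf0 ) == 0xe0 or ( $b & 0xe0 ) == 0xc0 ) {
                die "Bad byte sequence\n" if $width;    # start byte mid-char
                $width = (( $b & 0xe0 ) == 0xe0 ) ? 3 : 2;
                $widec .= chr( $b );
            }
            elsif (( $b & 0xc0 ) == 0x80 ) {
                die "Bad byte sequence\n" unless $width;  # stray continuation
                $widec .= chr( $b );
            }
            elsif (( $b & 0x80 ) == 0 ) {
                die "Bad byte sequence\n" if $width;    # ascii mid-char
                ( $width, $widec ) = ( 1, chr( $b ) );
            }
            else {
                die "Bad byte value\n";                 # all four high bits set
            }
            ( $width, $widec ) = ( 0, "" ) if length( $widec ) == $width;
        }
        die "Incomplete multibyte char\n" if $width;
        1;
    };
    return $ok ? 1 : 0;
}

print looks_like_utf8("caf\xc3\xa9") ? "looks like utf8\n" : "not utf8\n";
print looks_like_utf8("\xb9\xfa")    ? "looks like utf8\n" : "not utf8\n";
```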
There is also the plausible chance that a string containing some other multi-byte encoding, such as GB, Big5, KSC, etc. (or just random binary data), might pass this particular utf8 test -- and when treated as utf8 data, it would produce gibberish. Let's hope you don't end up there...
update: I see that you already caught the problem with chromatic's initial suggestion; ++ for that! It also leads me to think that you do have access to perl 5.8, so you can ignore my home-grown utf8-validity check. Naturally, the Encode module will not only validate that the byte sequence is plausible for utf8 data, but will also know whether or not a multibyte sequence maps to a defined code point, which is an important added feature.
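One way to see that extra layer of checking (a sketch, assuming Encode's strict `'UTF-8'` codec, as opposed to the lax `'utf8'` one): a 3-byte sequence can be structurally well-formed per the byte rules above and still encode a surrogate code point, which the strict codec rejects:

```perl
use strict;
use warnings;
use Encode;

# "\xed\xa0\x80" follows the 1110xxxx / 10xxxxxx byte pattern, so it
# passes a purely structural check -- but it encodes U+D800, a
# surrogate, which Encode's strict 'UTF-8' codec refuses.
my $surrogate = "\xed\xa0\x80";
my $ok = eval { Encode::decode( 'UTF-8', $surrogate, Encode::FB_CROAK ); 1 };
print $ok ? "accepted\n" : "rejected by strict UTF-8\n";
```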