in reply to How to know if a string has UTF-8?
```perl
if ( /[^\x00-\x7f]/ )  # true if $_ contains any non-ascii character
```

The second case is trickier: given that a string contains non-ascii data, how would you know whether it's utf8 or something else? Here, the Encode module in Perl 5.8 would provide the best means for solving this -- though I do not agree with chromatic's suggestion (I'll reply to that separately).
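Run in isolation, that one-line test sorts strings into the two camps like this (a quick sketch; the sample strings are mine):

```perl
use strict;
use warnings;

# Classify a few sample strings with the non-ascii test.
for my $s ( "hello", "na\xc3\xafve" ) {
    if ( $s =~ /[^\x00-\x7f]/ ) {
        print "non-ascii: ", unpack( "H*", $s ), "\n";  # show raw bytes as hex
    }
    else {
        print "pure ascii: $s\n";
    }
}
```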
If you are using 5.8, and want to test whether an arbitrary string value contains valid utf8 data, do it like this:

```perl
use Encode;
...  # load the string into $_, then:
my $test;
eval { $test = decode( 'utf8', $_, Encode::FB_CROAK ) };
if ( $@ ) {
    # Encode failed/died: $_ was not a valid utf8 string
}
```

But in this case, bear in mind that every pure-ascii string constitutes a valid utf8 string -- so the first test mentioned above (testing for non-ascii characters) would still be needed, if you have to know the answer to both questions (1 and 2 above).
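If you want that test as a reusable predicate, the eval can be wrapped in a small function (a sketch; `is_valid_utf8` is my name for it, not part of Encode). Passing a copy to `decode()` matters, because `decode()` may modify its source argument:

```perl
use strict;
use warnings;
use Encode;

# Sketch of a reusable wrapper around the Encode-based test above.
# The name is_valid_utf8 is mine, not part of Encode.
sub is_valid_utf8 {
    my ($octets) = @_;   # a copy, so decode() is free to clobber it
    return eval { Encode::decode( 'utf8', $octets, Encode::FB_CROAK ); 1 } ? 1 : 0;
}

print is_valid_utf8("plain ascii") ? "valid\n" : "invalid\n";
print is_valid_utf8("caf\xc3\xa9") ? "valid\n" : "invalid\n";
print is_valid_utf8("\x80 stray")  ? "valid\n" : "invalid\n";  # stray continuation byte
```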
If you don't have Perl 5.8 (hence no Encode module), there may be some means to check for valid utf8 content using tools in 5.6, but I'm not personally familiar with the unicode support in that version.
The definition of "valid utf8" is of course quite specific, and it wouldn't be hard to roll your own test for it, even with pre-unicode versions of Perl. Basically, in order to qualify as utf8, a string must either be completely ascii (nothing has high-bit set), or else the bytes with high-bit set come in pairs or triplets, and can be checked as follows:
```perl
my @bytes = unpack( "C*", $_ );  # break string into bytes
my $widec = "";  # accumulate valid utf8 bytes here
my $width = 0;   # keep track of how many bytes to accumulate
for my $b ( @bytes ) {
    if (( $b & 0xf0 ) == 0xe0 or  # high 4 bits are 1110
        ( $b & 0xe0 ) == 0xc0 )   # high 3 bits are 110
    {   # either condition represents the start of a multibyte-char
        die "Bad byte sequence\n" if ( $width );
        $width = (( $b & 0xe0 ) == 0xe0 ) ? 3 : 2;
        $widec .= chr( $b );
    }
    elsif (( $b & 0xc0 ) == 0x80 )  # high 2 bits are 10
    {   # this should be a continuation of a multibyte-char
        die "Bad byte sequence\n" unless ( $width );
        $widec .= chr( $b );
    }
    elsif (( $b & 0x80 ) == 0 )  # this is an ascii byte
    {   # cannot occur while assembling a multibyte-char
        die "Bad byte sequence\n" if ( $width );
        $width = 1;
        $widec = chr( $b );
    }
    else {
        die "Bad byte value\n";  # all four high-bits set
    }
    if ( length( $widec ) == $width ) {
        $width = 0;
        $widec = "";
    }
}
die "Incomplete multibyte char\n" if ( $width );
# get here if the string was valid utf8
```

There are probably more elegant or concise ways to lay out that logic, but that's the basic utf8 rule set in a nutshell. (The official Unicode Consortium spec actually covers 32-bit code points rendered in utf8, as well as the common 16-bit code points, and the above logic probably doesn't get that part right -- but your chances of encountering a 32-bit code point in utf8 are pretty much nil, I think.)
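To use that check without killing the program on bad input, the die-based logic can be wrapped in an eval so it returns true or false instead (a sketch; the name `looks_like_utf8` is mine):

```perl
use strict;
use warnings;

# Sketch: the hand-rolled check above, wrapped in an eval so a bad
# byte sequence yields a false return rather than a fatal error.
# The name looks_like_utf8 is mine, not from the original post.
sub looks_like_utf8 {
    my @bytes = unpack( "C*", shift );
    my ( $widec, $width ) = ( "", 0 );
    my $ok = eval {
        for my $b ( @bytes ) {
            if (( $b & 0xf0 ) == 0xe0 or ( $b & 0xe0 ) == 0xc0 ) {
                die "Bad byte sequence\n" if $width;    # start byte mid-char
                $width = (( $b & 0xe0 ) == 0xe0 ) ? 3 : 2;
                $widec .= chr( $b );
            }
            elsif (( $b & 0xc0 ) == 0x80 ) {
                die "Bad byte sequence\n" unless $width;  # stray continuation
                $widec .= chr( $b );
            }
            elsif (( $b & 0x80 ) == 0 ) {
                die "Bad byte sequence\n" if $width;    # ascii mid-char
                ( $width, $widec ) = ( 1, chr( $b ) );
            }
            else {
                die "Bad byte value\n";                 # all four high bits set
            }
            ( $width, $widec ) = ( 0, "" ) if length( $widec ) == $width;
        }
        die "Incomplete multibyte char\n" if $width;
        1;
    };
    return $ok ? 1 : 0;
}

print looks_like_utf8("caf\xc3\xa9") ? "looks like utf8\n" : "not utf8\n";
print looks_like_utf8("\xb9\xfa")    ? "looks like utf8\n" : "not utf8\n";
```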
There is also the plausible chance that a string containing some other multi-byte encoding, such as GB, Big5, KSC, etc. (or just random binary data), might pass this particular utf8 test -- and when treated as utf8 data, it would produce gibberish. Let's hope you don't end up there...
update: I see that you already caught the problem with chromatic's initial suggestion; ++ for that! It also leads me to think that you do have access to perl 5.8, so you can ignore my home-grown utf8-validity check. Naturally, the Encode module will not only validate that the byte sequence is plausible for utf8 data, but will also know whether or not a multibyte sequence maps to a defined code point, which is an important added feature.
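One way to see that extra layer of checking (a sketch, assuming Encode's strict `'UTF-8'` codec, as opposed to the lax `'utf8'` one): a 3-byte sequence can be structurally well-formed per the byte rules above and still encode a surrogate code point, which the strict codec rejects:

```perl
use strict;
use warnings;
use Encode;

# "\xed\xa0\x80" follows the 1110xxxx / 10xxxxxx byte pattern, so it
# passes a purely structural check -- but it encodes U+D800, a
# surrogate, which Encode's strict 'UTF-8' codec refuses.
my $surrogate = "\xed\xa0\x80";
my $ok = eval { Encode::decode( 'UTF-8', $surrogate, Encode::FB_CROAK ); 1 };
print $ok ? "accepted\n" : "rejected by strict UTF-8\n";
```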