gmpassos has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem.

I need to know, in an automatic way, whether a string has UTF-8 characters or not!

This will be used to generate a document from a group of strings (Perl SCALARs), which is saved to disk in binary (bytes from 0 to 255). So, I need to say in the headers of the file whether it's in UTF-8 or not.

In other words, when the set of strings (SCALARs) is passed to the function that builds the doc, it needs to detect automatically whether any of them contain UTF-8.

Graciliano M. P.
"The creativity is the expression of the liberty".

Replies are listed 'Best First'.
Re: How to know if a string has UTF-8?
by graff (Chancellor) on May 28, 2003 at 08:57 UTC
    I know you're trying to be clear, but there is still something missing... Which of the following best matches your intended question:
    1. Your string contains either plain ascii or utf8, and you need to know which.
    2. Your string contains some sort of non-ascii data, and you need to know whether or not this data constitutes a valid utf8 string.
    Answering the first case is quite simple -- if you already happen to know that any non-ascii data is sure to be utf8 content, then the presence of any bytes with the high bit set is the only diagnosis you need, and the easiest, most portable way to do that would be:
    if ( /[^\x00-\x7f]/ ) # true if $_ contains any non-ascii character
    The second case is trickier: given that a string contains non-ascii data, how would you know whether it's utf8 or something else? Here, the Encode module in Perl 5.8 would provide the best means for solving this -- though I do not agree with chromatic's suggestion (I'll reply to that separately).

    If you are using 5.8, and want to test whether an arbitrary string value contains valid utf8 data, do it like this:

    use Encode;
    ...
    # load the string into $_, then:
    my $test;
    eval "\$test = decode( 'utf8', \$_, Encode::FB_CROAK )";
    if ( $@ ) {
        # Encode would fail/die if $_ was not a valid utf8 string
    }
    But in this case, bear in mind that every pure-ascii string constitutes a valid utf8 string -- so the first test mentioned above (testing for non-ascii characters) would still be needed, if you have to know the answer to both questions (1 and 2 above).
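    Putting the two tests together, something along these lines would answer both questions at once (untested sketch -- the sub name "classify" and the three labels are just for illustration):

        use Encode;

        sub classify {
            my ( $string ) = @_;
            return 'ascii' unless $string =~ /[^\x00-\x7f]/;   # question 1: any non-ascii bytes?
            my $ok = eval { decode( 'utf8', $string, Encode::FB_CROAK ); 1 };
            return $ok ? 'utf8' : 'other';                     # question 2: is it valid utf8?
        }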

    If you don't have Perl 5.8 (hence no Encode module), there may be some means to check for valid utf8 content using tools in 5.6, but I'm not personally familiar with the unicode support in that version.

    The definition of "valid utf8" is of course quite specific, and it wouldn't be hard to roll your own test for it, even with pre-unicode versions of Perl. Basically, in order to qualify as utf8, a string must either be completely ascii (nothing has high-bit set), or else the bytes with high-bit set come in pairs or triplets, and can be checked as follows:

    my @bytes = unpack( "C*", $_ );  # break string into bytes
    my $widec = "";                  # accumulate valid utf8 bytes here
    my $width = 0;                   # keep track of how many bytes to accumulate
    for my $b ( @bytes ) {
        if (( $b & 0xf0 ) == 0xe0 or  # high 4 bits are 1110
            ( $b & 0xe0 ) == 0xc0 )   # high 3 bits are 110
        {   # either condition represents the start of a multibyte-char
            die "Bad byte sequence\n" if ( $width );
            $width = (( $b & 0xe0 ) == 0xe0 ) ? 3 : 2;
            $widec .= chr( $b );
        }
        elsif (( $b & 0xc0 ) == 0x80 )  # high 2 bits are 10
        {   # this should be a continuation of a multibyte-char
            die "Bad byte sequence\n" unless ( $width );
            $widec .= chr( $b );
        }
        elsif (( $b & 0x80 ) == 0 )  # this is an ascii byte
        {   # cannot occur while assembling a multibyte-char
            die "Bad byte sequence\n" if ( $width );
            $width = 1;
            $widec = chr( $b );
        }
        else {
            die "Bad byte value\n";  # all four high-bits set
        }
        if ( length( $widec ) == $width ) {
            $width = 0;
            $widec = "";
        }
    }
    die "Incomplete multibyte char\n" if ( $width );
    # get here if the string was valid utf8
    There are probably more elegant or concise ways to lay out that logic, but that's the basic utf8 rule set in a nutshell. (The official Unicode Consortium spec actually covers 32-bit code points rendered in utf8, as well as the common 16-bit code points, and the above logic probably doesn't get that part right -- but your chances of encountering a 32-bit code point in utf8 are pretty much nil, I think.)
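    For what it's worth, the extra branch that loop would need in order to accept 4-byte sequences (lead byte 11110xxx, i.e. code points above U+FFFF) would look roughly like this, inserted before the final "else" (untested sketch):

        elsif (( $b & 0xf8 ) == 0xf0 )  # high 5 bits are 11110
        {   # start of a 4-byte char
            die "Bad byte sequence\n" if ( $width );
            $width = 4;
            $widec .= chr( $b );
        }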

    There is also the plausible chance that a string containing some other multi-byte encoding, such as GB, Big5, KSC, etc, (or just random binary data) might pass this particular utf8 test -- and when treated as utf8 data, it would produce gibberish. Let's hope you don't end up there...

    update: I see that you already caught the problem with chromatic's initial suggestion; ++ for that! It also leads me to think that you do have access to perl 5.8, so you can ignore my home-grown utf8-validity check. Naturally, the Encode module will not only validate that the byte sequence is plausible for utf8 data, but will also know whether or not a multibyte sequence maps to a defined code point, which is an important added feature.

Re: How to know if a string has UTF-8?
by chromatic (Archbishop) on May 28, 2003 at 07:33 UTC

    See is_utf8() in Encode, if you're using Perl 5.8.0 or newer.

      Sorry: not only does "is_utf8" miss the mark in this case, but it looks risky based on how it's described in the Encode man page:
      Messing with Perl's internals
          The following API uses parts of Perl's internals in the current implementation. As such, they are efficient but may change.

          is_utf8(STRING [, CHECK])
              [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING. If CHECK is true, also checks the data in STRING for being well-formed UTF-8. Returns true if successful, false otherwise.
      Some folks get nervous when they hear about "messing with perl's internals"...
      This only checks the flag of the SCALAR, not the content (or at least the SCALAR needs to have its utf8 flag set on before it will check the content)! Note, I need to check whether the content is UTF-8, not whether the SCALAR is internally utf8.
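      For example (a quick sketch -- the byte string here is just an illustration):

          use Encode qw(is_utf8 decode);

          my $bytes = "\xE2\x98\xBA";         # a valid utf8 byte sequence (U+263A)
          print is_utf8($bytes) ? 1 : 0;      # 0 -- the SCALAR's utf8 flag is off
          my $chars = decode('utf8', $bytes);
          print is_utf8($chars) ? 1 : 0;      # 1 -- decode() turned the flag on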

      But thanks for the help.

      Graciliano M. P.
      "The creativity is the expression of the liberty".

Re: How to know if a string has UTF-8?
by PodMaster (Abbot) on May 28, 2003 at 16:17 UTC
    Give Encode::Guess a try.
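    Something like this (rough sketch -- the suspect list is just an example, and $data is whatever string you want to test):

        use Encode::Guess;

        my $enc = guess_encoding( $data, qw(latin1) );
        if ( ref $enc ) {
            print "guessed: ", $enc->name, "\n";
        }
        else {
            print "no clear guess: $enc\n";   # on failure $enc is an error message
        }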


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
    ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: How to know if a string has UTF-8?
by fglock (Vicar) on May 28, 2003 at 13:30 UTC

    Hi again, gmpassos :)

    This is what I use: (I didn't write this myself)

    # ValidUTF8 came from: http://people.netscape.com/ftang/utf8/isutf8.pl
    sub ValidUTF8 {
        local ( $utf8 ) = pop (@_);
        # first regex: the whole string must be a run of legal 1- to 6-byte sequences
        if ( $utf8 =~ /^(([\0-\x7F])|([\xC0-\xDF][\x80-\xBF])|([\xE0-\xEF][\x80-\xBF][\x80-\xBF])|([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])|([\xF8-\xFB][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF])|([\xFC-\xFE][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF]))*$/ ) {
            # second regex: reject overlong (non-shortest-form) encodings
            return ! ( $utf8 =~ /([\xC0-\xC1])|([\xE0][\x80-\x9F])|([\xF0][\x80-\x8F])|([\xF8][\x80-\x87])|([\xFC][\x80-\x83])/ );
        }
        else {
            return 0;
        }
    }
      The advantage of your function is that it works with Perl 5.6 too, since it checks groups of bytes. But there's no point in checking for UTF-8 on a Perl without Unicode support, since the user won't be able to generate and work with true UTF-8 data anyway, so I made this (please check if it's OK):
      sub _is_unicode {
        my ( $data ) = @_ ;   # the string to test

        if ( $] >= 5.008 ) {  # $] is 5.008 for Perl 5.8.0, so don't compare against 5.8
          ## string eval, so the \x{...} regex is never compiled on older perls.
          ## Note: "return" inside a string eval only returns from the eval itself,
          ## so we pass its value on:
          return 1 if eval q` if ( $data =~ /[\x{100}-\x{10FFFF}]/s ) { return 1 ;} ` ;
        }
        else {
          ## No Perl support for UTF-8! ;-/
          return undef ;
        }

        return undef ;
      }
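      For example (hypothetical calls, just to show the intended behaviour on 5.8):

          print _is_unicode("plain ascii")   ? "unicode\n" : "no\n";   # no
          print _is_unicode("wide \x{263A}") ? "unicode\n" : "no\n";   # unicode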
      I made this based on the table in the #Unicode_Encodings section of this POD: http://search.cpan.org/author/JHI/perl-5.8.0/pod/perlunicode.pod

      Note that when the user isn't on Perl 5.8 I don't want to, and actually can't, handle UTF-8, since the implementation there only handles UTF-8 as a group of bytes, not as characters. Note that I won't handle UTF-8 for Perl 5.7 (which isn't stable) either.

      Now I have a big problem. All the computers that I have here are based on extended ASCII (ISO-8859-1), and for us that is just what we need. But it's impossible to claim that you have Unicode support without really testing it on other platforms (EBCDIC, or a Japanese PC, for example). So, does anyone have some ideas, or a Japanese PC?

      Graciliano M. P.
      "The creativity is the expression of the liberty".

        I think the strategy for your "_is_unicode" sub should be fine (I haven't tried it yet, but probably will...) -- it has the nice feature of not requiring the Encode module (important if the script might be run with older versions of Perl), and handles both validation and "presence of non-ascii" questions in one swoop. However, I wonder if the condition inside the eval should be:
        if ( $data =~ /[\x{0080}-\x{FFFF}]/ ) {return 1;}
        Note that the "Latin1" characters (x0080-x00ff) are encoded using two bytes in utf8, and are officially "non-ascii". If you want to pass them through (because your current system can handle Latin1), you still need to convert them to iso-8859-1 (i.e. single-byte encoding) in order for your system to display them correctly -- back to the Encode module...

        (And I still don't think you need to worry about 32-bit code points -- not for a while.)

        As for providing an intelligible view of utf8 data on a single-byte Latin1 system, you could check out Text::Unidecode -- though it might not cover all languages in the manner you would prefer. Another thing to try would be a unicode/utf8-aware browser. There is also a free utility called "yudit" that runs on unix/linux and (I think) ms-windows too -- it's a utf8-aware text editor.
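        If you go the Text::Unidecode route, the interface is basically a single function, something like this (sketch -- $chars is assumed to be an already-decoded character string):

            use Text::Unidecode;

            print unidecode( $chars );   # pure-ascii approximation of the text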