gmpassos has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem.

I need to know, in an automatic way, whether a string has UTF-8 characters or not!

This will be used to generate a document from a group of strings (Perl SCALARs), which is saved to disk in binary (bytes from 0 to 255). So, I need to say in the headers of the file whether it's in UTF-8 or not.

In other words, when the set of strings (SCALARs) is passed to the function that builds the doc, it needs to detect automatically whether any of them contain UTF-8.

Graciliano M. P.
"The creativity is the expression of the liberty".

Replies are listed 'Best First'.
Re: How to know if a string has UTF-8?
by graff (Chancellor) on May 28, 2003 at 08:57 UTC
    I know you're trying to be clear, but there is still something missing... Which of the following best matches your intended question:
    1. Your string contains either plain ascii or utf8, and you need to know which.
    2. Your string contains some sort of non-ascii data, and you need to know whether or not this data constitutes a valid utf8 string.
    Answering the first case is quite simple -- if you already happen to know that any non-ascii data is sure to be utf8 content, then the presence of any bytes with the high bit set is the only diagnosis you need, and the easiest, most portable way to do that would be:
    if ( /[^\x00-\x7f]/ ) # true if $_ contains any non-ascii character
    The second case is trickier: given that a string contains non-ascii data, how would you know whether it's utf8 or something else? Here, the Encode module in Perl 5.8 would provide the best means for solving this -- though I do not agree with chromatic's suggestion (I'll reply to that separately).

    If you are using 5.8, and want to test whether an arbitrary string value contains valid utf8 data, do it like this:

    use Encode;
    ...
    # load the string into $_, then:
    my $test;
    eval "\$test = decode( 'utf8', \$_, Encode::FB_CROAK )";
    if ( $@ ) {
        # Encode would fail/die if $_ was not a valid utf8 string
    }
    But in this case, bear in mind that every pure-ascii string constitutes a valid utf8 string -- so the first test mentioned above (testing for non-ascii characters) would still be needed, if you have to know the answer to both questions (1 and 2 above).
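    Putting the two tests together, something along these lines would answer both questions at once (untested sketch -- the sub name "classify" and the three labels are just for illustration):

        use Encode;

        sub classify {
            my ( $string ) = @_;
            return 'ascii' unless $string =~ /[^\x00-\x7f]/;   # question 1: any non-ascii bytes?
            my $ok = eval { decode( 'utf8', $string, Encode::FB_CROAK ); 1 };
            return $ok ? 'utf8' : 'other';                     # question 2: is it valid utf8?
        }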

    If you don't have Perl 5.8 (hence no Encode module), there may be some means to check for valid utf8 content using tools in 5.6, but I'm not personally familiar with the unicode support in that version.

    The definition of "valid utf8" is of course quite specific, and it wouldn't be hard to roll your own test for it, even with pre-unicode versions of Perl. Basically, in order to qualify as utf8, a string must either be completely ascii (nothing has high-bit set), or else the bytes with high-bit set come in pairs or triplets, and can be checked as follows:

    my @bytes = unpack( "C*", $_ );  # break string into bytes
    my $widec = "";                  # accumulate valid utf8 bytes here
    my $width = 0;                   # keep track of how many bytes to accumulate
    for my $b ( @bytes ) {
        if (( $b & 0xf0 ) == 0xe0 or  # high 4 bits are 1110
            ( $b & 0xe0 ) == 0xc0 )   # high 3 bits are 110
        {   # either condition represents the start of a multibyte-char
            die "Bad byte sequence\n" if ( $width );
            $width = (( $b & 0xe0 ) == 0xe0 ) ? 3 : 2;
            $widec .= chr( $b );
        }
        elsif (( $b & 0xc0 ) == 0x80 )  # high 2 bits are 10
        {   # this should be a continuation of a multibyte-char
            die "Bad byte sequence\n" unless ( $width );
            $widec .= chr( $b );
        }
        elsif (( $b & 0x80 ) == 0 )  # this is an ascii byte
        {   # cannot occur while assembling a multibyte-char
            die "Bad byte sequence\n" if ( $width );
            $width = 1;
            $widec = chr( $b );
        }
        else {
            die "Bad byte value\n";  # all four high-bits set
        }
        if ( length( $widec ) == $width ) {
            $width = 0;
            $widec = "";
        }
    }
    die "Incomplete multibyte char\n" if ( $width );
    # get here if the string was valid utf8
    There are probably more elegant or concise ways to lay out that logic, but that's the basic utf8 rule set in a nutshell. (The official Unicode Consortium spec actually covers 32-bit code points rendered in utf8, as well as the common 16-bit code points, and the above logic probably doesn't get that part right -- but your chances of encountering a 32-bit code point in utf8 are pretty much nil, I think.)
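    For what it's worth, the extra branch that loop would need in order to accept 4-byte sequences (lead byte 11110xxx, i.e. code points above U+FFFF) would look roughly like this, inserted before the final "else" (untested sketch):

        elsif (( $b & 0xf8 ) == 0xf0 )  # high 5 bits are 11110
        {   # start of a 4-byte char
            die "Bad byte sequence\n" if ( $width );
            $width = 4;
            $widec .= chr( $b );
        }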

    There is also the plausible chance that a string containing some other multi-byte encoding, such as GB, Big5, KSC, etc, (or just random binary data) might pass this particular utf8 test -- and when treated as utf8 data, it would produce gibberish. Let's hope you don't end up there...

    update: I see that you already caught the problem with chromatic's initial suggestion; ++ for that! It also leads me to think that you do have access to perl 5.8, so you can ignore my home-grown utf8-validity check. Naturally, the Encode module will not only validate that the byte sequence is plausible for utf8 data, but will also know whether or not a multibyte sequence maps to a defined code point, which is an important added feature.

Re: How to know if a string has UTF-8?
by chromatic (Archbishop) on May 28, 2003 at 07:33 UTC

    See is_utf8() in Encode, if you're using Perl 5.8.0 or newer.

      Sorry: not only does "is_utf8" miss the mark in this case, but it looks risky based on how it's described in the Encode man page:
      Messing with Perl's internals
          The following API uses parts of Perl's internals in the current implementation. As such, they are efficient but may change.

          is_utf8(STRING [, CHECK])
              [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING. If CHECK is true, also checks the data in STRING for being well-formed UTF-8. Returns true if successful, false otherwise.
      Some folks get nervous when they hear about "messing with perl's internals"...
      This only checks the flag of the SCALAR, not the content (or at least the SCALAR needs to have its utf8 flag set on before it will check the content)! Note, I need to check whether the content is UTF-8, not whether the SCALAR is internally utf8.
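      For example (a quick sketch -- the byte string here is just an illustration):

          use Encode qw(is_utf8 decode);

          my $bytes = "\xE2\x98\xBA";         # a valid utf8 byte sequence (U+263A)
          print is_utf8($bytes) ? 1 : 0;      # 0 -- the SCALAR's utf8 flag is off
          my $chars = decode('utf8', $bytes);
          print is_utf8($chars) ? 1 : 0;      # 1 -- decode() turned the flag on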

      But thanks for the help.

      Graciliano M. P.
      "The creativity is the expression of the liberty".

Re: How to know if a string has UTF-8?
by PodMaster (Abbot) on May 28, 2003 at 16:17 UTC
    Give Encode::Guess a try.
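    Something like this (rough sketch -- the suspect list is just an example, and $data is whatever string you want to test):

        use Encode::Guess;

        my $enc = guess_encoding( $data, qw(latin1) );
        if ( ref $enc ) {
            print "guessed: ", $enc->name, "\n";
        }
        else {
            print "no clear guess: $enc\n";   # on failure $enc is an error message
        }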


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
    ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: How to know if a string has UTF-8?
by fglock (Vicar) on May 28, 2003 at 13:30 UTC

    Hi again, gmpassos :)

    This is what I use: (I didn't write this myself)

    # ValidUTF8 came from: http://people.netscape.com/ftang/utf8/isutf8.pl
    sub ValidUTF8 {
        local ( $utf8 ) = pop (@_);
        # first regex: the whole string must be a run of legal 1- to 6-byte sequences
        if ( $utf8 =~ /^(([\0-\x7F])|([\xC0-\xDF][\x80-\xBF])|([\xE0-\xEF][\x80-\xBF][\x80-\xBF])|([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])|([\xF8-\xFB][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF])|([\xFC-\xFE][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF]))*$/ ) {
            # second regex: reject overlong (non-shortest-form) encodings
            return ! ( $utf8 =~ /([\xC0-\xC1])|([\xE0][\x80-\x9F])|([\xF0][\x80-\x8F])|([\xF8][\x80-\x87])|([\xFC][\x80-\x83])/ );
        }
        else {
            return 0;
        }
    }
      The advantage of your function is that it works with Perl 5.6 too, since it checks groups of bytes. But there's no point in checking for UTF-8 on a Perl without Unicode support, since the user won't be able to generate and work with true UTF-8 data anyway, so I made this (please check if it's OK):
      sub _is_unicode {
        my ( $data ) = @_ ;   # the string to test

        if ( $] >= 5.008 ) {  # $] is 5.008 for Perl 5.8.0, so don't compare against 5.8
          ## string eval, so the \x{...} regex is never compiled on older perls.
          ## Note: "return" inside a string eval only returns from the eval itself,
          ## so we pass its value on:
          return 1 if eval q` if ( $data =~ /[\x{100}-\x{10FFFF}]/s ) { return 1 ;} ` ;
        }
        else {
          ## No Perl support for UTF-8! ;-/
          return undef ;
        }

        return undef ;
      }
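      For example (hypothetical calls, just to show the intended behaviour on 5.8):

          print _is_unicode("plain ascii")   ? "unicode\n" : "no\n";   # no
          print _is_unicode("wide \x{263A}") ? "unicode\n" : "no\n";   # unicode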
      I made this based on the table in the #Unicode_Encodings section of this POD: http://search.cpan.org/author/JHI/perl-5.8.0/pod/perlunicode.pod

      Note that when the user isn't on Perl 5.8 I don't want to, and actually can't, handle UTF-8, since the implementation there only handles UTF-8 as a group of bytes, not as characters. Note that I won't handle UTF-8 for Perl 5.7 (which isn't stable) either.

      Now I have a big problem. All the computers that I have here are based on extended ASCII (ISO-8859-1), and for us that is just what we need. But it's impossible to claim that you have Unicode support without really testing it on other platforms (EBCDIC, or a Japanese PC, for example). So, does anyone have some ideas, or a Japanese PC?

      Graciliano M. P.
      "The creativity is the expression of the liberty".

        I think the strategy for your "_is_unicode" sub should be fine (I haven't tried it yet, but probably will...) -- it has the nice feature of not requiring the Encode module (important if the script might be run with older versions of Perl), and handles both validation and "presence of non-ascii" questions in one swoop. However, I wonder if the condition inside the eval should be:
        if ( $data =~ /[\x{0080}-\x{FFFF}]/ ) {return 1;}
        Note that the "Latin1" characters (x0080-x00ff) are encoded using two bytes in utf8, and are officially "non-ascii". If you want to pass them through (because your current system can handle Latin1), you still need to convert them to iso-8859-1 (i.e. single-byte encoding) in order for your system to display them correctly -- back to the Encode module...

        (And I still don't think you need to worry about 32-bit code points -- not for a while.)

        As for providing an intelligible view of utf8 data on a single-byte Latin1 system, you could check out Text::Unidecode -- though it might not cover all languages in the manner you would prefer. Another thing to try would be a unicode/utf8-aware browser. There is also a free utility called "yudit" that runs on unix/linux and (I think) ms-windows too -- it's a utf8-aware text editor.
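        If you go the Text::Unidecode route, the interface is basically a single function, something like this (sketch -- $chars is assumed to be an already-decoded character string):

            use Text::Unidecode;

            print unidecode( $chars );   # pure-ascii approximation of the text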