Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

In the code snippet below, why isn't $a marked as utf-8? The problem I am having is that my database expects me to send it valid utf8. If the string I'm sending it has wide characters as $b does, then everything is fine. If I send it the data in $a it gets truncated because of invalid utf8. (The connection to the DB socket is raw I guess, and so no automatic conversion to utf8 is done.)

Must I check everything I write to the DB with Encode::is_utf8 and decode anything that isn't marked as utf8?

I have read all the Perl Unicode related docs but haven't reached enlightenment. Any help understanding this issue is greatly appreciated. (This is perl 5.8.6)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use Data::Dumper;
use Test::More qw(no_plan);

binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

my $a = "\x{ae}";
my $b = "\x{ae}\x{2122}";

ok( ! Encode::is_utf8($a), '\x{ae} is not marked as utf-8' );
ok( Encode::is_utf8($b), '\x{ae}\x{2122} is marked as utf-8' );
ok( length $a == 1, '\x{ae} character length 1' );
ok( length $b == 2, '\x{ae}\x{2122} character length 2' );
ok( do { use bytes; length $a } == 1, '\x{ae} byte length 1' );
ok( do { use bytes; length $b } == 5, '\x{ae}\x{2122} byte length 5');

chop $b;
ok( Encode::is_utf8($b), '$b still utf-8 after chop' );
ok( length $b == 1, '$b character length 1 after chop' );
ok( do { use bytes; length $b } == 2, '$b byte length 2 after chop' );
is( $a, $b, '$a eq $b after utf8 upgrade of $a' );

my $x = decode('iso-8859-1', $a );
ok( Encode::is_utf8($b), '$x is utf-8' );
ok( $x eq $b );

Encode::_utf8_off( $b );
Encode::_utf8_off( $x );
ok( $x eq $b );
is( $a, $b, '$a and $b are byte same' );

Replies are listed 'Best First'.
Re: Why does perl not mark variable as utf-8?
by Joost (Canon) on Aug 29, 2007 at 12:10 UTC
    $a isn't marked as utf-8 because perl has no need to encode it in utf-8: its only character is below 0x100, so it fits in perl's default 8-bit (latin-1) representation.

    Also some of the comments in your tests look suspect:

    is( $a, $b, '$a eq $b after utf8 upgrade of $a' );
    But $a isn't upgraded anywhere.

    ok( Encode::is_utf8($b), '$x is utf-8' );
    You're testing $b, not $x.

    Encode::_utf8_off( $b );
    # ...
    is( $a, $b, '$a and $b are byte same' );
    _utf8_off does not alter the encoding; it only switches off the utf-8 flag. So $a and $b should not be the same: $a is latin-1, while $b is the same character in utf-8 but now marked as latin-1.

    That last test is the only one that fails on my machine with perl 5.8.8, and as far as I can see, it's the only one that *should* fail.
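
    To make the effect of _utf8_off concrete, here is a minimal sketch, assuming a perl with Encode installed. The variable names ($latin1, $flagged and so on) are made up for illustration and do not come from the original test script.

```perl
use strict;
use warnings;
use Encode ();

my $latin1  = "\x{ae}";            # one character, stored as one byte, flag off
my $flagged = "\x{ae}\x{2122}";    # the wide character forces utf-8 storage
chop $flagged;                     # now only U+00AE, but still stored as utf-8

# While the flag is on, eq compares characters, so the two strings match.
my $same_before = ($latin1 eq $flagged) ? 1 : 0;

# Switch the flag off: the two internal bytes (c2 ae) are left untouched
# and are now read as two separate 8-bit characters.
Encode::_utf8_off($flagged);

my $len_after  = length $flagged;
my $same_after = ($latin1 eq $flagged) ? 1 : 0;

print "equal before: $same_before, length after: $len_after, equal after: $same_after\n";
```

    The two strings compare equal while the flag is on and stop being equal once it is off, which is why only the last test in the original script fails.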

    A correct way to ensure a string really is utf-8 encoded (assuming it's already flagged correctly as either utf-8 or latin-1) is to use utf8::upgrade().
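
    A minimal sketch of what utf8::upgrade() does to a latin-1 string, using only core perl. The variable names are illustrative.

```perl
use strict;
use warnings;

my $s    = "\x{ae}";    # stored as a single latin-1 byte
my $copy = $s;          # untouched copy for comparison

my $flag_before = utf8::is_utf8($s) ? 1 : 0;

# Re-encode the internal buffer as utf-8 and set the flag.
# The string's characters are unchanged.
utf8::upgrade($s);

my $flag_after  = utf8::is_utf8($s) ? 1 : 0;
my $chars       = length $s;                      # length in characters
my $bytes       = do { use bytes; length $s };    # length of the internal buffer
my $still_equal = ($s eq $copy) ? 1 : 0;

print "flag: $flag_before -> $flag_after, chars: $chars, bytes: $bytes, equal: $still_equal\n";
```

    The string still has one character and still compares equal to the copy; only the internal storage (now two bytes) and the flag differ.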

    See also A UTF8 round trip with MySQL - but read the whole thread!

      Joost,

      Thanks for all the help. I understand what you have said and the other node you referenced was really helpful. It's unfortunate that DBD can't call utf8::upgrade() for me as appropriate. That's what I was afraid of and hoping to avoid.

      Do you know if there is an alternative way to write:

      $a = "\x{ae}";
      so that perl stores it as utf8 and it is sent to my database as such? That is, other than calling utf8::upgrade()?

      Again, thanks a million Joost!

        I believe that if use utf8 is in scope, all literals are marked as utf-8, but note that this only works on string literals. If you're reading strings from anywhere else, you still need to make sure they're upgraded in some other way.

        By the way, the reason DBD::mysql currently doesn't upgrade all input is that it's not immediately clear which data/columns are utf-8 text (and should be upgraded) and which are non-utf8 text or binary data (and must be left alone).
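
        Since only the caller knows which values are text, one option is to upgrade those values yourself before binding them. The sketch below assumes that; upgrade_text is a hypothetical helper, not part of DBI or DBD::mysql, and the execute call is shown only as a comment.

```perl
use strict;
use warnings;

# Hypothetical helper: returns utf-8 upgraded copies of the values
# that the caller declares to be text. The originals are not modified.
sub upgrade_text {
    my @copies = @_;
    utf8::upgrade($_) for @copies;
    return @copies;
}

my $name = "\x{ae}";          # text, stored as latin-1 internally
my $blob = "\xff\xd8\xff";    # binary data: must be left alone

my ($name_for_db) = upgrade_text($name);

# Hypothetical usage with a prepared statement handle $sth:
#   $sth->execute($name_for_db, $blob);

printf "name upgraded: %d, blob upgraded: %d\n",
    utf8::is_utf8($name_for_db) ? 1 : 0,
    utf8::is_utf8($blob)        ? 1 : 0;
```

        Working on copies keeps the rest of the program independent of how the strings are stored internally.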

        Do you know if there is an alternative way to write:
        $a = "\x{ae}";

        Just for comparison, a couple of ways to do and not do it:

        use utf8;
        use Encode;

        # for values below 0x100, the chars will always be 8-bit
        # ("use utf8;" doesn't help here)
        $s = "ABC\x{ae}XYZ";
        info(1, $s);

        # would've been nice, but doesn't work either
        $s = "ABC\x{00ae}XYZ";
        info(2, $s);

        # ...same problem with single chars
        $c = chr(0xae);
        info(3, $c);
        $c = chr(0x00ae);
        info(4, $c);

        # works for a single char
        $c = pack("U", 0x00ae);
        info(5, $c);

        # ...but gets a little unwieldy for strings
        $s = pack("U*", unpack("C*", "ABC\x{ae}XYZ"));
        info(6, $s);

        # works -- recommended
        $s = Encode::decode("iso-8859-1", "ABC\x{ae}XYZ");
        info(7, $s);

        # produces a UTF-8 sequence, but with utf8 flag turned off
        $s = Encode::encode("utf-8", "ABC\x{ae}XYZ");
        info(8, $s);

        # works
        $s = "ABC\x{ae}XYZ";
        utf8::upgrade($s);
        info(9, $s);

        # like upgrade(), but with utf8 flag turned off
        $s = "ABC\x{ae}XYZ";
        utf8::encode($s);
        info(10, $s);

        # doesn't work (is not supposed to... just for comparison)
        $s = "ABC\x{ae}XYZ";
        utf8::decode($s);
        info(11, $s);

        # doesn't work - DON'T EVER DO THAT
        $s = "ABC\x{ae}XYZ";
        Encode::_utf8_on($s);
        info(12, $s);

        sub info {
            my ($n, $s) = @_;
            printf "%2d: ", $n;
            print join(" ",unpack("(A2)*", unpack("H*",$s))), # hexdump
                "\t--> is ", utf8::is_utf8($s) ? "":"not ", "utf8\n";
        }

        prints:

         1: 41 42 43 ae 58 59 5a       --> is not utf8
         2: 41 42 43 ae 58 59 5a       --> is not utf8
         3: ae                         --> is not utf8
         4: ae                         --> is not utf8
         5: c2 ae                      --> is utf8
         6: 41 42 43 c2 ae 58 59 5a    --> is utf8
         7: 41 42 43 c2 ae 58 59 5a    --> is utf8
         8: 41 42 43 c2 ae 58 59 5a    --> is not utf8   # wrong
         9: 41 42 43 c2 ae 58 59 5a    --> is utf8
        10: 41 42 43 c2 ae 58 59 5a    --> is not utf8   # wrong
        11: 41 42 43 ae 58 59 5a       --> is not utf8
        12: 41 42 43 ae 58 59 5a       --> is utf8       # WRONG

        Confused? ;)

        As to using use utf8;, this would work only if you've written your string literals in UTF-8 (not \x{...}), i.e. if you've been using a UTF-8 editor to compose the script...
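
        A small sketch of that difference, assuming the script file itself is saved as UTF-8. The variable names are illustrative. It only prints lengths and flags, so no output layer is needed.

```perl
use strict;
use warnings;
use utf8;    # tells perl that this source file is written in UTF-8

# Literal typed directly in a UTF-8 editor: decoded into one character.
my $decoded     = "®";
my $decoded_len = length $decoded;

my $raw_len;
{
    no utf8;             # same literal, but now read as raw source bytes
    my $raw = "®";
    $raw_len = length $raw;    # the two bytes c2 ae count as two characters
}

# The \x{...} escape names a code point directly, independent of "use utf8".
my $escaped      = "\x{ae}";
my $escaped_flag = utf8::is_utf8($escaped) ? 1 : 0;
my $same_char    = ($decoded eq $escaped) ? 1 : 0;

print "decoded length: $decoded_len, raw length: $raw_len, ",
      "escaped flag: $escaped_flag, same character: $same_char\n";
```

        The pragma changes how perl reads the bytes of the source file; it does not change what a \x{...} escape means.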

Re: Why does perl not mark variable as utf-8?
by Gangabass (Vicar) on Aug 29, 2007 at 11:50 UTC

    Try to avoid using $a and $b as variable names, because they are special variables in Perl (sort uses them).