in reply to Why does perl not mark variable as utf-8?

$a isn't marked as utf-8 because perl has no need to encode it in utf-8: the character fits in a single byte, so perl keeps the string in its 8-bit (latin-1) form.

Also some of the comments in your tests look suspect:

is( $a, $b, '$a eq $b after utf8 upgrade of $a' );
But $a isn't upgraded anywhere.

ok( Encode::is_utf8($b), '$x is utf-8' );
You're testing $b, not $x.

Encode::_utf8_off( $b );
# ...
is( $a, $b, '$a and $b are byte same' );
_utf8_off does not alter the encoding; it just switches off the utf-8 flag. So $a and $b should not be the same: $a is latin-1, while $b holds the utf-8 bytes of the same character but is now treated as latin-1.

That last test is the only one that fails on my machine with perl 5.8.8, and as far as I can see, it's the only one that *should* fail.
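
The mismatch described above can be reproduced in isolation. This is only a minimal sketch: $x and $y are illustrative names rather than the variables from the original tests, and utf8::upgrade() stands in for whatever upgraded the string there.

```perl
use strict;
use warnings;
use Encode ();

# $x and $y are hypothetical names chosen for this sketch.
my $x = "\x{ae}";      # one character, kept in perl's 8-bit form
my $y = "\x{ae}";
utf8::upgrade($y);     # same character, now stored as utf-8 with the flag on

# Same characters, so the strings compare equal at this point.
print "equal before _utf8_off: ", ($x eq $y ? "yes" : "no"), "\n";

Encode::_utf8_off($y); # clears the flag only; the stored bytes are untouched

# $y is now read as two separate 8-bit characters instead of one.
print "length of \$y: ", length($y), "\n";
print "equal after _utf8_off: ", ($x eq $y ? "yes" : "no"), "\n";
```

The point of the sketch is that the comparison changes even though no byte of $y was rewritten.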

A correct way to ensure a string really is utf-8 encoded (assuming it's already flagged correctly as either utf-8 or latin-1) is to use utf8::upgrade().
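
As a rough illustration of what utf8::upgrade() changes and what it leaves alone, the following sketch compares a string before and after upgrading. The variable names are made up for the example.

```perl
use strict;
use warnings;

# $orig and $up are hypothetical names for this sketch.
my $orig = "ABC\x{ae}XYZ";   # 8-bit string, flag off
my $up   = $orig;
utf8::upgrade($up);          # switch the internal storage to utf-8

# Only the internal representation differs; the characters are the same.
print "orig flagged: ", (utf8::is_utf8($orig) ? "yes" : "no"), "\n";
print "up flagged:   ", (utf8::is_utf8($up)   ? "yes" : "no"), "\n";
print "still equal:  ", ($orig eq $up ? "yes" : "no"), "\n";
print "same length:  ", (length($orig) == length($up) ? "yes" : "no"), "\n";
```

Because the upgraded copy still compares equal to the original, upgrading is safe to apply to text that is already correctly flagged.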

See also A UTF8 round trip with MySQL - but read the whole thread!

Re^2: Why does perl not mark variable as utf-8?
by Anonymous Monk on Aug 29, 2007 at 13:51 UTC
    Joost,

    Thanks for all the help. I understand what you have said and the other node you referenced was really helpful. It's unfortunate that DBD can't call utf8::upgrade() for me as appropriate. That's what I was afraid of and hoping to avoid.

    Do you know if there is an alternative way to write:

    $a = "\x{ae}";
    so that perl stores it as utf8 and it is sent to my database as such? That is, other than calling utf8::upgrade()?

    Again, thanks a million Joost!

      I believe that if use utf8 is in scope, all literals are marked as utf-8, but note that this only applies to string literals. If you're reading strings from anywhere else, you still need to make sure they're upgraded in some other way.

      By the way, the reason DBD::mysql currently doesn't upgrade all input is that it's not immediately clear which data/columns are utf-8 text (and should be upgraded) and which are non-utf8 text or binary data (and must be left alone).

        Is that true? DBD::mysql apparently knows which columns to mark on the way out.
        "When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary."
        Is that because of column metadata that is provided with the results of a select?
      Do you know if there is an alternative way to write:
      $a = "\x{ae}";

      Just for comparison, a couple of ways to do and not do it:

      use utf8;
      use Encode;

      # for values below 0x100, the chars will always be 8-bit
      # ("use utf8;" doesn't help here)
      $s = "ABC\x{ae}XYZ";
      info(1, $s);

      # would've been nice, but doesn't work either
      $s = "ABC\x{00ae}XYZ";
      info(2, $s);

      # ...same problem with single chars
      $c = chr(0xae);
      info(3, $c);
      $c = chr(0x00ae);
      info(4, $c);

      # works for a single char
      $c = pack("U", 0x00ae);
      info(5, $c);

      # ...but gets a little unwieldy for strings
      $s = pack("U*", unpack("C*", "ABC\x{ae}XYZ"));
      info(6, $s);

      # works -- recommended
      $s = Encode::decode("iso-8859-1", "ABC\x{ae}XYZ");
      info(7, $s);

      # produces a UTF-8 sequence, but with utf8 flag turned off
      $s = Encode::encode("utf-8", "ABC\x{ae}XYZ");
      info(8, $s);

      # works
      $s = "ABC\x{ae}XYZ";
      utf8::upgrade($s);
      info(9, $s);

      # like upgrade(), but with utf8 flag turned off
      $s = "ABC\x{ae}XYZ";
      utf8::encode($s);
      info(10, $s);

      # doesn't work (is not supposed to... just for comparison)
      $s = "ABC\x{ae}XYZ";
      utf8::decode($s);
      info(11, $s);

      # doesn't work - DON'T EVER DO THAT
      $s = "ABC\x{ae}XYZ";
      Encode::_utf8_on($s);
      info(12, $s);

      sub info {
          my ($n, $s) = @_;
          printf "%2d: ", $n;
          print join(" ",unpack("(A2)*", unpack("H*",$s))), # hexdump
                "\t--> is ", utf8::is_utf8($s) ? "":"not ", "utf8\n";
      }

      prints:

       1: 41 42 43 ae 58 59 5a --> is not utf8
       2: 41 42 43 ae 58 59 5a --> is not utf8
       3: ae --> is not utf8
       4: ae --> is not utf8
       5: c2 ae --> is utf8
       6: 41 42 43 c2 ae 58 59 5a --> is utf8
       7: 41 42 43 c2 ae 58 59 5a --> is utf8
       8: 41 42 43 c2 ae 58 59 5a --> is not utf8 # wrong
       9: 41 42 43 c2 ae 58 59 5a --> is utf8
      10: 41 42 43 c2 ae 58 59 5a --> is not utf8 # wrong
      11: 41 42 43 ae 58 59 5a --> is not utf8
      12: 41 42 43 ae 58 59 5a --> is utf8 # WRONG

      Confused? ;)

      As for use utf8;, it would only work if you've written your string literals in UTF-8 (not \x{...}), i.e. if you've been using a UTF-8 editor to compose the script...
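
To make that last point concrete, here is a small sketch of the same literal compiled with and without the pragma. It assumes the script file itself is saved as UTF-8, and the variable names are invented for the example.

```perl
use strict;
use warnings;

# $with and $without are hypothetical names for this sketch.
# Assumption: this source file is saved in UTF-8.
my ($with, $without);
{
    use utf8;          # literal is read as UTF-8 text
    $with = "®";
}
{
    no utf8;           # literal is read as raw bytes
    $without = "®";
}

# With the pragma the literal is a single character; without it,
# perl sees the individual bytes of the UTF-8 sequence.
print "with use utf8:    ", length($with), " character(s)\n";
print "without use utf8: ", length($without), " character(s)\n";
```

A literal written as "\x{ae}" is unaffected by the pragma, which is why the cases numbered 1 and 2 in the listing above come out as 8-bit strings.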