in reply to Re: Why does perl not mark variable as utf-8?
in thread Why does perl not mark variable as utf-8?

Joost,

Thanks for all the help. I understand what you have said and the other node you referenced was really helpful. It's unfortunate that DBD can't call utf8::upgrade() for me as appropriate. That's what I was afraid of and hoping to avoid.

Do you know if there is an alternative way to write:

$a = "\x{ae}";
so that perl stores it as utf8 and it is sent to my database as such? That is, other than calling utf8::upgrade()?

Again, thanks a million Joost!

Re^3: Why does perl not mark variable as utf-8?
by Joost (Canon) on Aug 29, 2007 at 14:20 UTC
    I believe that if use utf8 is in scope, all string literals are marked as utf-8, but note that this only applies to string literals. If you're reading strings from anywhere else, you still need to make sure they're upgraded in some other way.
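    For strings that come from outside the script, one option is to decode them on the way in. The following is only a minimal sketch: the in-memory filehandle and the sample bytes are assumptions standing in for a real file or socket, and the :encoding(UTF-8) layer assumes the incoming data really is UTF-8.

```perl
use strict;
use warnings;

# Assumption: these UTF-8 encoded bytes stand in for data read from a file or socket.
my $bytes = "ABC\xc2\xaeXYZ\n";

# Reading through an :encoding layer decodes the bytes into characters.
open(my $fh, '<:encoding(UTF-8)', \$bytes)
    or die "cannot open in-memory handle: $!";
my $line = <$fh>;
close($fh);
chomp $line;

# Report the character count and whether perl marks the string as utf8.
printf "%d characters, %sutf8\n",
    length($line), utf8::is_utf8($line) ? "" : "not ";
```

    The point of the sketch is that the two bytes c2 ae arrive as one character (U+00AE), so the string has the same length as the literal in the question.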

    By the way, the reason DBD::mysql currently doesn't upgrade all input is that it's not immediately clear which data/columns are utf-8 text (and should be upgraded) and which are non-utf8 text or binary data (and must be left alone).
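    To illustrate why the driver cannot simply upgrade everything, here is a rough sketch of doing the selection by hand. The helper upgrade_text_values and its list of text/binary markers are hypothetical and not part of DBI or DBD::mysql; the caller has to know which values are text.

```perl
use strict;
use warnings;

# Hypothetical helper: upgrade only the values the caller marks as text.
# $is_text is an array ref of booleans, one per value in @values.
sub upgrade_text_values {
    my ($is_text, @values) = @_;
    for my $i (0 .. $#values) {
        next unless $is_text->[$i] && defined $values[$i];
        utf8::upgrade($values[$i]);    # text: switch to the utf8 representation
    }
    return @values;                    # binary values are returned untouched
}

# First value is text, second is binary data (assumed sample bytes).
my @bind = upgrade_text_values([1, 0], "ABC\x{ae}XYZ", "\x{ff}\x{d8}\x{ff}");

for my $i (0 .. $#bind) {
    printf "value %d: %sutf8\n", $i, utf8::is_utf8($bind[$i]) ? "" : "not ";
}
```

    The resulting list could then be passed as bind values to a statement handle's execute(), assuming the driver sends the internal representation as described earlier in this thread.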

      Is that true? DBD::mysql apparently knows which columns to mark on the way out.
      "When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary."
      Is that because of column metadata that is provided with the results of a select?
Re^3: Why does perl not mark variable as utf-8?
by almut (Canon) on Aug 29, 2007 at 17:13 UTC
    Do you know if there is an alternative way to write:
    $a = "\x{ae}";

    Just for comparison, a couple of ways to do and not do it:

    use utf8;
    use Encode;

    # for values below 0x100, the chars will always be 8-bit
    # ("use utf8;" doesn't help here)
    $s = "ABC\x{ae}XYZ";
    info(1, $s);

    # would've been nice, but doesn't work either
    $s = "ABC\x{00ae}XYZ";
    info(2, $s);

    # ...same problem with single chars
    $c = chr(0xae);
    info(3, $c);
    $c = chr(0x00ae);
    info(4, $c);

    # works for a single char
    $c = pack("U", 0x00ae);
    info(5, $c);

    # ...but gets a little unwieldy for strings
    $s = pack("U*", unpack("C*", "ABC\x{ae}XYZ"));
    info(6, $s);

    # works -- recommended
    $s = Encode::decode("iso-8859-1", "ABC\x{ae}XYZ");
    info(7, $s);

    # produces a UTF-8 sequence, but with utf8 flag turned off
    $s = Encode::encode("utf-8", "ABC\x{ae}XYZ");
    info(8, $s);

    # works
    $s = "ABC\x{ae}XYZ";
    utf8::upgrade($s);
    info(9, $s);

    # like upgrade(), but with utf8 flag turned off
    $s = "ABC\x{ae}XYZ";
    utf8::encode($s);
    info(10, $s);

    # doesn't work (is not supposed to... just for comparison)
    $s = "ABC\x{ae}XYZ";
    utf8::decode($s);
    info(11, $s);

    # doesn't work - DON'T EVER DO THAT
    $s = "ABC\x{ae}XYZ";
    Encode::_utf8_on($s);
    info(12, $s);

    sub info {
        my ($n, $s) = @_;
        printf "%2d: ", $n;
        print join(" ", unpack("(A2)*", unpack("H*", $s))),  # hexdump
              "\t--> is ", utf8::is_utf8($s) ? "" : "not ", "utf8\n";
    }

    prints:

     1: 41 42 43 ae 58 59 5a      --> is not utf8
     2: 41 42 43 ae 58 59 5a      --> is not utf8
     3: ae                        --> is not utf8
     4: ae                        --> is not utf8
     5: c2 ae                     --> is utf8
     6: 41 42 43 c2 ae 58 59 5a   --> is utf8
     7: 41 42 43 c2 ae 58 59 5a   --> is utf8
     8: 41 42 43 c2 ae 58 59 5a   --> is not utf8   # wrong
     9: 41 42 43 c2 ae 58 59 5a   --> is utf8
    10: 41 42 43 c2 ae 58 59 5a   --> is not utf8   # wrong
    11: 41 42 43 ae 58 59 5a      --> is not utf8
    12: 41 42 43 ae 58 59 5a      --> is utf8       # WRONG

    Confused? ;)

    As for use utf8;, it would help only if you've written your string literals in UTF-8 (not as \x{...} escapes), i.e. if you've been using a UTF-8 editor to compose the script...
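    A small sketch of that difference, assuming the script file itself is saved as UTF-8 (the variable names are only for illustration):

```perl
use strict;
use warnings;
use utf8;    # tells perl the source file is UTF-8 encoded

my $literal = "ABC®XYZ";         # character typed directly into UTF-8 source
my $escaped = "ABC\x{ae}XYZ";    # same character written as a \x{...} escape

# Compare how perl marks each string internally.
printf "literal: %sutf8\n", utf8::is_utf8($literal) ? "" : "not ";
printf "escaped: %sutf8\n", utf8::is_utf8($escaped) ? "" : "not ";

# Both hold the same seven characters, whatever the internal representation.
printf "same characters: %s\n", $literal eq $escaped ? "yes" : "no";
```

    The two strings compare equal as characters; only the internal flag differs, which is what matters when a driver passes the internal bytes through unchanged.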