Re^2: Why does perl not mark variable as utf-8?

Re^3: Why does perl not mark variable as utf-8? by Joost (Canon) on Aug 29, 2007 at 14:20 UTC
I believe that if `use utf8` is in scope, all literals are marked as utf-8, but note that this only works on string literals. If you're reading strings from anywhere else, you still need to make sure they're upgraded in some other way.

By the way, the reason DBD::mysql currently doesn't upgrade all input is that it's not immediately clear which data/columns are utf-8 text (and should be upgraded) and which are non-utf8 text or binary data (and must be left alone).

"What should it profit a man, if he should win a flame war, yet lose his cool?"
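To illustrate the point about strings that are not literals, here is a minimal sketch. It assumes the incoming bytes are known to be UTF-8 encoded; the byte string merely stands in for data read from a file, socket or database, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for bytes that arrived from outside the program
# (assumption: they are the UTF-8 encoding of "ABC", U+00AE, "XYZ").
my $bytes = "ABC\xc2\xaeXYZ";

# Decode explicitly; "use utf8" has no effect on data like this.
my $text = Encode::decode('UTF-8', $bytes);

printf "bytes: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "text:  length %d, utf8 flag %s\n",
    length($text), utf8::is_utf8($text) ? 'on' : 'off';
```

The undecoded string has length 8 (bytes), the decoded one has length 7 (characters) with the flag on.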
Re^4: Why does perl not mark variable as utf-8? by Anonymous Monk on Aug 29, 2007 at 14:59 UTC
Is that true? DBD::mysql apparently knows which columns to mark on the way out. "When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary." Is that because of column metadata that is provided with the results of a select?
Re^5: Why does perl not mark variable as utf-8? by Joost (Canon) on Aug 29, 2007 at 15:14 UTC
Yes, that's exactly it; when retrieving a result set, there's enough metadata to determine whether the columns are utf-8 text or binary, but when executing a query it's hard to figure that out. It may be possible, but I gave it a try and couldn't find a good way to do it in the limited time I had.

See also http://rt.cpan.org/Ticket/Display.html?id=25590
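Where the driver cannot tell text from binary, the caller who knows the schema can make the distinction by hand. The following is only a sketch of that idea: `decode_text_columns` is a hypothetical helper, not anything DBD::mysql provides, and the row and its column names are invented. It assumes the text columns hold UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of DBD::mysql): decode only the columns
# the caller declares to be UTF-8 text and leave all others untouched.
sub decode_text_columns {
    my ($row, @text_columns) = @_;
    my %decoded = %$row;                 # shallow copy, original row is kept
    for my $col (@text_columns) {
        next unless defined $decoded{$col};
        $decoded{$col} = Encode::decode('UTF-8', $decoded{$col});
    }
    return \%decoded;
}

# Made-up row: "name" holds UTF-8 text, "thumbnail" holds binary data.
my $row   = { name => "Caf\xc3\xa9", thumbnail => "\xff\xd8\xff\xe0" };
my $fixed = decode_text_columns($row, 'name');

printf "name: %d chars, thumbnail: %d bytes\n",
    length($fixed->{name}), length($fixed->{thumbnail});
```

Only `name` is decoded (4 characters); `thumbnail` keeps its 4 raw bytes.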
Re^3: Why does perl not mark variable as utf-8? by almut (Canon) on Aug 29, 2007 at 17:13 UTC
Do you know if there is an alternative way to write: `$a = "\x{ae}";`

Just for comparison, a couple of ways to do and not do it:

    use utf8;
    use Encode;

    # for values below 0x100, the chars will always be 8-bit
    # ("use utf8;" doesn't help here)
    $s = "ABC\x{ae}XYZ";
    info(1, $s);

    # would've been nice, but doesn't work either
    $s = "ABC\x{00ae}XYZ";
    info(2, $s);

    # ...same problem with single chars
    $c = chr(0xae);
    info(3, $c);
    $c = chr(0x00ae);
    info(4, $c);

    # works for a single char
    $c = pack("U", 0x00ae);
    info(5, $c);

    # ...but gets a little unwieldy for strings
    $s = pack("U*", unpack("C*", "ABC\x{ae}XYZ"));
    info(6, $s);

    # works -- recommended
    $s = Encode::decode("iso-8859-1", "ABC\x{ae}XYZ");
    info(7, $s);

    # produces a UTF-8 sequence, but with utf8 flag turned off
    $s = Encode::encode("utf-8", "ABC\x{ae}XYZ");
    info(8, $s);

    # works
    $s = "ABC\x{ae}XYZ";
    utf8::upgrade($s);
    info(9, $s);

    # like upgrade(), but with utf8 flag turned off
    $s = "ABC\x{ae}XYZ";
    utf8::encode($s);
    info(10, $s);

    # doesn't work (is not supposed to... just for comparison)
    $s = "ABC\x{ae}XYZ";
    utf8::decode($s);
    info(11, $s);

    # doesn't work - DON'T EVER DO THAT
    $s = "ABC\x{ae}XYZ";
    Encode::_utf8_on($s);
    info(12, $s);

    # print the string's bytes as a hexdump, plus the state of its utf8 flag
    sub info {
        my ($n, $s) = @_;
        printf "%2d: ", $n;
        print join(" ", unpack("(A2)*", unpack("H*", $s))),  # hexdump
              "\t--> is ", utf8::is_utf8($s) ? "" : "not ", "utf8\n";
    }

prints:

     1: 41 42 43 ae 58 59 5a     --> is not utf8
     2: 41 42 43 ae 58 59 5a     --> is not utf8
     3: ae                       --> is not utf8
     4: ae                       --> is not utf8
     5: c2 ae                    --> is utf8
     6: 41 42 43 c2 ae 58 59 5a  --> is utf8
     7: 41 42 43 c2 ae 58 59 5a  --> is utf8
     8: 41 42 43 c2 ae 58 59 5a  --> is not utf8   # wrong
     9: 41 42 43 c2 ae 58 59 5a  --> is utf8
    10: 41 42 43 c2 ae 58 59 5a  --> is not utf8   # wrong
    11: 41 42 43 ae 58 59 5a     --> is not utf8
    12: 41 42 43 ae 58 59 5a     --> is utf8       # WRONG

Confused? ;)

As to using `use utf8;`, this would work only if you've written your string literals in UTF-8 (not `\x{...}`), i.e. if you've been using a UTF-8 editor to compose the script...
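One way to see why cases 8 and 10 are marked "wrong" while case 9 is fine is to compare the strings as Perl sees them, rather than their internal bytes. This is a small sketch using only core functions; the variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $native   = "ABC\x{ae}XYZ";   # 8-bit internal form, flag off
my $upgraded = "ABC\x{ae}XYZ";
utf8::upgrade($upgraded);        # same characters, UTF-8 internal form
my $encoded  = "ABC\x{ae}XYZ";
utf8::encode($encoded);          # a different string: the UTF-8 bytes

for my $pair ([native => $native], [upgraded => $upgraded], [encoded => $encoded]) {
    my ($name, $s) = @$pair;
    printf "%-8s length %d, utf8 flag %s, eq native: %s\n",
        $name, length($s),
        utf8::is_utf8($s) ? 'on'  : 'off',
        $s eq $native     ? 'yes' : 'no';
}
```

The upgraded string still has 7 characters and compares equal to the original; the encoded one has 8 and does not, because it now holds two separate characters (0xc2, 0xae) where there used to be one.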