Re^3: A UTF8 round trip with MySQL

My use of encode is solely for OUTPUTting the results to the console.

I noticed that. I just wanted to make it clear to the readers that DBD::mysql does the right thing when retrieving utf-8 data, and the programmer doesn't need to do anything special - provided (s)he makes sure utf-8 marked strings are handled correctly on output.

My problem is more with using encode_utf8() to output utf-8 text to handles. It's a subtle issue, but since encode_utf8 returns unmarked octets, the result must be treated as binary data; it cannot safely be used as a text string. For one thing, appending a utf-8 marked string or an 8bit latin-1 string to an unmarked utf-8 string causes (possibly irreversible) mangling.
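
To make the mangling concrete, here is a minimal sketch in plain Perl (no database involved). The variable names and sample strings are made up for illustration. It appends a utf-8 marked string to the unmarked octets returned by encode_utf8() and shows that Perl treats each octet as a separate Latin-1 character.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# A character string containing U+00E9 (e with acute accent).
my $text = "caf\x{e9}";

# encode_utf8 returns octets: U+00E9 becomes the two bytes 0xC3 0xA9,
# and the utf-8 flag is off.
my $octets = encode_utf8($text);

# A marked character string; the euro sign (U+20AC) forces the flag on.
my $marked = " \x{20ac}5";

# Concatenation upgrades $octets as if it were Latin-1, so each of the
# two bytes becomes a character of its own (U+00C3 and U+00A9).
my $mixed = $octets . $marked;

printf "length of text:   %d\n", length $text;     # 4
printf "length of octets: %d\n", length $octets;   # 5
printf "length of mixed:  %d\n", length $mixed;    # 8, not the 7 you wanted
```

Concatenating $text (the character string) with $marked instead gives the expected 7 characters.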

If you're working with Unicode text, it's almost always better to have all utf-8 encoded strings marked and use the :utf8 IO layers; that way you won't have to worry about which encoding the strings are in while you're working with them.
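
A small sketch of the IO layer approach, assuming an in-memory filehandle stands in for the console or a file (the $buffer and $out names are only for illustration). The string stays a character string all the way through, and the layer does the encoding on output.

```perl
use strict;
use warnings;

# A character string: 7 characters, two of them non-ASCII.
my $text = "caf\x{e9} \x{20ac}5";

# Write through the :utf8 layer to an in-memory file.
my $buffer = '';
open(my $out, '>:utf8', \$buffer) or die "open: $!";
print {$out} $text;
close($out) or die "close: $!";

# $buffer now holds the encoded octets: e-acute takes 2 bytes,
# the euro sign takes 3.
printf "characters: %d, bytes written: %d\n", length $text, length $buffer;
```

For the console the equivalent would be binmode(STDOUT, ':utf8') before printing. The stricter :encoding(UTF-8) layer is an alternative to :utf8 if you also want validation.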

Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly.

I don't understand what issue you're referring to here. The issue of handling UTF-8 with MySQL has already been fixed, no?

As far as I know (I haven't tested 4.005 yet) trying to insert a $string into a utf-8 column will not work correctly if the $string is in the default 8-bit encoding with the high bit set (for instance, when $string is in Latin-1 with accented characters).

There's a fairly recent bug report about that on rt.cpan.org, and it seems that the issue might get fixed so you won't have to manually encode the input strings - DBD::mysql will then do the right thing automatically. (Note: rt is often unresponsive - if that link doesn't work, try again a bit later.)

A prerequisite for fixing that bug is that DBD::mysql knows what encoding the input strings are actually in, to prevent it from doing the 8bit -> utf-8 transformation twice (right now it blindly assumes they are utf-8). But the only way to tell is to check the utf-8 flag, which encode_utf8() does not set. utf8::upgrade() does more or less the same thing as encode_utf8() as far as the internal bytes go, but it might be a bit more efficient since it works in-place and doesn't need to create a new string when the input is 7bit ASCII, and it sets the utf8 flag correctly.

If DBD::mysql worked correctly, strings marked as utf-8 would work, and 8bit strings would work too; unmarked utf-8 strings would not. Currently only valid utf-8 encoded strings work, regardless of the utf-8 mark. In other words, make sure your strings are correctly marked AND utf-8 encoded. utf8::upgrade() does exactly that.
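
The difference between the two calls can be checked without a database. This is only a sketch with made-up variable names; it inspects the flag with utf8::is_utf8() and compares lengths, and says nothing about what any particular DBD::mysql version does with the result.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# An 8-bit string: U+00E1 (a with acute accent) stored as one byte, flag off.
my $latin1 = "\x{e1}ccent";

# encode_utf8 builds a new string of utf-8 octets and leaves the flag off.
my $octets = encode_utf8($latin1);

# utf8::upgrade works in place: the internal bytes become utf-8 and the
# flag is set, but the string still has the same characters.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

printf "octets:   flag=%d length=%d\n",
    utf8::is_utf8($octets) ? 1 : 0, length $octets;       # flag=0 length=7
printf "upgraded: flag=%d length=%d\n",
    utf8::is_utf8($upgraded) ? 1 : 0, length $upgraded;   # flag=1 length=6
print "upgraded still equals the original\n" if $upgraded eq $latin1;
```

So the upgraded string is correctly marked and still compares equal to the original text, while the encode_utf8() result is a different, longer string of octets.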

"What should it profit a man, if he should win a flame war, yet lose his cool?"

Re^4: A UTF8 round trip with MySQL by clinton (Priest) on Jun 13, 2007 at 11:01 UTC
it's almost always better to have all utf-8 encoded strings marked and use the :utf8 IO layers

Good point.

trying to insert a $string into a utf-8 column will not work correctly if the $string is in the default 8-bit encoding with the high bit set (for instance, when $string is in Latin-1 with accented characters)

I think I understand - is this what you mean? If we have:

`$string = 'latin-1 áccented strińg';`

The string is stored internally with bytes > 128, but without the UTF-8 flag turned on, but Perl still understands this string.

DBD::mysql does not recognise this as UTF-8 (because of the missing UTF-8 flag), so accented characters are stripped.

`utf8::upgrade($string)` turns on the flag.

DBD::mysql recognises this as UTF-8 and stores it correctly.

Would using ~~$string = Encode::decode_utf8($string)~~ `$string = Encode::decode('iso-8859-1',$string)` also work in this case?

Clint

Update - corrected decode
Re^5: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 11:28 UTC
The string is stored internally with bytes > 128, but without UTF-8 flag turned on, but Perl still understands this string.

Yes, because it's stored in the default 8bit encoding, probably Latin-1. This is assuming you're not using the utf8 pragma, and your script file really is in the default 8bit encoding.

DBD::mysql does not recognise this as UTF-8 (because missing UTF-8 flag), so accented characters are stripped.

No, DBD::mysql will -currently- assume the string is utf-8 anyway, but since it's actually latin-1 the mysql database will (in my experience) truncate the string at the first accented character. In other words, that value in the database will end up as "latin-1 ".

utf8::upgrade($string) turns on the flag

And it converts the string to utf8 first. At that point you're guaranteed that the internal encoding of $string is really utf-8. utf8::upgrade() is a no-op if the string already is flagged as utf-8, so you can always safely use it when your strings are correctly marked.

Would using $string = Encode::decode_utf8($string) also work in this case?

No, because the string isn't in utf8 but in the default 8bit encoding.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
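
A short sketch of why decode_utf8() is the wrong tool for Latin-1 input, next to decoding with the encoding the bytes are really in. The sample string and variable names are assumptions for illustration, and the behaviour shown is Encode's default CHECK mode.

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8);

# Latin-1 octets: a-acute is the single byte 0xE1, utf-8 flag off.
my $bytes = "latin-1 \x{e1}ccented";

# Decoding with the encoding the bytes are really in gives a marked
# character string with the same characters.
my $chars = decode('iso-8859-1', $bytes);

# Decoding as utf-8 trips over 0xE1, which does not start a valid
# sequence here; with the default CHECK the bad input is replaced
# by U+FFFD (the replacement character).
my $wrong = decode_utf8($bytes);

printf "iso-8859-1: flag=%d, same characters=%d\n",
    utf8::is_utf8($chars) ? 1 : 0, $chars eq $bytes ? 1 : 0;
printf "utf-8:      contains U+FFFD=%d\n", $wrong =~ /\x{FFFD}/ ? 1 : 0;
```

The accented character survives the iso-8859-1 decode and is lost in the utf-8 decode.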
Re^6: A UTF8 round trip with MySQL by clinton (Priest) on Jun 13, 2007 at 11:50 UTC
No, because the string isn't in utf8 but in the default 8bit encoding.

Sorry, that should have been `$string = Encode::decode('iso-8859-1',$string)`

From the Encode docs:

the UTF8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines).

However, that wouldn't work for `$string = 'ńá';` without a preceding `use utf8;` because it would, by default, be stored internally as Latin-1, and here you would need to `utf8::upgrade($string)`.

From the perlunicode docs:

By default, there is a fundamental asymmetry in Perl's unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1.

Have I got this right?

Clint
Re^7: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 12:07 UTC