in reply to Re^2: A UTF8 round trip with MySQL
in thread A UTF8 round trip with MySQL

My use of encode is solely for OUTPUTting the results to the console.

I noticed that. I just wanted to make it clear to the readers that DBD::mysql does the right thing when retrieving utf-8 data, and the programmer doesn't need to do anything special - provided (s)he makes sure utf-8 marked strings are handled correctly on output.

My problem is with more with using encode_utf8() to output utf-8 text to handles. It's a subtle issue, but since encode_utf8 returns unmarked octets it must be treated as binary data; the string can not safely be used as a text string. For one thing, appending a utf-8 marked string or an 8bit latin-1 string to an unmarked utf-8 string causes (possibly irreversible) mangling.

If you're working with Unicode text, it's almost always better to have all utf-8 encoded strings marked and use the :utf8 IO layers; that way you won't have to worry about which encoding the strings are in while you're working with it.

Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly.

I don't understand what issue you're referring to here. The issue of handling UTF-8 with MySQL has already been fixed, no?

As far as I know (i haven't tested 4.005 yet) trying to insert a $string into a utf-8 column will not work correctly if the $string is in the default 8-bit encoding with the high bit set (for instance, when $string is in Latin-1 with accented characters).

There's a fairly recent bug-report on that on rt.cpan.org and it seems that the issue might get fixed so you won't have to manually encode the input strings - dbd-mysql will then do the right thing automatically. (note: rt is often unresponsive - if that link doesn't work, try again a bit later).

A prerequisite for fixing that bug is that DBD-mysql knows what encoding the input strings are actually in, to prevent it from doing the 8bit -> utf-8 transformation twice (right now it blindly assumes they are utf-8). But the only way to tell is to check the utf-8 flag, which encode_utf8() does not set. utf8::upgrade() does more or less the same thing as encode_utf8(), but might be a bit more efficient since it doesn't need to create a new string when the input string is 7bit ASCII (upgrade works in-place) and it set the utf8 flag correctly.

If dbd-mysql would work correctly, utf-8 strings marked as utf-8 will work, and 8bit strings will work too. Unmarked utf-8 strings won't work. Currently only valid utf-8 encoded strings work, regardless of the utf-8 mark. In other words, make sure your strings are correctly marked AND utf-8 encoded. utf8::upgrade() does exactly that.

Replies are listed 'Best First'.
Re^4: A UTF8 round trip with MySQL
by clinton (Priest) on Jun 13, 2007 at 11:01 UTC
    it's almost always better to have all utf-8 encoded strings marked and use the :utf8 IO layers

    Good point

    trying to insert a $string into a utf-8 column will not work correctly if the $string is in the default 8-bit encoding with the high bit set (for instance, when $string is in Latin-1 with accented characters)

    I think I understand - is this what you mean:

    • If we have : $string = 'latin-1 áccented striñg';
    • The string is stored internally with bytes > 128, but without UTF-8 flag turned on, but Perl still understands this string.
    • DBD::mysql does not recognise this as UTF-8 (because missing UTF-8 flag, so accented characters are stripped.
    • utf8::upgrade($string) turns on the flag
    • DBD::mysql recognises this as UTF-8 and stores it correctly

    Would using $string = Encode::decode_utf8($string) $string = Encode::decode('iso-8859-1',$string) also work in this case?

    Clint

    Update - corrected decode

      The string is stored internally with bytes > 128, but without UTF-8 flag turned on, but Perl still understands this string.

      Yes, because it's stored in the default 8bit encoding, probably Latin-1. This is assuming you're not using the utf8 pragma, and your script file really is in the default 8bit encoding.

      DBD::mysql does not recognise this as UTF-8 (because missing UTF-8 flag, so accented characters are stripped.
      No, dbd::mysql will -currently- assume the string is utf-8 anyway, but since it's actually latin-1 the mysql database will (in my experience) truncate the string at the first accented character. In other words, that value in the database will end up as "latin-1 "

      utf8::upgrade($string) turns on the flag
      And it converts the string to utf8 first. At that point you're guaranteed that the internal encoding of $string is really utf-8. utf8::upgrade() is a no-op if the string already is flagged as utf-8, so you can always safely use it when your strings are correctly marked.
      Would using $string = Encode::decode_utf8($string) also work in this case?
      No, because the string isn't in utf8 but in the default 8bit encoding.

        No, because the string isn't in utf8 but in the default 8bit encoding.
        Sorry, that should have been $string = Encode::decode('iso-8859-1',$string)

        From the Encode docs:

        the UTF8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines).

        However, that wouldn't work for $string = 'ñá'; without a preceeding use utf8; because it would, by default be stored internally as Latin-1, and here you would need to utf8::upgrade($string).

        From the perlunicode docs:

        By default, there is a fundamental asymmetry in Perl's unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in *ISO 8859-1 (Latin-1)*, but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1.

        Have I got this right?

        Clint