in reply to Re^3: A UTF8 round trip with MySQL
in thread A UTF8 round trip with MySQL

it's almost always better to have all utf-8 encoded strings marked and use the :utf8 IO layers

Good point

trying to insert a $string into a utf-8 column will not work correctly if the $string is in the default 8-bit encoding with the high bit set (for instance, when $string is in Latin-1 with accented characters)

I think I understand - is this what you mean:

Would using $string = Encode::decode('iso-8859-1',$string) also work in this case? (I originally wrote $string = Encode::decode_utf8($string) here.)

Clint

Update - corrected decode

Re^5: A UTF8 round trip with MySQL
by Joost (Canon) on Jun 13, 2007 at 11:28 UTC
    The string is stored internally with bytes > 127, but without the UTF-8 flag turned on, and Perl still understands this string.

    Yes, because it's stored in the default 8bit encoding, probably Latin-1. This is assuming you're not using the utf8 pragma, and your script file really is in the default 8bit encoding.

    DBD::mysql does not recognise this as UTF-8 (because the UTF-8 flag is missing), so accented characters are stripped.
    No, DBD::mysql will currently assume the string is UTF-8 anyway, but since it's actually Latin-1, the MySQL database will (in my experience) truncate the string at the first accented character. In other words, the value in the database will end up as "latin-1 ".

    utf8::upgrade($string) turns on the flag
    And it converts the string to utf8 first. At that point you're guaranteed that the internal encoding of $string is really utf-8. utf8::upgrade() is a no-op if the string already is flagged as utf-8, so you can always safely use it when your strings are correctly marked.
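To make the point about utf8::upgrade() concrete, here is a minimal sketch. It assumes the string really holds Latin-1 data, and it builds the string from explicit byte values (the variable names are mine, not from the thread) so the result does not depend on how the script file is encoded.

```perl
use strict;
use warnings;

# "\xF1\xE1" is "ñá" in Latin-1; written as escapes so the encoding of this
# source file does not matter.
my $string = "\xF1\xE1";
my $copy   = $string;

my $flag_before = utf8::is_utf8($string) ? 1 : 0;   # flag is off for this byte string

utf8::upgrade($string);      # re-encode internally as UTF-8 and set the flag
my $flag_after = utf8::is_utf8($string) ? 1 : 0;

# The characters are unchanged; only the internal representation differs.
my $same_chars = ($string eq $copy) ? 1 : 0;
my $length     = length $string;

utf8::upgrade($string);      # calling it again changes nothing
my $flag_again = utf8::is_utf8($string) ? 1 : 0;

printf "flag before=%d after=%d again=%d same=%d length=%d\n",
    $flag_before, $flag_after, $flag_again, $same_chars, $length;
```

From Perl's point of view the string compares equal before and after; what changes is the internal byte sequence, which is the part a driver that looks at the raw buffer would care about.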
    Would using $string = Encode::decode_utf8($string) also work in this case?
    No, because the string isn't in utf8 but in the default 8bit encoding.
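A small sketch of why decode_utf8() is the wrong tool here, assuming the input really is Latin-1 octets. The variable names are illustrative only. With the default lenient checking, malformed input is replaced by U+FFFD; with Encode::FB_CROAK the decode dies instead.

```perl
use strict;
use warnings;
use Encode ();

my $latin1_octets = "\xF1\xE1";   # "ñá" as Latin-1 octets, not valid UTF-8

# Lenient decode: the malformed sequence is replaced rather than rejected.
my $lenient         = Encode::decode_utf8($latin1_octets);
my $has_replacement = ($lenient =~ /\x{FFFD}/) ? 1 : 0;

# Strict decode: work on a copy, because a CHECK value may modify the source.
my $input         = $latin1_octets;
my $strict        = eval { Encode::decode('UTF-8', $input, Encode::FB_CROAK()) };
my $strict_failed = defined $strict ? 0 : 1;

print "lenient decode contains U+FFFD: $has_replacement\n";
print "strict decode failed: $strict_failed\n";
```

Either way the accented characters are lost, which matches the answer above: the octets have to be decoded as what they actually are.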

      No, because the string isn't in utf8 but in the default 8bit encoding.
      Sorry, that should have been $string = Encode::decode('iso-8859-1',$string)

      From the Encode docs:

      the UTF8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines).

      However, that wouldn't work for $string = 'ñá'; without a preceding use utf8; because it would, by default, be stored internally as Latin-1, and here you would need to utf8::upgrade($string).

      From the perlunicode docs:

      By default, there is a fundamental asymmetry in Perl's unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in *ISO 8859-1 (Latin-1)*, but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1.

      Have I got this right?

      Clint
        If your script is in latin-1, decode('iso-8859-1',$string) will work too. As far as I know decode() will always upgrade to utf8 (or ascii, which is a byte-compatible subset of utf-8)
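For the Latin-1 case, the two approaches can be compared directly. This is a sketch under the assumption that the input is Latin-1 octets; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xF1\xE1";     # Latin-1 octets for "ñá"

# Route 1: decode explicitly as ISO-8859-1.
my $decoded = Encode::decode('iso-8859-1', $octets);

# Route 2: let Perl upgrade the byte string, which also assumes Latin-1.
my $upgraded = $octets;
utf8::upgrade($upgraded);

my $equal        = ($decoded eq $upgraded) ? 1 : 0;
my $decoded_flag = utf8::is_utf8($decoded) ? 1 : 0;

# Encoding either result as UTF-8 gives the octets a utf-8 column expects.
my $utf8_octets = Encode::encode('UTF-8', $decoded);
my $hex         = join ' ', map { sprintf '%02X', ord } split //, $utf8_octets;

print "equal=$equal flag=$decoded_flag utf8 octets=$hex\n";
```

Both routes end with the same two characters; decode() has the advantage of naming the source encoding explicitly instead of relying on Perl's default assumption.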

        If $string = 'ñá'; is a literal in a utf-8 encoded script, you should use the utf8 pragma to set the utf-8 markers correctly on literals. And then decode('iso-8859-1') probably won't work correctly on it. But utf8::upgrade() will still work.
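The effect of the utf8 pragma on a literal can be seen by comparing lengths. This sketch assumes the file containing it is saved as UTF-8; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

# Without the pragma the literal is just the raw octets from the file.
my $bytes = 'ñá';

# With the pragma (lexically scoped) the literal is decoded to characters.
my $chars;
{
    use utf8;
    $chars = 'ñá';
}

my $byte_length = length $bytes;    # 4 when the file is UTF-8
my $char_length = length $chars;    # 2

# Decoding the raw octets as UTF-8 gives the same characters as the pragma.
my $decoded = Encode::decode('UTF-8', $bytes);
my $matches = ($decoded eq $chars) ? 1 : 0;

# The string from the "use utf8" block is already marked, so upgrade is a no-op.
my $before = $chars;
utf8::upgrade($chars);
my $unchanged = ($chars eq $before) ? 1 : 0;

print "bytes=$byte_length chars=$char_length matches=$matches unchanged=$unchanged\n";
```

If the script were saved as Latin-1 instead, the literal without the pragma would be two octets and the earlier Latin-1 advice would apply.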

        By default, there is a fundamental asymmetry in Perl's unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in *ISO 8859-1 (Latin-1)*, but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1.
        I don't know what "Unicode strings are downgraded with UTF-8 encoding" means. Also the line below that paragraph in perlunicode says

        If you wish to interpret byte strings as UTF-8 instead, use the "encoding" pragma: use encoding 'utf8';

        Don't believe it. You should write "use utf8;" instead; "use encoding 'utf8'" is broken.
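For data that comes from outside the script, one way to avoid the encoding pragma altogether is to decode and encode at the I/O boundary with layers, in line with the advice at the top of the thread. The sketch below uses in-memory filehandles so it runs without any files; the variable names and sample data are assumptions for illustration.

```perl
use strict;
use warnings;

my $utf8_octets = "\xC3\xB1\xC3\xA1\n";    # "ñá\n" encoded as UTF-8

# Reading through a decoding layer yields character strings.
open my $in, '<:encoding(UTF-8)', \$utf8_octets or die "open for read: $!";
my $line = <$in>;
close $in;
chomp $line;

my $char_length = length $line;                        # 2 characters
my $is_expected = ($line eq "\xF1\xE1") ? 1 : 0;       # U+00F1 U+00E1

# Writing through an encoding layer turns the characters back into octets.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $line;
close $out;

my $round_trip = ($buffer eq "\xC3\xB1\xC3\xA1") ? 1 : 0;

print "chars=$char_length expected=$is_expected round_trip=$round_trip\n";
```

With this arrangement every string inside the program is a character string, and the question of which encoding a byte string is in only comes up where data enters or leaves.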