Re^6: A UTF8 round trip with MySQL

No, because the string isn't in utf8 but in the default 8bit encoding.

Sorry, that should have been $string = Encode::decode('iso-8859-1',$string)

From the Encode docs:

the UTF8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines).
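A minimal sketch of what that sentence describes, assuming a hypothetical sample value ("café" as four ISO-8859-1 octets) rather than anything from the thread. It uses only Encode::decode and utf8::is_utf8. The pure-ASCII case is printed without an expected value, since the flag's state there may depend on the installed Encode version.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample data: "caf\xE9" is "café" as four ISO-8859-1 octets.
my $octets = "caf\xE9";
my $string = Encode::decode('iso-8859-1', $octets);

# Non-ASCII input: the result is a character string with the UTF8 flag set.
printf "length=%d flag=%s last=U+%04X\n",
    length($string),
    (utf8::is_utf8($string) ? 'on' : 'off'),
    ord(substr($string, -1));

# Pure-ASCII input: report whatever the installed Encode does with the flag.
my $ascii_octets = 'cafe';
my $ascii        = Encode::decode('iso-8859-1', $ascii_octets);
printf "ascii flag=%s\n", (utf8::is_utf8($ascii) ? 'on' : 'off');
```

Either way the decoded string compares equal to the original characters; the flag only records how Perl stores them internally.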

However, that wouldn't work for $string = 'ńá'; without a preceding use utf8; because it would, by default, be stored internally as Latin-1, and here you would need to utf8::upgrade($string).
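One way to probe this is to compare utf8::upgrade() with an explicit decode on the same bytes. The sketch below assumes the script file is UTF-8 encoded, so that without use utf8; the literal 'ńá' arrives as the four octets written out here with escapes. utf8::upgrade() only changes the internal representation, so the string still holds four characters afterwards, while Encode::decode('UTF-8', ...) yields the two intended characters.

```perl
use strict;
use warnings;
use Encode ();

# Assumed input: the four UTF-8 octets a UTF-8 source file holds for 'ńá'
# (U+0144 U+00E1) when no "use utf8;" is in effect.
my $bytes = "\xC5\x84\xC3\xA1";

# utf8::upgrade() changes the internal storage only, not the characters.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# decode() interprets the octets as UTF-8 and returns two characters.
my $decoded = Encode::decode('UTF-8', $bytes);

printf "upgraded: length=%d same_as_bytes=%s\n",
    length($upgraded), ($upgraded eq $bytes ? 'yes' : 'no');
printf "decoded:  length=%d first=U+%04X\n",
    length($decoded), ord($decoded);
```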

From the perlunicode docs:

By default, there is a fundamental asymmetry in Perl's unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in *ISO 8859-1 (Latin-1)*, but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1.
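The implicit-upgrade half of that paragraph can be seen with a plain concatenation. This is a small sketch with made-up values: a byte string holding the UTF-8 octets for "á" is joined to a character string, and Perl treats each byte as one Latin-1 character instead of decoding the pair.

```perl
use strict;
use warnings;

my $bytes = "\xC3\xA1";   # UTF-8 octets for "á", still a plain byte string
my $chars = "\x{144}";    # a character string holding "ń" (U+0144)

# Concatenation upgrades $bytes implicitly, reading each byte as Latin-1.
my $joined = $chars . $bytes;

my @codepoints = map { sprintf 'U+%04X', ord($_) } split //, $joined;
printf "length=%d codepoints=%s\n", length($joined), join(' ', @codepoints);
```

The result has three characters (U+0144 U+00C3 U+00A1), not two, which is why byte strings should be decoded explicitly before they are mixed with character strings.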

Have I got this right?

Clint


Replies are listed 'Best First'.
Re^7: A UTF8 round trip with MySQL
by Joost (Canon) on Jun 13, 2007 at 12:07 UTC

If your script is in latin-1, `decode('iso-8859-1',$string)` will work too. As far as I know decode() will always upgrade to utf8 (or ascii, which is a byte-compatible subset of utf-8).

If `$string = 'ńá';` is a literal in a utf-8 encoded script, you should use the utf8 pragma to set the utf-8 markers correctly on literals. And then decode('iso-8859-1') probably won't work correctly on it. But utf8::upgrade() will still work.

"By default, there is a fundamental asymmetry in Perl's unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1."

I don't know what "Unicode strings are downgraded with UTF-8 encoding" means. Also the line below that paragraph in perlunicode says

If you wish to interpret byte strings as UTF-8 instead, use the "encoding" pragma:

      use encoding 'utf8';

Don't believe it. You should use utf8; instead. `use encoding 'utf8'` is broken.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
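A small sketch of the use utf8; advice, assuming this snippet itself is saved as a UTF-8 encoded file (the result depends on that). The same literal is read once with the pragma in effect and once without, and the lengths show whether Perl saw characters or raw octets.

```perl
use strict;
use warnings;
use Encode ();

my ($with, $without);
{ use utf8; $with    = 'ńá'; }   # literal is read as characters
{ no utf8;  $without = 'ńá'; }   # literal is read as raw octets

printf "with use utf8:    length=%d\n", length($with);
printf "without use utf8: length=%d\n", length($without);

# Decoding the raw octets as UTF-8 recovers the same characters.
my $recovered = Encode::decode('UTF-8', $without);
printf "decoded octets match: %s\n", ($recovered eq $with ? 'yes' : 'no');
```

With a UTF-8 source file the first length should be 2 and the second 4; a Latin-1 encoded file would behave differently.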
