in reply to Re^5: Mugged by UTF8, this CANNOT be right
in thread Mugged by UTF8, this CANNOT be right

That's not what the DBI documentation says. It says, "Most data is returned to the Perl script as strings.

Strings containing encoded text are still strings, so the second sentence does not support the claim made in the first.

Also, keep in mind that databases are just one data source.

Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding).

You are correct that the builtins that expect Unicode (or iso-8859-1 or US-ASCII) strings would mishandle them. That's why decoding is needed.
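A minimal sketch of what goes wrong without decoding, assuming the input is UTF-8 (the byte string here is hard-coded purely for illustration):

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # make uc() use Unicode rules consistently
use Encode qw( decode );

# UTF-8 bytes for "é" (U+00E9), as they might arrive from a file or socket.
my $bytes = "\xC3\xA9";

# Decoding turns the two bytes into one character.
my $text = decode('UTF-8', $bytes);

my $byte_len = length($bytes);   # 2: builtins see two "characters"
my $char_len = length($text);    # 1: one character after decoding

# uc() on the undecoded bytes sees U+00C3 and U+00A9, neither of which
# is a lowercase letter, so it changes nothing.
my $uc_bytes = uc($bytes);       # still "\xC3\xA9"
my $uc_text  = uc($text);        # "\xC9" (É)

print "$byte_len $char_len\n";
```
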

That said, there are lots of errors in that passage. I cover them below because they're off-topic.

Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used.

It's impossible for the DBDs to determine automatically whether conversion is required. They would need to be told, but there's no way to tell them. Instead they guess, and in doing so they create an instance of the Unicode bug.

Sometimes they leave the string as is (assuming it's already been encoded for the database), sometimes they convert it to utf8 (when it obviously wasn't encoded for the database).

I believe I tested DBD::Pg, DBD::mysql and DBD::SQLite.


This part is off-topic.

Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding).

No, Perl supports two kinds of storage for strings: one that can store 8-bit integers and one that can store larger integers.

If you're dealing with Unicode text, the latter must be used if your string contains characters above U+00FF, but it's optional otherwise. That said, it's safer to always use the latter for text to avoid instances of the Unicode bug.
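The two storage formats can be inspected with the utf8:: functions. This is only a sketch for poking at internals; normal code shouldn't need to check the flag.

```perl
use strict;
use warnings;

my $latin = "caf\xE9";     # every character is <= U+00FF
my $wide  = "\x{2660}";    # U+2660 does not fit in 8 bits

# Switch a copy to the wider storage format. The string's value is unchanged.
my $same = $latin;
utf8::upgrade($same);

my $latin_flag  = utf8::is_utf8($latin) ? 1 : 0;   # 0: 8-bit storage
my $same_flag   = utf8::is_utf8($same)  ? 1 : 0;   # 1: wide storage
my $still_equal = $latin eq $same       ? 1 : 0;   # 1: same string either way
my $wide_flag   = utf8::is_utf8($wide)  ? 1 : 0;   # 1: wide storage required

# A string with a character above U+00FF can't use 8-bit storage.
# The second argument asks downgrade to return false instead of dying.
my $copy = $wide;
my $can_downgrade = utf8::downgrade($copy, 1) ? 1 : 0;   # 0

print "$latin_flag $same_flag $still_equal $wide_flag $can_downgrade\n";
```
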

Certain operations (e.g. lc) require their argument to be a string of Unicode characters, but that argument can still use either storage format.

The behaviour of operations should not depend on the format of their argument. When this happens, the operation is said to suffer from "the Unicode bug".
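Here is a small sketch of the bug using lc, assuming no locale is in effect. The same one-character string is stored both ways and gives different results under the legacy rules:

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # legacy rules, in case the feature is on elsewhere

# The same string, U+00C9 (É), in each storage format.
my $down = "\xC9";  utf8::downgrade($down);
my $up   = "\xC9";  utf8::upgrade($up);

my $equal = $down eq $up ? 1 : 0;   # 1: they are the same string

my $lc_down = lc($down);   # "\xC9": left alone
my $lc_up   = lc($up);     # "\xE9": lowercased to é

# Enabling unicode_strings makes lc ignore the storage format.
my ($fixed_down, $fixed_up);
{
    use feature 'unicode_strings';
    $fixed_down = lc($down);   # "\xE9"
    $fixed_up   = lc($up);     # "\xE9"
}

print $lc_down eq $lc_up       ? "same\n" : "different\n";   # different
print $fixed_down eq $fixed_up ? "same\n" : "different\n";   # same
```
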

Most people (e.g. the OP) need not know any of this. Just decode inputs, encode outputs.
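For example, a minimal sketch of that pattern with the Encode module, assuming both the input and the output are UTF-8 (the input bytes are hard-coded here; in practice they'd come from a file, socket, database, etc.):

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Encode qw( decode encode );

# Bytes from outside the program: UTF-8 for "naïve".
my $input_bytes = "na\xC3\xAFve";

# Decode inputs: bytes become characters.
my $text = decode('UTF-8', $input_bytes);

# Work with characters in between.
my $result = uc($text);   # "NAÏVE", 5 characters

# Encode outputs: characters become bytes again.
my $output_bytes = encode('UTF-8', $result);

binmode(STDOUT);          # write the bytes as is
print $output_bytes, "\n";
```

For file handles, an I/O layer such as :encoding(UTF-8) passed to open or binmode does the decoding or encoding for you.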