Re^5: Mugged by UTF8, this CANNOT be right

Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.

That's not what the DBI documentation says. It says, "Most data is returned to the Perl script as strings. … Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8."


Replies are listed 'Best First'.

Re^6: Mugged by UTF8, this CANNOT be right
by ikegami (Patriarch) on Jan 27, 2011 at 03:12 UTC

That's not what the DBI documentation says. It says, "Most data is returned to the Perl script as strings.

Strings containing encoded text are still strings, so the second sentence does not back up the claim made in the first sentence.

Also, keep in mind that databases are just one data source.

Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding).

You are correct that the builtins that expect Unicode (or iso-8859-1 or US-ASCII) strings would mishandle them. That's why decoding is needed.
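As a rough illustration of why decoding matters, here is a minimal sketch that assumes the data source handed back UTF-8 encoded bytes. The sample bytes and variable names are made up for the example. It uses only the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as a data source might return them: U+0436 (CYRILLIC SMALL
# LETTER ZHE) encoded as UTF-8. The value is made up for this example.
my $bytes = "\xD0\xB6";

# Undecoded, Perl sees one character per byte, and uc() has nothing
# sensible to do with them.
my $len_before      = length($bytes);
my $upper_undecoded = uc($bytes);

# Decoding turns the encoded bytes into a string of characters
# (here, a single character).
my $text      = decode('UTF-8', $bytes);
my $len_after = length($text);

# Builtins such as uc() operate on characters, so they behave as
# expected only on the decoded string.
my $upper = uc($text);

printf "before=%d after=%d upper=U+%04X\n",
    $len_before, $len_after, ord($upper);
# prints "before=2 after=1 upper=U+0416"
```

The undecoded string comes back from uc() unchanged, which is the kind of mishandling the builtins are prone to when given encoded text.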

That said, there are lots of errors in that passage. I covered them below because it's off-topic.

Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used.

It's impossible for the DBDs to determine if conversion is required automatically. They would need to be told, but there's no way to tell them. They guess by creating an instance of the Unicode bug.

Sometimes they leave the string as is (assuming it's already been encoded for the database), sometimes they convert it to utf8 (when it obviously wasn't encoded for the database).
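To make that concrete, here is a minimal sketch of how such a guess can go wrong. The function guess_bytes_for_db is a hypothetical stand-in written for this example, not code from any real DBD. It assumes the guess is based on the string's internal storage format (the UTF8 flag), which is what makes it an instance of the Unicode bug.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a driver's guess. Not taken from any real DBD.
sub guess_bytes_for_db {
    my ($str) = @_;
    my $copy = $str;
    if (utf8::is_utf8($copy)) {
        # Flag on: assume the string is text and encode it as UTF-8.
        utf8::encode($copy);
    }
    # Flag off: assume the string is already encoded and send it as is.
    return $copy;
}

# The same one-character string in both internal storage formats.
my $downgraded = "\xE9";
my $upgraded   = "\xE9";
utf8::upgrade($upgraded);

# Perl considers the two strings equal...
my $same = $downgraded eq $upgraded ? 1 : 0;

# ...yet the guess sends different bytes for them.
my $sent_a = unpack('H*', guess_bytes_for_db($downgraded));
my $sent_b = unpack('H*', guess_bytes_for_db($upgraded));

print "same=$same sent_a=$sent_a sent_b=$sent_b\n";
# prints "same=1 sent_a=e9 sent_b=c3a9"
```

Two strings that compare equal in Perl end up as different bytes on the wire, and the caller has no way of telling the stand-in driver which of the two was intended.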

I believe I tested DBD::Pg, DBD::mysql and DBD::sqlite.

This part is off-topic.