
That's not what the DBI documentation says. It says, "Most data is returned to the Perl script as strings."

Strings containing encoded text are still strings, so the second sentence does not back up the claim made in the first sentence.

Also, keep in mind that databases are just one data source.

Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding).

You are correct that the builtins that expect Unicode (or iso-8859-1 or US-ASCII) strings would mishandle them. That's why decoding is needed.
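To illustrate the mishandling, here is a minimal sketch. It assumes the data source handed back UTF-8 encoded bytes; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# "é" (U+00E9) as the two UTF-8 bytes a data source might hand back.
my $bytes = "\xC3\xA9";

# Decoding turns the encoded bytes into a string of Unicode code points.
my $text = decode('UTF-8', $bytes);

# length counts elements of the string, whatever they represent.
printf "encoded: length=%d\n", length($bytes);
printf "decoded: length=%d\n", length($text);

# uc finds nothing it recognises as a letter in the encoded form,
# but maps U+00E9 to U+00C9 in the decoded form.
printf "uc changed encoded: %s\n", uc($bytes) ne $bytes ? "yes" : "no";
printf "uc changed decoded: %s\n", uc($text)  ne $text  ? "yes" : "no";
```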

That said, there are lots of errors in that passage. I cover them separately below because that discussion is off-topic.

Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used.

It's impossible for the DBDs to automatically determine whether conversion is required. They would need to be told, but there's no way to tell them. Instead, they guess based on how the string happens to be stored, which creates an instance of the Unicode bug.

Sometimes they leave the string as is (assuming it's already been encoded for the database), sometimes they convert it to utf8 (when it obviously wasn't encoded for the database).
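A rough sketch of what such a guess amounts to. The function `guess_bytes_for_db` is hypothetical and is not taken from any real DBD; it only mimics the behaviour described above by looking at the string's storage format.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical stand-in for what a driver might do with a bound value.
sub guess_bytes_for_db {
    my ($value) = @_;
    return utf8::is_utf8($value)
        ? encode('UTF-8', $value)   # wide storage: treated as text and encoded
        : $value;                   # 8-bit storage: assumed to be encoded already
}

my $narrow = "\xE9";      # "é", stored in the 8-bit format
my $wide   = $narrow;
utf8::upgrade($wide);     # the same string in the other storage format

my $same_string = $narrow eq $wide ? 1 : 0;
my $same_bytes  = guess_bytes_for_db($narrow) eq guess_bytes_for_db($wide) ? 1 : 0;

print "same string: $same_string, same bytes sent: $same_bytes\n";
# prints "same string: 1, same bytes sent: 0"
```

Two strings that Perl considers equal end up as different bytes in the database, which is why explicitly encoding before handing the value over is the safer habit.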

I believe I tested DBD::Pg, DBD::mysql and DBD::SQLite.

This part is off-topic.

Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding).

No, Perl supports two kinds of storage for strings: one that can store 8-bit integers and one that can store larger integers.

If you're dealing with Unicode text, the latter must be used if your string contains characters above U+00FF, but it's optional otherwise. That said, it's safer to always use the latter for text to avoid instances of the Unicode bug.
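A small sketch of the two storage formats, using the core `utf8::` functions to inspect and switch between them. The variable names are illustrative only.

```perl
use strict;
use warnings;

# "café": every character is at or below U+00FF, so either format can hold it.
my $narrow = "caf\xE9";
my $wide   = $narrow;
utf8::upgrade($wide);    # switch the copy to the format that can hold larger integers

printf "equal: %d\n", $narrow eq $wide ? 1 : 0;
printf "narrow uses wide format: %d\n", utf8::is_utf8($narrow) ? 1 : 0;
printf "wide uses wide format:   %d\n", utf8::is_utf8($wide)   ? 1 : 0;

# U+2660 is above U+00FF, so only the wide format can hold it.
my $spade = "\x{2660}";
my $copy  = $spade;
# The true second argument makes downgrade return false instead of dying.
my $could_downgrade = utf8::downgrade($copy, 1) ? 1 : 0;
printf "spade can be downgraded: %d\n", $could_downgrade;
```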

Certain operations (e.g. lc) require that their argument is a string of Unicode characters, but their argument can still use either string format.

The behaviour of operations should not depend on the format of their argument. When this happens, the operation is said to suffer from "the Unicode bug".
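For instance, `lc` shows the bug when the `unicode_strings` feature is off. This is a minimal sketch; the feature is disabled explicitly here so the result doesn't depend on how the script is run.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make sure the old behaviour is in effect

my $narrow = "\xC9";            # "É" (U+00C9) in the 8-bit format
my $wide   = $narrow;
utf8::upgrade($wide);           # the same string in the other format

my $lc_narrow = lc($narrow);    # only ASCII rules are applied: left unchanged
my $lc_wide   = lc($wide);      # Unicode rules are applied: becomes U+00E9

printf "narrow: %02X, wide: %02X\n", ord($lc_narrow), ord($lc_wide);
# prints "narrow: C9, wide: E9"
```

With `use feature 'unicode_strings'` in effect, both calls should give E9.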

Most people (e.g. the OP) need not know any of this. Just decode inputs, encode outputs.
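A minimal sketch of that rule, assuming the input and output are both UTF-8. The sample data is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Input: "naïve" as UTF-8 encoded bytes, e.g. as read from a file or socket.
my $input_bytes = "na\xC3\xAFve";

# Decode inputs: from here on, the string holds Unicode code points.
my $text = decode('UTF-8', $input_bytes);

# Work on text, not on bytes.
my $shouted = uc($text);

# Encode outputs: convert back to bytes just before writing.
my $output_bytes = encode('UTF-8', $shouted);
print $output_bytes, "\n";
```

For file handles, an `:encoding(UTF-8)` layer given to `open` or `binmode` does the same decoding and encoding for you.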

In reply to Re^6: Mugged by UTF8, this CANNOT be right by ikegami
in thread Mugged by UTF8, this CANNOT be right by tosh
