Re^6: Mugged by UTF8, this CANNOT be right

[ This is an accidental posting of my unfinished post. Please ignore in favour of Re^6: Mugged by UTF8, this CANNOT be right ]

Perl does guess. Not "at encodings" perhaps, but tosh didn't say Perl guesses at encodings. Those are your words, ikegami, not tosh's.

I'm well aware of that, but it's the only thing I can see that makes sense. He can correct me if I missed something.

I did not miss what you brought up.

When Unicode Does Not Happen

It's just telling you that opaque strings from the system are opaque strings. It really goes without saying. You can use it as a list of data sources and outputs that should be added to my bullets if they're used.

This is an example of Perl doing things right. Modules on CPAN, on the other hand, are often unclear about the format in which they expect strings.

You'll get garbage if you don't properly encode or decode the string. That's neither surprising nor unexpected.
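To make "garbage" concrete, here's a minimal sketch using the core Encode module. It assumes the incoming bytes are UTF-8; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Bytes as they might arrive from a file or socket: "é" encoded as UTF-8.
my $bytes = "\xC3\xA9";

# Properly decoded: a single character, U+00E9.
my $text = decode('UTF-8', $bytes);

my $len_bytes = length($bytes);   # 2 (counts bytes)
my $len_text  = length($text);    # 1 (counts characters)

# The classic mistake: encoding something that is already encoded.
# Each byte is treated as a character and encoded again.
my $double = encode('UTF-8', $bytes);   # "\xC3\x83\xC2\xA9", shows up as "Ã©"

printf "%d bytes, %d char, double-encoded is %d bytes\n",
    $len_bytes, $len_text, length($double);
```

Nothing in there is a bug in Perl. The second encode() did exactly what it was asked to do with the string it was given.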

The "Unicode Bug"

What is known as "the Unicode bug" is the making of decisions based on the internal storage format of a string. It affects what /\w/ matches, for example.

Switching to working with encoded data would not usually solve this kind of problem. It would make it worse. (e.g. uc("é") would stop working entirely instead of almost always working.)

And that's assuming he can actually trigger the bug unintentionally.
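For reference, a small sketch of what the bug looks like when it is triggered on purpose. It assumes a Perl with the unicode_strings feature available but switched off (the default), and no locale in effect; utf8::upgrade is used only to change the internal storage format.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make sure the default behaviour is in effect

my $s = "\xE9";                 # é, stored in the downgraded (one byte) format

my $w_down  = ($s =~ /\w/) ? 1 : 0;   # 0: not treated as a word character
my $uc_down = uc($s);                 # unchanged, still "\xE9"

utf8::upgrade($s);              # same string, different internal storage

my $w_up  = ($s =~ /\w/) ? 1 : 0;     # 1: now it is a word character
my $uc_up = uc($s);                   # "\xC9", i.e. É

print "downgraded: \\w=$w_down  upgraded: \\w=$w_up\n";
```

The string compares equal before and after the upgrade; only the decisions made from its storage format differ.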

Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

This explains how to work around "the Unicode bug". There's no indication this is necessary.
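For completeness, one such workaround, sketched under the assumption of Perl 5.14 or later (where the feature also covers regex matching). Whether tosh needs it at all is a separate question.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # lexically scoped: Unicode rules regardless of storage

my $s = "\xE9";                  # é
utf8::downgrade($s);             # force the format that would otherwise trigger the bug

my $is_word = ($s =~ /\w/) ? 1 : 0;   # 1
my $upper   = uc($s);                 # "\xC9", i.e. É

print "is_word=$is_word\n";
```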

The "Unicode bug" involves exactly those characters in the Latin-1 Supplement Block (U+0080 through U+00FF) that tosh said "mugged" him.

He didn't say that the problems were absent for characters above that block, so that doesn't rule out mixing encoded strings and decoded strings, a much more likely error, especially in view of his solution.

The problem is that the explanations and workarounds are incomprehensible by mere mortals

There's no indication that any workaround is required.

I think this is a simple case of improper concatenation or interpolation. It's so simple, but it's so common, and so devastating, and there's not much that can be done about it. For example, SQL injection bugs are due to the improper encoding of values into literals. It's up to the coder to know what his strings are and what he can do with them.
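The kind of mistake I have in mind looks something like this. It's a sketch with made-up variable names, assuming UTF-8 on both input and output; I'm not claiming it's what tosh's code actually does.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

my $text  = "\x{E9}";                    # decoded text: the character é
my $bytes = encode('UTF-8', "\x{E9}");   # encoded text: the bytes C3 A9

# Wrong: concatenating decoded text with encoded bytes.
# Perl can't know the two strings are different kinds of data.
my $mixed     = $text . $bytes;
my $out_wrong = encode('UTF-8', $mixed);    # C3 A9 C3 83 C2 A9 ("éÃ©")

# Right: decode on input, work with text, encode on output.
my $joined    = $text . decode('UTF-8', $bytes);
my $out_right = encode('UTF-8', $joined);   # C3 A9 C3 A9 ("éé")

printf "wrong: %d bytes, right: %d bytes\n",
    length($out_wrong), length($out_right);
```

Note that only characters outside ASCII are affected, which is why this kind of bug tends to show up first with the U+0080 through U+00FF range.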

I'm not saying there's no room for improvement. One area where I would like to see improvement is the documentation of functions; it is often unclear as to the format in which strings are expected or returned (e.g. encoded text vs Unicode text).
