Re^6: Mugged by UTF8, this CANNOT be right

Perl does guess. Not "at encodings" perhaps, but tosh didn't say Perl guesses at encodings. Those are your words, ikegami, not tosh's.

I'm well aware of that, but it's the only thing I can see that makes sense to me at the moment. He can correct me if I missed something.

I did not miss what you brought up.

When Unicode Does Not Happen

It's just telling you that opaque strings of bytes from the system are opaque strings. It really goes without saying. You can use it as a list of data sources and outputs that should be added to my bullets if they're used.

This is an example of Perl doing things right. Modules on CPAN are often unclear as to the format in which they desire strings to be, and that leads to guessing on the programmer's part.

There is some guessing involved here when the programmer passes garbage, but that would cause it to work correctly when it shouldn't, not the other way around.

The "Unicode Bug"

What is known as "the Unicode bug" is the making of decisions based on the internal storage format of a string. It affects what /\w/ matches, for example.

Switching to working with encoded data would not usually solve this kind of problem. It would make it worse. (e.g. uc("é") would stop working entirely instead of almost always working.)
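
To make that concrete, here is a minimal sketch. It assumes a perl with neither "use feature 'unicode_strings'" nor "use locale" in scope; the variable names are mine, for illustration only. It shows /\w/ giving a different answer for the same character depending on storage format, and uc doing nothing useful once the text is encoded.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Decoded text: one character, U+00E9 (e with acute accent).
# A literal below 0x100 is stored with UTF8=0.
my $decoded = "\x{E9}";

# "The Unicode bug": same character, different answer depending on storage.
my $before = $decoded =~ /\w/ ? 1 : 0;   # stored with UTF8=0
utf8::upgrade($decoded);                 # same string, now stored with UTF8=1
my $after  = $decoded =~ /\w/ ? 1 : 0;

# uc on decoded text versus uc on the UTF-8 encoding of that text.
my $uc_decoded = uc($decoded);                    # U+00C9
my $uc_encoded = uc(encode('UTF-8', $decoded));   # bytes C3 A9, unchanged

printf "\\w before upgrade: %d, after: %d\n", $before, $after;
printf "uc(decoded) = U+%04X\n", ord($uc_decoded);
printf "uc(encoded) = %s\n", unpack('H*', $uc_encoded);
```

The encoded case never produces the bytes for "É" (C3 89), which is why switching to encoded data makes things worse rather than better.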

And that's assuming he can actually trigger the bug unintentionally.

Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

This explains how to work around "the Unicode bug". There's no indication this is necessary.

The "Unicode bug" involves exactly those characters in the Latin-1 Supplement Block (U+0080 through U+00FF) that tosh said "mugged" him.

He didn't say whether he also had problems with characters above that block, so that doesn't rule out mixing encoded strings and decoded strings, a much more likely error, especially in view of his solution.

The problem is that the explanations and workarounds are incomprehensible by mere mortals

There's no indication that any workaround is required.

I think this is a simple case of improper concatenation or interpolation. It's so simple, but it's so common, and so devastating, and there's not much that can be done about it. For example, SQL injection bugs are due to the improper encoding of values into literals. It's up to the coder to know what his strings are and what he can do with them.
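
A minimal sketch of the kind of mix-up I mean. This is my guess at the failure mode, not tosh's actual code, and the variable names are made up for illustration. Decoded text is concatenated with bytes that are still UTF-8 encoded, and the result is then encoded for output.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "caf\x{E9}";                        # decoded: 4 characters
my $bytes = encode('UTF-8', "\x{E9}t\x{E9}");   # encoded: 5 bytes

# Wrong: decoded text and encoded bytes end up in one string.
# Perl cannot tell them apart, so nothing warns here.
my $mixed = "$text $bytes";

# Encoding for output double-encodes the part that was already bytes.
my $bad = encode('UTF-8', $mixed);

# Right: decode inputs, work with text, encode once on output.
my $good = encode('UTF-8', $text . ' ' . decode('UTF-8', $bytes));

printf "bad:  %s\n", unpack('H*', $bad);
printf "good: %s\n", unpack('H*', $good);
```

Only the code knows which of its strings are text and which are bytes; nothing in the string itself records that.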

Even the wizards never seem to agree how to handle Unicode correctly using Perl.

I think the point of debate is the answer to "What is a Perl string?". More specifically, is a string with UTF8=0 necessarily a string of Unicode characters?

I think this has officially been resolved as follows:

No. Perl strings are strings of 32-bit or 64-bit integers (depending on your build). The meaning of each of those integers (characters) is not limited by Perl, so there is no restriction for them to be limited to Unicode characters. The meaning of the characters of the string will be left to individual functions. For example, lc will consider its argument a string of Unicode characters.
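
A short sketch of that view, with variable names made up for illustration. The same string type holds text and packed binary data equally well, and it is the function (here lc) that decides what the integers mean.

```perl
use strict;
use warnings;

# A string is a sequence of integers; chr and ord convert either way.
my $s     = join '', map { chr($_) } (0x41, 0xE9, 0x263A);
my @codes = map { ord($_) } split //, $s;

# The same kind of string holds arbitrary binary data just as well:
# four elements, each in the range 0..255.
my $packed = pack 'N', 0xDEADBEEF;
my @octets = map { ord($_) } split //, $packed;

# lc interprets the elements as Unicode code points.
my $lower = lc "\x{C9}\x{100}";   # U+00E9 U+0101

printf "codes:  %s\n", join ' ', map { sprintf 'U+%04X', $_ } @codes;
printf "octets: %s\n", join ' ', map { sprintf '%02X', $_ } @octets;
printf "lower:  %s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $lower;
```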

I think much of the confusion stems from 5.6's failed attempt to support Unicode.

I'm not saying there's no room for improvement.

One area in which I would like to see improvement is the documentation of functions; they are often unclear as to the format in which strings are expected or returned (e.g. encoded text vs Unicode text).

An improvement in warnings would be even better. There's a ticket concerning this I would love to see come to fruition. It basically adds semantic flags to strings. Something like

Unknown
String of bytes
String of Unicode characters
String of text encoded as per locale

This would allow automatic conversion in some instances, and warning of surely incorrect concatenations in others.

I can't find the ticket at the moment.
