in reply to Re^3: Mugged by UTF8, this CANNOT be right
in thread Mugged by UTF8, this CANNOT be right

And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time

Perl doesn't guess at encodings, so I don't know to what you are referring.

Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.

Same goes for output. For example, file handles expect bytes unless configured otherwise. It's the writer's responsibility to convert the output into bytes or to configure the file handle to do the conversion for it.
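
A minimal sketch of that division of labour (the file names and the choice of UTF-8 are placeholders, not anything mandated by Perl):

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Reading: the handle returns bytes; it's the reader's job to decode them.
    open(my $in, '<', 'data.txt') or die $!;
    my $bytes = <$in>;
    my $text  = decode('UTF-8', $bytes);    # assumes the source is UTF-8

    # Writing: it's the writer's job to encode text back into bytes.
    open(my $out, '>', 'out.txt') or die $!;
    print $out encode('UTF-8', $text);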

but don't forget there's SIX places for UTF8 to get messed up:

At least everywhere data is serialised or deserialised. I think you missed a couple. A more complete list:

Inputs:

Outputs:

The nice thing is that they are all independent. Fixing a problem with one doesn't require the cooperation of others.

With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database.

No, that wasn't the only way to fix it. Two wrongs made a right, but introduced many other problems. Specifically, the workaround broke length, substr, regular expressions and much more:

$ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 0
1
1
$ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 1
2
0

but it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels.

Good. You're checking for undef, which isn't a string. Encoding something that isn't a string is most definitely an error. I don't know why you mention this.
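
For what it's worth, the guard is one line. A sketch, where fetch_field is a hypothetical accessor that may return undef (a SQL NULL):

    use strict;
    use warnings;
    use Encode qw(encode);

    my $value = fetch_field();    # hypothetical; may return undef (SQL NULL)

    # encode() warns about an uninitialized value when given undef,
    # so only encode defined values.
    my $bytes = defined $value ? encode('UTF-8', $value) : undef;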

Re^5: Mugged by UTF8, this CANNOT be right
by tosh (Scribe) on Jan 26, 2011 at 21:26 UTC
    And there's my point. For web programming there's more checking+encoding+decoding than there is actual programming logic.

    Perl is an incredible language, my exclusive language for 15 years now, but when it comes to the global language requirements that most web programming ultimately requires perhaps Perl is presently just not suited to that task.

    Maybe I'm just bitchy and irritated right now, but I think of a 10-column x 100-row table of data: that's 1000 data points, and each needs to be checked, decoded and encoded numerous times. It should not be like this. The future of programming is globally connected, and Perl needs to deal with this fact a lot better.

    Which leads me to ask: Is there a programming language that easily handles Unicode either automatically or with a flag in the program that when set tells everything else to use Unicode, or is everyone else writing PHP/Python/ASP/Java/etc. as frustrated?

    Tosh

      Perl is an incredible language, my exclusive language for 15 years now, but when it comes to the global language requirements that most web programming ultimately requires perhaps Perl is presently just not suited to that task.

      None of those tasks are Perl-specific.

      and each needs to be checked, decoded, encoded, numerous times.

      None need to be "checked". If you don't want to use NULLs, don't use NULLs. If you want to use NULLs, don't complain that you're using NULLs.

      As for your claim that each needs to be decoded and encoded multiple times, it's nonsense. Everything needs to be decoded and encoded exactly once, and that can usually be done automatically.
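
      A sketch of the "automatically" part using PerlIO layers (assuming the files really are UTF-8; the file name is a placeholder):

          use strict;
          use warnings;
          use open ':encoding(UTF-8)';        # decode on read, encode on write
          binmode STDOUT, ':encoding(UTF-8)'; # same for output to STDOUT

          open(my $fh, '<', 'input.txt') or die $!;
          while (my $line = <$fh>) {
              print length($line), "\n";      # counts characters, not bytes
          }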

      Which leads me to ask: Is there a programming language that easily handles Unicode either automatically

      Unicode is not an encoding.

      Your problem has to do with dealing with the encodings of various data sources. You have to do that no matter what language unless it places limits on your data sources and on your outputs.

        Which leads me to ask: Is there a programming language that easily handles Unicode
        Unicode is not an encoding.

        Nor does the OP suggest in what you're replying to that it is.

        The answer is no, there is no language that makes it easy. This is the fault of the fucking cretins who decided that having eleventy million different encodings of the same text was a great idea, and that the encoding that should be most common should be really complicated.

Re^5: Mugged by UTF8, this CANNOT be right
by Jim (Curate) on Jan 27, 2011 at 00:37 UTC
    Perl doesn't guess at encodings, so I don't know to what you are referring.

    Perl does guess. Not "at encodings" perhaps, but tosh didn't say Perl guesses at encodings. Those are your words, ikegami, not tosh's.

    The guesswork (or whatever you want to call the legerdemain) that Perl does is documented in perlunicode.

    The "Unicode bug" involves exactly those characters in the Latin-1 Supplement Block (U+0080 through U+00FF) that tosh said "mugged" him.

    The problem is that the explanations and workarounds are incomprehensible to mere mortals who just want to write a script to do something simple with modern text (Unicode). It's way too hard to sort out Perl's impenetrable Unicode model.

    Evidence that it's way too hard abounds on PerlMonks. Thread after thread about Perl's Unicode support quickly devolves into a debate among the cognoscenti here about how it all works. Even the wizards never seem to agree on how to handle Unicode correctly using Perl.

      Perl does guess. Not "at encodings" perhaps, but tosh didn't say Perl guesses at encodings. Those are your words, ikegami, not tosh's.

      I'm well aware of that, but it's the only thing I can see that makes sense to me at the moment. He can correct me if I missed something.

      I did not miss what you brought up.

      When Unicode Does Not Happen

      It's just telling you that opaque strings of bytes from the system are exactly that: opaque strings of bytes. It really goes without saying. You can use it as a list of data sources and outputs that should be added to my bullets if they're used.

      This is an example of Perl doing things right. Modules on CPAN are often unclear as to the format in which they expect strings, and that leads to guessing on the programmer's part.

      There is some guessing involved here when the programmer passes garbage, but that would cause it to work correctly when it shouldn't, not the other way around.

      The "Unicode Bug"

      What is known as "the Unicode bug" is the making of decisions based on the internal storage format of a string. It affects what /\w/ matches, for example.

      Switching to working with encoded data would not usually solve this kind of problem. It would make it worse. (e.g. uc("é") would stop working entirely instead of almost always working.)

      And that's assuming he can actually trigger the bug unintentionally.
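
      For reference, a minimal way to trigger it deliberately on perls of this era (before use feature 'unicode_strings'): the same character matches /\w/ or not depending only on its internal storage format.

          $ perl -le '$_ = chr 0xE9; print /\w/ ? "word" : "no match"'
          no match
          $ perl -le '$_ = chr 0xE9; utf8::upgrade($_); print /\w/ ? "word" : "no match"'
          word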

      Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

      This explains how to work around "the Unicode bug". There's no indication this is necessary.

      The "Unicode bug" involves exactly those characters in the Latin-1 Supplement Block (U+0080 through U+00FF) that tosh said "mugged" him.

      He didn't specify that he didn't get the problems with characters above that block, so that doesn't rule out mixing encoded strings and decoded strings, a much more likely error, especially in view of his solution.

      The problem is that the explanations and workarounds are incomprehensible by mere mortals

      There's no indication that any workaround is required.

      I think this is a simple case of improper concatenation or interpolation. It's so simple, but it's so common, and so devastating, and there's not much that can be done about it. For example, SQL injection bugs are due to the improper encoding of values into literals. It's up to the coder to know what their strings are and what they can do with them.
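
      A minimal sketch of the concatenation mistake in question (the variable names are mine):

          use strict;
          use warnings;
          use Encode qw(encode);

          my $decoded = "caf\x{E9}";                   # decoded text, 4 characters
          my $encoded = encode('UTF-8', "caf\x{E9}");  # encoded text, 5 bytes

          # Concatenating them mixes character data with byte data, so
          # encoding the result double-encodes the second half:
          print encode('UTF-8', $decoded . $encoded), "\n";   # cafécafÃ©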

      Even the wizards never seem to agree how to handle Unicode correctly using Perl.

      I think the point of debate is the answer to "What is a Perl string?". More specifically, is a string with UTF8=0 necessarily a string of Unicode characters?

      I think this has officially been resolved as follows:

      No. Perl strings are strings of 72-bit integers (or less depending on your build). The meaning of each of those integers (characters) is not limited by Perl, so there is no restriction for them to be limited to Unicode characters. The meaning of the characters of the string is left to individual functions. For example, lc will consider its argument a string of Unicode characters.

      I think much of the confusion stems from 5.6's failed attempt to support Unicode.

      I'm not saying there's no room for improvement.

      One area in which I would like to see improvement is the documentation of functions; they are often unclear as to the format in which strings are expected or returned (e.g. encoded text vs Unicode text).

      An improvement in warnings would be even better. There's a ticket concerning this I would love to see come to fruition. It basically adds semantic flags to strings. Something like:

      • Unknown
      • String of bytes
      • String of Unicode characters
      • String of text encoded as per locale

      This would allow automatic conversion in some instances, and warning of surely incorrect concatenations in others.

      I can't find the ticket at the moment.

Re^5: Mugged by UTF8, this CANNOT be right
by Jim (Curate) on Jan 27, 2011 at 01:10 UTC
    Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.

    That's not what the DBI documentation says. It says, "Most data is returned to the Perl script as strings. … Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8."

      That's not what the DBI documentation says. It says, "Most data is returned to the Perl script as strings.

      Strings containing encoded text are still strings, so the second sentence does not back up the claim made in the first sentence.

      Also, keep in mind that databases are just one data source.

      Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding).

      You are correct that the builtins that expect Unicode (or iso-8859-1 or US-ASCII) strings would mishandle encoded strings. That's why decoding is needed.

      That said, there are lots of errors in that passage. I covered them below because they're off-topic.

      Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used.

      It's impossible for the DBDs to automatically determine whether conversion is required. They would need to be told, but there's no way to tell them. They guess by creating an instance of the Unicode bug.

      Sometimes they leave the string as is (assuming it's already been encoded for the database), sometimes they convert it to utf8 (when it obviously wasn't encoded for the database).

      I believe I tested DBD::Pg, DBD::mysql and DBD::SQLite.
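
      A sketch of why flag-based guessing is an instance of the Unicode bug: the two strings below are identical as text, yet a driver that branches on the internal UTF8 flag treats them differently.

          use strict;
          use warnings;

          my $x = "caf\x{E9}";
          my $y = "caf\x{E9}";
          utf8::upgrade($y);   # same characters, different internal format

          print $x eq $y ? "same text\n" : "different text\n";  # same text
          printf "UTF8 flags: %d %d\n",
              utf8::is_utf8($x) ? 1 : 0,   # 0
              utf8::is_utf8($y) ? 1 : 0;   # 1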


      This part is off-topic.