in reply to Re^2: Mugged by UTF8, this CANNOT be right
in thread Mugged by UTF8, this CANNOT be right

The problem is that it's a very large application, so breaking out anything self-contained isn't possible.

What I did notice is that everything was working just fine: my Template Toolkit templates have BOMs, my DB is all UTF-8 encoded, my charsets were perfect.

Everything worked great, probably because Perl was doing the right thing, but don't forget there are SIX places for UTF8 to get messed up:
1) Template encoding
2) HTTP headers
3) HTML headers
4) DB encoding
5) DB handle
6) The language itself

That's suddenly a lot of room for forgetting one detail that throws everything else off.

With a small change to the application, the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database. But not only does it have to be Encoded, it has to be checked FIRST, because if you don't, Encode.pm spews warnings like an 18-year-old after a bottle of Jack Daniels.
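For the record, the check-then-decode dance looks roughly like this sketch; `db_text` and the sample bytes are made up for illustration, and note that `is_utf8()` only reports Perl's internal flag, so using it as an "already decoded?" test is itself a fragile heuristic:

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical helper: check first, then decode, to keep Encode.pm quiet.
sub db_text {
    my ($bytes) = @_;
    return undef unless defined $bytes;   # Encode warns on undef, so check first
    return $bytes if is_utf8($bytes);     # flag says "decoded"; don't double-decode
    return decode('UTF-8', $bytes);       # bytes from the driver -> characters
}

my $text = db_text("caf\xC3\xA9");        # UTF-8 bytes as a DBD might hand them over
print length($text), "\n";                # 4 characters, not 5 bytes
```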

And what I believe is happening is that for 90% of the people out there working with UTF8, the "guessing" that Perl does works most of the time. But the problem remains that the only way to be certain seems to be to encode/decode all input and output, and that's just not the way things should work: 10% of my programming time should not be spent worrying about this issue.

Tosh

Re^4: Mugged by UTF8, this CANNOT be right
by ikegami (Patriarch) on Jan 26, 2011 at 21:01 UTC

    And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time

    Perl doesn't guess at encodings, so I don't know to what you are referring.

    Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.

    Same goes for output. For example, file handles expect bytes unless configured otherwise. It's the writer's responsibility to convert the output into bytes or to configure the file handle to do the conversion for it.
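    To make that concrete, here's a small sketch of both approaches (the filenames are made up):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "na\x{EF}ve";    # decoded characters in memory ("naïve")

# Option 1: convert to bytes yourself at the point of output.
open my $out1, '>', 'out-bytes.txt' or die $!;
print {$out1} encode('UTF-8', $text);
close $out1;

# Option 2: tell the handle to do the conversion for you.
open my $out2, '>:encoding(UTF-8)', 'out-layer.txt' or die $!;
print {$out2} $text;
close $out2;
# Both files now contain the same five bytes: "na\xC3\xAFve".
```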

    but don't forget there are SIX places for UTF8 to get messed up:

    At least everywhere data is serialised or deserialised. I think you missed a couple. A more complete list:

    Inputs:

    • Source code: Decoding.
    • HTML form data(?): Decoding.
    • Database: Decoding.
    • Template: Decoding.

    Outputs:

    • Database (queries and parameters(?)): Encoding.
    • HTML response: Encoding and inclusion of Content-Type header.
    • HTTP response: Inclusion of Content-Type in header.
    • Error log: Encoding.

    The nice thing is that they are all independent. Fixing a problem with one doesn't require the cooperation of others.
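    A one-time setup covering several of those spots might look like the following sketch; the DBI connection is commented out, and its attribute name (mysql_enable_utf8) is DBD::mysql-specific and an assumption here:

```perl
use strict;
use warnings;
use utf8;                              # source code: literals below are decoded

binmode STDOUT, ':encoding(UTF-8)';    # HTML response body gets encoded on output
binmode STDERR, ':encoding(UTF-8)';    # error log too

# Database handle (sketch only; the attribute name is driver-specific):
# my $dbh = DBI->connect($dsn, $user, $pw, { mysql_enable_utf8 => 1 });

# HTTP header announcing the encoding of the body:
my $header = "Content-Type: text/html; charset=UTF-8\r\n\r\n";
print $header, "<p>é</p>\n";
```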

    With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database.

    No, that wasn't the only way to fix it. Two wrongs made a right, but introduced many other problems. Specifically, it broke length, substr, regular expressions and much more.

    $ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 0
    1
    1
    $ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 1
    2
    0

    but it has to be checked FIRST, because if you don't, Encode.pm spews warnings like an 18-year-old after a bottle of Jack Daniels.

    Good. You're checking for undef, which isn't a string. Encoding something that isn't a string is most definitely an error. I don't know why you mention this.

      And there's my point. For web programming there's more checking+encoding+decoding than there is actual programming logic.

      Perl is an incredible language, my exclusive language for 15 years now, but when it comes to the global language requirements that most web programming ultimately requires perhaps Perl is presently just not suited to that task.

      Maybe I'm just bitchy and irritated right now, but think of a 10-column by 100-row table of data: that's 1000 data points, and each needs to be checked, decoded, and encoded numerous times. It should not be like this; the future of programming is globally connected, and Perl needs to deal with this fact a lot better.

      Which leads me to ask: Is there a programming language that easily handles Unicode either automatically or with a flag in the program that when set tells everything else to use Unicode, or is everyone else writing PHP/Python/ASP/Java/etc. as frustrated?

      Tosh

        Perl is an incredible language, my exclusive language for 15 years now, but when it comes to the global language requirements that most web programming ultimately requires perhaps Perl is presently just not suited to that task.

        None of those tasks are Perl-specific.

        and each needs to be checked, decoded, encoded, numerous times.

        None need to be "checked". If you don't want to use NULLs, don't use NULLs. If you want to use NULLs, don't complain that you're using NULLs.

        As for your claim that each needs to be decoded and encoded multiple times, it's nonsense. Everything needs to be decoded and encoded exactly once, and that can usually be done automatically.

        Which leads me to ask: Is there a programming language that easily handles Unicode either automatically

        Unicode is not an encoding.

        Your problem has to do with dealing with the encodings of various data sources. You have to do that no matter what the language, unless it places limits on your data sources and on your outputs.

      Perl doesn't guess at encodings, so I don't know to what you are referring.

      Perl does guess. Not "at encodings" perhaps, but tosh didn't say Perl guesses at encodings. Those are your words, ikegami, not tosh's.

      The guesswork (or whatever you want to call the legerdemain) that Perl does is documented in perlunicode.

      The "Unicode bug" involves exactly those characters in the Latin-1 Supplement Block (U+0080 through U+00FF) that tosh said "mugged" him.

      The problem is that the explanations and workarounds are incomprehensible to mere mortals who just want to write a script to do something simple with modern text (Unicode). It's way too hard to sort out Perl's impenetrable Unicode model.

      Evidence that it's way too hard abounds on PerlMonks. Thread after thread about Perl's Unicode support quickly devolves into a debate among the cognoscenti here about how it all works. Even the wizards never seem to agree on how to handle Unicode correctly using Perl.

        Perl does guess. Not "at encodings" perhaps, but tosh didn't say Perl guesses at encodings. Those are your words, ikegami, not tosh's.

        I'm well aware of that, but it's the only thing I can see that makes sense to me at the moment. He can correct me if I missed something.

        I did not miss what you brought up.

        When Unicode Does Not Happen

        It's just telling you that strings of bytes from the system are opaque to Perl. It really goes without saying. You can use it as a list of data sources and outputs that should be added to my bullets if they're used.

        This is an example of Perl doing things right. Modules on CPAN are often unclear as to the format in which they desire strings to be, and that leads to guessing on the programmer's part.

        There is some guessing involved here when the programmer passes garbage, but that would cause it to work correctly when it shouldn't, not the other way around.

        The "Unicode Bug"

        What is known as "the Unicode bug" is the making of decisions based on the internal storage format of a string. It affects what /\w/ matches, for example.

        Switching to working with encoded data would not usually solve this kind of problem. It would make it worse. (e.g. uc("é") would stop working entirely instead of almost always working.)

        And that's assuming he can actually trigger the bug unintentionally.
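        A tiny demonstration of that difference, assuming a recent perl with default (non-unicode_strings) semantics; the bytes here are the UTF-8 encoding of "é":

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";                 # the two UTF-8 bytes for "é"
my $chars = decode('UTF-8', $bytes);    # the one-character string "é"

# On the decoded string, uc() sees "é" and uppercases it to "É":
print uc($chars) eq "\x{C9}" ? "uppercased\n" : "unchanged\n";

# On the raw bytes, uc() sees "\xC3\xA9" and changes nothing:
print uc($bytes) eq $bytes ? "unchanged\n" : "uppercased\n";
```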

        Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

        This explains how to work around "the Unicode bug". There's no indication this is necessary.

        The "Unicode bug" involves exactly those characters in the Latin-1 Supplement Block (U+0080 through U+00FF) that tosh said "mugged" him.

        He didn't specify that he didn't get the problems with characters above that block, so that doesn't rule out mixing encoded strings and decoded strings, a much more likely error, especially in view of his solution.

        The problem is that the explanations and workarounds are incomprehensible by mere mortals

        There's no indication that any workaround is required.

        I think this is a simple case of improper concatenation or interpolation. It's so simple, but it's so common, and so devastating, and there's not much that can be done about it. For example, SQL injection bugs are due to the improper encoding of values into literals. It's up to the coder to know what their strings are and what they can do with them.

        Even the wizards never seem to agree how to handle Unicode correctly using Perl.

        I think the point of debate is the answer to "What is a Perl string?". More specifically, is a string with UTF8=0 necessarily a string of Unicode characters?

        I think this has officially been resolved as follows:

        No. Perl strings are strings of 64-bit integers (or 32-bit, depending on your build). The meaning of each of those integers (characters) is not limited by Perl, so there is no restriction for them to be limited to Unicode characters. The meaning of the characters of the string is left to individual functions. For example, lc will consider its argument a string of Unicode characters.
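        For instance, Perl will happily store a "character" well past the last Unicode code point (U+10FFFF); only individual functions assign meaning to it. This sketch assumes a perl new enough to have the utf8-subcategory warning name non_unicode:

```perl
use strict;
use warnings;
no warnings 'non_unicode';    # silence "Code point ... is not Unicode" for the demo

my $s = chr(0x20_0000);       # far beyond U+10FFFF, yet a perfectly good Perl string
print ord($s), "\n";          # 2097152
print length($s), "\n";       # 1
```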

        I think much of the confusion stems from 5.6's failed attempt to support Unicode.

        I'm not saying there's no room for improvement.

        One area in which I would like to see improvement is the documentation of functions; they are often unclear as to the format in which strings are expected or returned (e.g. encoded text vs. Unicode text).

        An improvement in warnings would be even better. There's a ticket concerning this that I would love to see come to fruition. It basically adds semantic flags to strings, something like:

        • Unknown
        • String of bytes
        • String of Unicode characters
        • String of text encoded as per locale

        This would allow automatic conversion in some instances, and warning of surely incorrect concatenations in others.

        I can't find the ticket at the moment.

      Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.

      That's not what the DBI documentation says. It says, "Most data is returned to the Perl script as strings. … Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8."

        That's not what the DBI documentation says. It says, "Most data is returned to the Perl script as strings.

        Strings containing encoded text are still strings, so the second sentence does not back up the claim made in the first.

        Also, keep in mind that databases are just one data source.

        Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding).

        You are correct that the builtins that expect Unicode (or iso-8859-1 or US-ASCII) strings would mishandle them. That's why decoding is needed.

        That said, there are lots of errors in that passage. I covered them below because it's off-topic.

        Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used.

        It's impossible for the DBDs to determine if conversion is required automatically. They would need to be told, but there's no way to tell them. They guess by creating an instance of the Unicode bug.

        Sometimes they leave the string as is (assuming it's already been encoded for the database), sometimes they convert it to utf8 (when it obviously wasn't encoded for the database).

        I believe I tested DBD::Pg, DBD::mysql and DBD::SQLite.
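        The usual way to avoid that guessing is to tell the driver explicitly. A sketch only (not runnable without a live server); the attribute names are driver-specific: mysql_enable_utf8 for DBD::mysql (newer releases also offer mysql_enable_utf8mb4), pg_enable_utf8 for DBD::Pg:

```perl
use DBI;

# Tell the driver how to handle text, rather than letting it guess:
my $dbh = DBI->connect(
    'dbi:mysql:database=app', 'user', 'secret',
    { RaiseError => 1, mysql_enable_utf8 => 1 },
);
```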


        This part is off-topic.

Re^4: Mugged by UTF8, this CANNOT be right
by FalseVinylShrub (Chaplain) on Jan 26, 2011 at 20:19 UTC

    Hi

    What version of Perl are you using?

    FalseVinylShrub

    Disclaimer: Please review and test code, and use at your own risk... If I answer a question, I would like to hear if and how you solved your problem.

      Perl 5.10.0
      mod_perl 2.0.4
      Apache 2.2.11
      MySQL 5.0.75
      DBD::mysql 4.18 (just compiled a couple months ago)

      Tosh