Re^3: Mugged by UTF8, this CANNOT be right
by tosh (Scribe) on Jan 26, 2011 at 19:52 UTC
The problem is that it's a very large application, so breaking out anything self-contained is not possible.
What I did notice is that everything was working just fine: my Template Toolkit templates have BOMs, my DB is all UTF8 encoded, my charsets were perfect.
Everything worked great, probably because Perl was doing the right thing, but don't forget there are SIX places for UTF8 to get messed up:
1) Template encoding
2) HTTP headers
3) HTML headers
4) DB encoding
5) DB handle
6) The language itself
That's suddenly a lot of room for forgetting one detail that throws everything else off.
With a small change to the application, the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database. But not only does it have to be Encoded, it has to be checked FIRST, because if you don't, Encode.pm spews warnings like an 18-year-old after a bottle of Jack Daniels.
And what I believe is happening is that for 90% of the people out there working with UTF8, the "guessing" that Perl does works most of the time. But the problem remains that the only way to be certain is to encode/decode all input and output, and that's just not the way things should work; 10% of my programming time should not go to worrying about this issue.
Tosh
And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time
Perl doesn't guess at encodings, so I don't know to what you are referring.
Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.
Same goes for output. For example, file handles expect bytes unless configured otherwise. It's the writer's responsibility to convert the output into bytes or to configure the file handle to do the conversion for it.
but don't forget there's SIX places for UTF8 to get messed up:
At least everywhere data is serialised or deserialised. I think you missed a couple. A more complete list:
Inputs:
- Source code: Decoding.
- HTML form data(?): Decoding.
- Database: Decoding.
- Template: Decoding.
Outputs:
- Database (queries and parameters(?)): Encoding.
- HTML response: Encoding and inclusion of Content-Type header.
- HTTP response: Inclusion of Content-Type in header.
- Error log: Encoding.
The nice thing is that they are all independent. Fixing a problem with one doesn't require the cooperation of others.
With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database.
No, that wasn't the only way to fix it. Two wrongs made a right, but introduced many other problems. Specifically, it broke length, substr, regular expressions and much more.
$ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 0
1
1
$ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 1
2
0
but it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels.
Good. You're checking for undef, which isn't a string. Encoding something that isn't a string is most definitely an error. I don't know why you mention this.
And there's my point. For web programming there's more checking+encoding+decoding than there is actual programming logic.
Perl is an incredible language, my exclusive language for 15 years now, but when it comes to the global language support that most web programming ultimately requires, perhaps Perl is presently just not suited to the task.
Maybe I'm just bitchy and irritated right now, but I think of a 10-column by 100-row table of data: that's 1000 data points, and each needs to be checked, decoded, and encoded numerous times. It should not be like this; the future of programming is globally connected, and Perl needs to deal with this fact a lot better.
Which leads me to ask: Is there a programming language that easily handles Unicode, either automatically or with a flag that, when set, tells everything else to use Unicode? Or is everyone else writing PHP/Python/ASP/Java/etc. just as frustrated?
Tosh
Perl doesn't guess at encodings, so I don't know to what you are referring.
Perl does guess. Not "at encodings" perhaps, but tosh didn't say Perl guesses at encodings. Those are your words, ikegami, not tosh's.
The guesswork (or whatever you want to call the legerdemain) that Perl does is documented in perlunicode:
The "Unicode bug" involves exactly those characters in the Latin-1 Supplement Block (U+0080 through U+00FF) that tosh said "mugged" him.
The problem is that the explanations and workarounds are incomprehensible by mere mortals who just want to write a script to do something simple with modern text (Unicode). It's way too hard to sort out Perl's impenetrable Unicode model.
Evidence that it's way too hard abounds on PerlMonks. Thread after thread about Perl's Unicode support quickly devolves into a debate among the cognoscenti here about how it all works. Even the wizards never seem to agree on how to handle Unicode correctly using Perl.
Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.
That's not what the DBI documentation says. It says, "Most data is returned to the Perl script as strings. … Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8."
Perl 5.10.0
mod_perl 2.0.4
Apache 2.2.11
MySQL 5.0.75
DBD::mysql 4.18 (just compiled a couple months ago)
Tosh
Re^3: Mugged by UTF8, this CANNOT be right
by ikegami (Patriarch) on Jan 26, 2011 at 21:09 UTC
I suspect the mysql_enable_utf8 is derived from pg_enable_utf8, which simply sets the UTF8 flag on everything that comes back from the database.
Yes, sorry, I got it backwards. I was thinking the enable_utf8 attribute affected data sent to the DB, but it affects data obtained from the DB.
Either way, it's a very incomplete system. Only UTF-8 is supported (right?), and it's broken when it comes to data sent to the DB.