Mugged by UTF8, this CANNOT be right

tosh has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Mugged by UTF8, this CANNOT be right by zentara (Cardinal) on Jan 26, 2011 at 19:47 UTC
Try reading "A UTF8 round trip with MySQL".
I'm not really a human, but I play one on earth. Old Perl Programmer Haiku ................... flash japh
Re^2: Mugged by UTF8, this CANNOT be right by Jim (Curate) on Jan 27, 2011 at 01:01 UTC
This thread is yet another one of countless similar PerlMonks threads about Perl's Unicode support that immediately devolves into a discussion (debate) about how it all works, how it doesn't work, what the bugs are, what worked for him when he tried foo, what worked for her when she tried bar, and so on and so forth. I didn't study the thread, but I read enough of it to make my head explode. I haven't yet found any explanation of Perl's Unicode model I can understand and use to write Perl programs that handle Unicode text consistently and reliably in any of the Perl documentation or in any discussion threads or tutorials on PerlMonks.
Re^3: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 27, 2011 at 03:28 UTC
The reason is that the DBDs are buggy, but this wasn't mentioned clearly, causing confusion.
"I haven't yet found any explanation of Perl's Unicode model I can understand and use to write Perl programs that handle Unicode text consistently and reliably in any of the Perl documentation or in any discussion threads or tutorials on PerlMonks."
Decode inputs. Encode outputs. This will leave you with only the bugs, and one can usually work around them using utf8::upgrade or utf8::downgrade. The catch is that there are LOTS of inputs and outputs to a program. It would be nice to be warned when one is missed. This would require languages to have different types for encoded and decoded strings. The flags I mentioned earlier would achieve this.
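A minimal sketch of the "decode inputs, encode outputs" rule, assuming the input arrives as UTF-8 bytes. The sample string and the variable names are made up for illustration; only the Encode functions and utf8::upgrade come from core Perl.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file, socket, or database handle:
# "café" encoded as UTF-8 (the e-acute takes two bytes).
my $input_bytes = "caf\xC3\xA9";

# Decode inputs: bytes become characters. FB_CROAK dies on malformed input;
# LEAVE_SRC stops decode() from modifying $input_bytes in place.
my $text = decode('UTF-8', $input_bytes,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Inside the program, work on characters only.
my $upper = uc $text;    # the e-acute is upper-cased as one character

# Encode outputs: characters become bytes again, right before they leave.
my $output_bytes = encode('UTF-8', $upper,
                          Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "%d bytes in, %d characters inside, %d bytes out\n",
    length $input_bytes, length $text, length $output_bytes;

# utf8::upgrade changes only the internal storage of a string, not its value.
# That is why it can work around code that looks at the storage format.
my $latin1   = "\xE9";
my $upgraded = $latin1;
utf8::upgrade($upgraded);
print "same string after upgrade\n" if $upgraded eq $latin1;
```

The point of the sketch is the boundary: every string is either "bytes from outside" or "characters inside", and the conversion happens exactly once in each direction.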
Re^4: Mugged by UTF8, this CANNOT be right by Jim (Curate) on Jan 27, 2011 at 07:38 UTC
Re^5: Mugged by UTF8, this CANNOT be right by mje (Curate) on Jan 27, 2011 at 10:30 UTC
Re^5: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 27, 2011 at 08:44 UTC
Re: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 26, 2011 at 18:13 UTC
Encode inputs? Inputs should be decoded (automatically or otherwise). Speaking of being new to "accents"... "Are accents a relatively new thing for PERL programmers?" (The language is "Perl".) They are relatively new to Perl itself. Perl has been around since long before Unicode. Lots of code predates Unicode support, so there are backwards compatibility issues. I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?
Re^2: Mugged by UTF8, this CANNOT be right by FalseVinylShrub (Chaplain) on Jan 26, 2011 at 19:30 UTC
Hi, ikegami said:
"I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?"
I suspect the mysql_enable_utf8 is derived from pg_enable_utf8, which simply sets the UTF8 flag on everything that comes back from the database. There have been proposals to add encoding support at the DBI level, but I've not heard about them being released yet. As for why it's not doing what is desired for the O.P... Perhaps a full, self-contained test program would help?
Regards, FalseVinylShrub
Disclaimer: Please review and test code, and use at your own risk... If I answer a question, I would like to hear if and how you solved your problem.
Re^3: Mugged by UTF8, this CANNOT be right by tosh (Scribe) on Jan 26, 2011 at 19:52 UTC
The problem is that it's a very large application, so breaking out anything self-contained is not possible. What I did notice is that everything was working just fine: my Template Toolkit templates have BOMs, my DB is all UTF8 encoded, my charsets were perfect. Everything worked great, probably because PERL was doing the right thing, but don't forget there are SIX places for UTF8 to get messed up:
1) Template encoding
2) HTTP headers
3) HTML headers
4) DB encoding
5) DB handle
6) The language itself
That's suddenly a lot of room for forgetting one detail that throws everything else off. With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database. Not only does it have to be Encoded, it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels. What I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time. The problem remains that the only way to be certain seems to be to encode/decode all input and output, and that's just not the way things should work. 10% of my programming should not have to be worrying about this issue.
Tosh
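One way a single missed detail among those six places tends to show up is double encoding: a string that is already UTF-8 bytes gets encoded a second time. A small sketch, assuming nothing about the application in question (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars = "\x{E9}";                  # one character: e with acute accent
my $once  = encode('UTF-8', $chars);   # correct: two bytes, C3 A9
my $twice = encode('UTF-8', $once);    # bug: each byte is treated as a
                                       # character and encoded again

printf "encoded once:  %s\n", unpack('H*', $once);    # c3a9
printf "encoded twice: %s\n", unpack('H*', $twice);   # c383c2a9

# A single decode only removes one layer, so the result is still wrong:
# two characters (U+00C3, U+00A9) instead of the original one.
my $still_wrong = decode('UTF-8', $twice);
printf "characters after one decode: %d\n", length $still_wrong;
```

Perl cannot tell from the string alone whether it holds bytes or characters, so the program has to know which one each value is at every step.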
Re^4: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 26, 2011 at 21:01 UTC
Re^5: Mugged by UTF8, this CANNOT be right by tosh (Scribe) on Jan 26, 2011 at 21:26 UTC
Re^5: Mugged by UTF8, this CANNOT be right by Jim (Curate) on Jan 27, 2011 at 00:37 UTC
Re^5: Mugged by UTF8, this CANNOT be right by Jim (Curate) on Jan 27, 2011 at 01:10 UTC
Re^4: Mugged by UTF8, this CANNOT be right by FalseVinylShrub (Chaplain) on Jan 26, 2011 at 20:19 UTC
Re^5: Mugged by UTF8, this CANNOT be right by tosh (Scribe) on Jan 26, 2011 at 20:50 UTC
Re^3: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 26, 2011 at 21:09 UTC
"I suspect the mysql_enable_utf8 is derived from pg_enable_utf8, which simply sets the UTF8 flag on everything that comes back from the database."
Yes, sorry, I got it backwards. I was thinking that `enable_utf8` affected data sent to the DB, but it affects data obtained from the DB. Either way, it's a very incomplete system. Only UTF-8 is supported (right?), and it's broken when it comes to data sent to the DB.
Re^2: Mugged by UTF8, this CANNOT be right by mje (Curate) on Jan 27, 2011 at 10:16 UTC
"I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?"
A number of DBDs decode the data returned from the database, including DBD::ODBC (in a unicode build of it or when instructed to with handle flags), DBD::Oracle, DBD::Pg and DBD::mysql. There may be others.
Re^3: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 27, 2011 at 16:54 UTC
I corrected myself already, but it's not nearly as functional as you make it sound. See the linked post.
Re^4: Mugged by UTF8, this CANNOT be right by mje (Curate) on Jan 27, 2011 at 17:34 UTC
Re^5: Mugged by UTF8, this CANNOT be right by mje (Curate) on Jan 27, 2011 at 17:37 UTC
Re: Mugged by UTF8, this CANNOT be right by nif (Sexton) on Jan 26, 2011 at 21:29 UTC
I just tried "`$sth->fetchall_arrayref`" with Unicode data in a MySQL table. With "`$dbh->{mysql_enable_utf8}=1`" it always returns correctly encoded strings, so there is no need to encode/decode yourself. Test details in `<readmore>` section below... And about "foreign accents": a lot of Monks who give you answers on this site do not speak English at home. Read more... (3 kB)
Re^2: Mugged by UTF8, this CANNOT be right by rowdog (Curate) on Jan 27, 2011 at 03:59 UTC
"- a lot of Monks, who give you answers on this site, do not speak English at home."
Especially the monks who know how to answer questions about Unicode! Edit: ~~that's a bass ackwards way to extract the minutes anyhow, man date shows that `date +'%M'` would have sufficed.~~ Oops, this is the wrong post for that totally irrelevant edit.
Re: Mugged by UTF8, this CANNOT be right by tosh (Scribe) on Jan 28, 2011 at 09:14 UTC
So on the lighter side of things... What do you Monks think would be easier? 1) A simple solution to this issue? or 2) Convince the world to give up their accents and adopt ASCII ;) At least now I don't feel like too much of a moron for not being able to figure this out.
Tosh
Re^2: Mugged by UTF8, this CANNOT be right by Anonyrnous Monk (Hermit) on Jan 28, 2011 at 09:55 UTC
Probably 2), as there is no simple solution to the issue ;)
Re^2: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 28, 2011 at 18:57 UTC
You didn't provide any information on the bug (just your workaround for it). If you want us to help resolve your problem, you'll need to at least describe it to us. For starters, what output do you get and what output do you expect? It's best to avoid posting raw "non-ASCII" characters by using tools that escape them (Data::Dumper with `$Data::Dumper::Useqq=1;`, the `od` command line tool, etc).
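A short sketch of that kind of report, using Data::Dumper with Useqq. The two sample strings are invented for illustration, and `Terse` is an extra setting chosen here only to keep the output compact. The exact escape style can vary between Data::Dumper versions, so no specific output is shown.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # escape non-printable and non-ASCII characters
$Data::Dumper::Terse = 1;   # print the value only, without "$VAR1 = "

# Two strings that look identical on many terminals:
my $decoded = "caf\x{E9}";      # 4 characters, the last one is U+00E9
my $encoded = "caf\xC3\xA9";    # 5 bytes, the UTF-8 encoding of the same text

my $dump_decoded = Dumper($decoded);
my $dump_encoded = Dumper($encoded);

# Both dumps are plain ASCII, so they survive being pasted into a post,
# and they show whether a value holds characters or encoded bytes.
print "decoded: $dump_decoded";
print "encoded: $dump_encoded";
```

Posting both the actual and the expected value in this escaped form removes any doubt about what the terminal, browser, or forum software did to the characters on the way.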
Re^2: Mugged by UTF8, this CANNOT be right by Anonymous Monk on Jan 28, 2011 at 09:57 UTC
I would start with a simple description of the issue. Pretend I've read this thread twice and I'm still confused, then explain it like you're reporting a bug and I'm 2yo :)
Re: Mugged by UTF8, this CANNOT be right by Jim (Curate) on Jan 30, 2011 at 06:47 UTC
Tosh, compare your Perl Unicode rant with mine from a few months ago. (I'd forgotten about this discussion thread. I just stumbled upon it tonight.)
Jim
Re^2: Mugged by UTF8, this CANNOT be right by tosh (Scribe) on Jan 30, 2011 at 23:38 UTC
Ha ha! ++ for using "Chicanery"!