tosh has asked for the wisdom of the Perl Monks concerning the following question:

So I have been walking home from the bar the same way for the past 15 years, and last night I heard some voices with a strange accent in the dark, definitely some kind of foreigners, I heard äöÜÖëãñÑÕß and who knows what else, but then they certainly spoke English when they demanded I hand over my wallet.

So now I join all of you who have also been mugged by foreign accents and I realize that something MUST BE DONE about this menace!!

Seriously, I used to be able to walk home like this:
return $sth->fetchall_arrayref({ });
But now I have to take a ridiculous detour like this if I want to be safe:
my $rows = $sth->fetchall_arrayref({ });
for my $row (@{$rows}) {
    foreach my $key (keys %{$row}) {
        if ($row->{$key}) {
            $row->{$key} = Encode::encode_utf8( $row->{$key} );
        }
    }
}
return $rows;
This CANNOT be the right way to deal with this problem, we are using PERL, you know, do the right thing?!

Every piece of data that comes and goes CANNOT have to be processed, I just can't accept that this is how things have to be now.

And yes, my MySQL handle uses mysql_enable_utf8 => 1.

How can it be like this? Are accents a relatively new thing for PERL programmers?

Vexed.

Tosh

Re: Mugged by UTF8, this CANNOT be right
by zentara (Cardinal) on Jan 26, 2011 at 19:47 UTC

      This thread is yet another one of countless similar PerlMonks threads about Perl's Unicode support that immediately devolves into a discussion (debate) about how it all works, how it doesn't work, what the bugs are, what worked for him when he tried foo, what worked for her when she tried bar, and so on and so forth.

      I didn't study the thread, but I read enough of it to make my head explode.

      I haven't yet found any explanation of Perl's Unicode model I can understand and use to write Perl programs that handle Unicode text consistently and reliably in any of the Perl documentation or in any discussion threads or tutorials on PerlMonks.

        The reason is that the DBDs are buggy, but this wasn't mentioned clearly, causing confusion.

        I haven't yet found any explanation of Perl's Unicode model I can understand and use to write Perl programs that handle Unicode text consistently and reliably in any of the Perl documentation or in any discussion threads or tutorials on PerlMonks.

        Decode inputs. Encode outputs. This will leave you only the bugs, and one can usually work around them using utf8::upgrade or utf8::downgrade.

        The catch is that there are LOTS of inputs and outputs to a program. It would be nice to be warned when one is missed. This would require languages to have different types for encoded and decoded strings. The flags I mentioned earlier would achieve this.
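The "decode inputs, encode outputs" rule above can be sketched with the core Encode module. This is only an illustration: the sample bytes are made up, and a real program would have many more boundaries (files, sockets, database handles) than the single one shown.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes for "Gruesse" spelled with u-umlaut and
# sharp s, as they might arrive from a source that does not decode for you.
my $input_bytes = "Gr\xC3\xBC\xC3\x9Fe";

# Decode at the boundary: bytes in, characters out.
my $text = decode('UTF-8', $input_bytes);

# Inside the program, work on characters only.
my $char_count = length $text;    # counts characters, not bytes

# Encode at the boundary: characters in, bytes out.
my $output_bytes = encode('UTF-8', $text);

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length $input_bytes, $char_count, length $output_bytes;
```

The point of the sketch is that length() and similar operations give character-level answers only between the decode and the encode.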

Re: Mugged by UTF8, this CANNOT be right
by ikegami (Patriarch) on Jan 26, 2011 at 18:13 UTC

    Encode inputs? Inputs should be decoded (automatically or otherwise). Speaking of being new to "accents"...

    Are accents a relatively new thing for PERL programmers?

    (The language is "Perl".)

    They are relatively new to Perl itself. Perl has been around since long before Unicode. Lots of code predates Unicode support, so there are backwards compatibility issues.

    I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?

      Hi

      ikegami said:

      I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?

      I suspect the mysql_enable_utf8 is derived from pg_enable_utf8, which simply sets the UTF8 flag on everything that comes back from the database.
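To make "the UTF8 flag" concrete, here is a small sketch that uses only Perl's built-in utf8:: functions. It does not claim to show what DBD::mysql or DBD::Pg do internally; the sample bytes are hypothetical and stand in for a value fetched from a database.

```perl
use strict;
use warnings;

# Hypothetical fetched value: the two UTF-8 bytes that encode U+00E4.
my $from_db = "\xC3\xA4";

# As a plain byte string: no UTF8 flag, and length() counts bytes.
my $flag_before = utf8::is_utf8($from_db) ? 1 : 0;
my $len_before  = length $from_db;

# utf8::decode works in place and returns true if the bytes were valid UTF-8.
my $decoded = $from_db;
my $ok      = utf8::decode($decoded);

# As a character string: flag set, and length() counts characters.
my $flag_after = utf8::is_utf8($decoded) ? 1 : 0;
my $len_after  = length $decoded;

print "before: flag=$flag_before length=$len_before\n";
print "after:  flag=$flag_after length=$len_after\n";
```

The same bytes are seen as two units before decoding and as one character afterwards, which is why a value that is treated as decoded in one place and as raw bytes in another produces mangled output.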

      There have been proposals to add encoding support at the DBI level, but I've not heard about them being released yet.

      As for why it's not doing what is desired for the O.P... Perhaps a full, self-contained test program would help?

      Regards

      FalseVinylShrub

      Disclaimer: Please review and test code, and use at your own risk... If I answer a question, I would like to hear if and how you solved your problem.

        The problem is that it's a very large application so to break out anything self-contained is not possible.

        What I did notice is that everything was working just fine, my Template Toolkit templates have BOMs, my DB is all UTF8 encoded, my charsets were perfect.

        Everything worked great, probably because PERL was doing the right thing, but don't forget there's SIX places for UTF8 to get messed up:
        1) Template encoding
        2) HTTP headers
        3) HTML headers
        4) DB encoding
        5) DB handle
        6) The language itself

        That's suddenly a lot of room for forgetting one detail that throws everything else off.

        With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database. But not only does it have to be Encoded, but it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels.

        And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time, but the problem remains that it seems that the only way to be certain is to encode/decode all input and output and that's just not the way things should work, 10% of my programming should not have to be worrying about this issue.

        Tosh
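The check-then-encode step described in the post above could be wrapped in one helper, so that the loop is written only once. This is a sketch under assumptions: encode_rows is a hypothetical name (it is not part of DBI), the sample row is invented, and the guess that the warnings come from undefined (NULL) columns is not confirmed in the thread.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: encode every defined value, in place, in the
# array-of-hashes structure that fetchall_arrayref({}) returns.
sub encode_rows {
    my ($rows) = @_;
    for my $row (@$rows) {
        # values() aliases the hash elements, so assigning to $value
        # updates the row itself.
        for my $value (values %$row) {
            $value = encode_utf8($value) if defined $value;
        }
    }
    return $rows;
}

# Invented sample data: one decoded string, one NULL column, one number.
my $rows = [ { name => "J\x{fc}rgen", note => undef, count => 0 } ];
encode_rows($rows);

printf "name is now %d bytes\n", length $rows->[0]{name};
```

Testing with defined rather than truth skips only NULL columns; values such as 0 or the empty string pass through encode_utf8 unchanged.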

        I suspect the mysql_enable_utf8 is derived from pg_enable_utf8, which simply sets the UTF8 flag on everything that comes back from the database.

        Yes, sorry, I got it backwards. I was thinking the enable_utf8 affected data sent to the DB, but it affects data obtained from the DB.

        Either way, it's a very incomplete system. Only UTF-8 is supported (right?), and it's broken when it comes to data sent to the DB.

      I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?

      A number of DBDs decode the data returned from the database, including DBD::ODBC (in a Unicode build of it or when instructed to with handle flags), DBD::Oracle, DBD::Pg and DBD::mysql. There may be others.

        I corrected myself already, but it's not nearly as functional as you make it sound. See the linked post.
Re: Mugged by UTF8, this CANNOT be right
by nif (Sexton) on Jan 26, 2011 at 21:29 UTC

    I just tried "$sth->fetchall_arrayref" with Unicode data in a MySQL table. With "$dbh->{mysql_enable_utf8}=1" it always returns correctly encoded strings, so there is no need to encode/decode yourself.
    Test details in <readmore> section below...

    And about

    "foreign accents"
    - a lot of Monks, who give you answers on this site, do not speak English at home.

      - a lot of Monks, who give you answers on this site, do not speak English at home.

      Especially the monks who know how to answer questions about Unicode!

      Edit: that's a bass ackwards way to extract the minutes anyhow, man date shows that date +'%M' would have sufficed. Oops, this is the wrong post for that totally irrelevant edit.

Re: Mugged by UTF8, this CANNOT be right
by tosh (Scribe) on Jan 28, 2011 at 09:14 UTC
    So on the lighter side of things...

    What do you Monks think would be easier?

    1) A simple solution to this issue?
    or
    2) Convince the world to give up their accents and adopt ASCII ;)

    At least now I don't feel like too much of a moron for not being able to figure this out.

    Tosh

      Probably 2), as there is no simple solution to the issue ;)

      You didn't provide any information on the bug (just your workaround for it). If you want us to help resolve your problem, you'll need to at least describe it to us.

      For starters, what output do you get and what output do you expect? It's best if you avoid "non-ASCII" characters by using tools that escape these (Data::Dumper with $Data::Dumper::Useqq=1;, od command line tool, etc)
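As a sketch of the Data::Dumper suggestion above: with $Data::Dumper::Useqq set, non-ASCII content is printed as escapes, so the report survives any terminal or browser encoding. The two sample strings are hypothetical stand-ins for "what I got" and "what I expected".

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;    # escape non-printable and non-ASCII characters
$Data::Dumper::Terse = 1;    # print the value only, without "$VAR1 = "

# Hypothetical values that would look alike once printed to a UTF-8 terminal.
my $as_bytes = "\xC3\xA4";   # two bytes: the UTF-8 encoding of U+00E4
my $as_chars = "\x{E4}";     # one character: U+00E4

my $dump_bytes = Dumper($as_bytes);
my $dump_chars = Dumper($as_chars);

# Both dumps are plain ASCII, and they differ, which makes the
# bytes-versus-characters distinction visible in a bug report.
print "bytes: $dump_bytes";
print "chars: $dump_chars";
```

Pasting dumps like these for both the actual and the expected value shows at a glance whether the data is double-encoded, undecoded, or correct.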

      I would start with a simple description of the issue. Pretend I've read this thread twice and I'm still confused, then explain it like you're reporting a bug and I'm 2yo :)
Re: Mugged by UTF8, this CANNOT be right
by Jim (Curate) on Jan 30, 2011 at 06:47 UTC

    Tosh,

    Compare your Perl Unicode rant with mine from a few months ago. (I'd forgotten about this discussion thread. I just stumbled upon it tonight.)

    Jim

      Ha ha! ++ for using "Chicanery"!