
Handle UTF-8 with DBI

by fraktalisman (Hermit)
on Jan 03, 2006 at 16:03 UTC (#520649=perlquestion)

fraktalisman has asked for the wisdom of the Perl Monks concerning the following question:

Although Perl itself seems to handle UTF-8 correctly, this is not the case when reading data through DBI with DBD::mysql. In discussions on CPAN and elsewhere, I read that the solution is to set the utf8 flag on the database data by calling Encode::decode_utf8.

I still have a problem with Encode::decode_utf8. If the string to be decoded is already UTF-8, Encode::decode_utf8 returns an empty string! There seems to be no way to check whether decoding is necessary, because in both cases (decoding successful or empty string) the string to be decoded was marked as "not utf" when I tested it with Encode::is_utf8.

Is there a good way to avoid this problem?

And what will happen if a future version of DBI and DBD already handles the SQL data correctly? Will the decoding still work, or will we have to rewrite all scripts then?

Replies are listed 'Best First'.
Re: Handle UTF-8 with DBI
by borisz (Canon) on Jan 03, 2006 at 17:36 UTC
    You can use this:
    $utf8 = Encode::decode(utf8 => $utf8) unless Encode::is_utf8($utf8);
    or, if you just lose the utf8 flag and you know it is utf8, just do
    or use DBD::Pg, which already handles utf8 data for you.
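
    The guarded decode above can be wrapped in a small helper so that calling it twice is harmless. This is only a minimal sketch: the name ensure_decoded and the sample string are made up for illustration, and Encode::is_utf8 only reports Perl's internal flag, so the guard is a heuristic rather than a validity check.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the thread): decode a value fetched from
# the database only if Perl has not already flagged it as characters.
sub ensure_decoded {
    my ($value) = @_;
    return $value if !defined $value;
    return $value if Encode::is_utf8($value);    # already a character string
    return Encode::decode('UTF-8', $value);      # raw bytes -> characters
}

my $bytes = "Gr\xC3\xBC\xC3\x9Fe";        # sample data: raw UTF-8 bytes
my $text  = ensure_decoded($bytes);
my $again = ensure_decoded($text);         # second call leaves it untouched

printf "%d bytes in, %d characters out\n", length($bytes), length($text);
```

    If a later driver version starts returning flagged strings by itself, the guard should simply skip the decode step.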
Re: Handle UTF-8 with DBI
by idsfa (Vicar) on Jan 03, 2006 at 17:25 UTC

    Perhaps you want Class::DBI::utf8 (which attempts to Do The Right Thing)? Even if you do not need the Class::DBI structure, you should be able to find what you need in the source. Extracting wildly:

    if (defined $string) {
        utf8::upgrade($string);
        Encode::_utf8_on($string);
        Encode::_utf8_off($string) if (!utf8::valid($string));
    }

    The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. — Cyrus H. Gordon
Re: Handle UTF-8 with DBI
by ioannis (Abbot) on Jan 03, 2006 at 18:34 UTC
    For some DBD drivers, utf8 can be enabled from the driver itself. For example, DBD::Pg has the internal function pg_enable_utf8( boolean) to enable the utf8 flag for strings. (According to the Pg manual, this also requires perl 5.8 and later).
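
    As far as I know, DBD::Pg exposes pg_enable_utf8 as a database handle attribute that can be passed to connect. A rough sketch follows; the helper name connect_pg and the DSN, user, and password in the commented call are placeholders, not values from this thread.

```perl
use strict;
use warnings;

# Connection attributes; pg_enable_utf8 asks DBD::Pg to flag returned
# strings as UTF-8. RaiseError and AutoCommit are ordinary DBI attributes.
my %attr = (RaiseError => 1, AutoCommit => 1, pg_enable_utf8 => 1);

# Hypothetical helper: DSN, user and password are supplied by the caller.
sub connect_pg {
    my ($dsn, $user, $password) = @_;
    require DBI;    # loaded lazily so the sketch compiles without a database
    return DBI->connect($dsn, $user, $password, { %attr });
}

# Placeholder call, commented out because it needs a running server:
# my $dbh = connect_pg('dbi:Pg:dbname=example', 'user', 'secret');
```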
Re: Handle UTF-8 with DBI
by valdez (Monsignor) on Jan 03, 2006 at 20:44 UTC

    If your database was created specifying utf8 as charset, everything you get from it will be valid utf8 but not marked as utf8, so you only need to switch the utf8 flag on. nothingmuch posted a link to an interesting patch for DBD::mysql at node UTF8 vs SQLite.

    HTH, Valerio

Re: Handle UTF-8 with DBI
by Errto (Vicar) on Jan 04, 2006 at 15:28 UTC

    IIRC I had a similar problem which was fixed by upgrading to the latest version of DBD::mysql so you might want to make sure your provider is running that.

    The empty string is coming because the default behavior of Encode is to silently drop any byte sequence it is not able to decode from the byte stream using the given character encoding (UTF-8 in this case). If you want you can instead have it die and give you the exact byte sequence it's trying to decode using the FB_CROAK option (see the Encode docs for more). If none of this works try taking a hex dump of the returned string using unpack and make sure it is correct UTF-8.
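
    A minimal sketch of the FB_CROAK and hex dump suggestions, assuming the data arrives as a plain byte string. The helper names hex_dump and strict_decode and the two sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Return a space-separated hex dump of the raw bytes in a string.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', unpack('(H2)*', $bytes);
}

# Decode strictly: die on malformed input instead of hiding the problem.
sub strict_decode {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode may modify its argument when CHECK is set
    return Encode::decode('UTF-8', $copy, Encode::FB_CROAK());
}

my $good = "caf\xC3\xA9";    # valid UTF-8 bytes
my $bad  = "caf\xE9!";       # Latin-1 byte, not valid UTF-8

print hex_dump($good), "\n";
my $text = strict_decode($good);

# eval traps the exception so the script can report it and carry on.
my $ok = eval { strict_decode($bad); 1 };
print "decode failed: $@" unless $ok;
```

    Comparing the dump with the byte sequences you expect for your characters shows whether the data was already damaged before it reached Perl.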

    Last, you should make sure MySQL is using UTF-8 as the encoding for the given database or column.

Re: Handle UTF-8 with DBI
by fraktalisman (Hermit) on Jan 04, 2006 at 15:16 UTC

    Thanks for your advice so far. I am currently testing the solution with utf8::upgrade and Encode::_utf8_on.

    Unfortunately there is at least one popular German webspace provider still running Perl 5.6 without the Encode module. I phoned them this morning and asked them to install 5.8 as soon as possible. I also tried to save the script itself in UTF-8 (at the other provider, which uses 5.8.0), but Apache (1.3.27) complained with an "internal server error" despite use utf8; in my script.

    My original intention was just to use UTF-8 to make my (and my company's) scripts more versatile, but I have been quite disappointed to learn that certain components and institutions seem not to be ready for internationalization yet.

    Anyway, after all the time I already spent testing, I will hopefully end up with a working solution.

Re: Handle UTF-8 with DBI
by randyk (Parson) on Jan 05, 2006 at 03:15 UTC
Re: Handle UTF-8 with DBI (JS problem rather than a perl one)
by fraktalisman (Hermit) on Jan 14, 2006 at 11:07 UTC

    Obviously, the malformed data was no perl problem but a problem with javascript and then the resulting mojibake (malformed data with funny characters) was passed on to perl. With Perl 5.8, doing nothing about the data encoding now seems to work fine! (After long days of testing ...)

    And although JavaScript is supposed to handle UTF-8 data, according to Mozilla's specification, JScript (pseudo JavaScript in Internet Explorer) doesn't. So I try to avoid UTF-8 characters inside JS code now. OTOH, UTF-8 content in HTML forms that are handled by JS seems to be no problem at all.

    Finally, the admins at the-renowned-provider-still-running-old-perl say it would be too laborious for them to upgrade to 5.8, so they will just keep Perl 5.6.1 on their servers for the near future. I fear this means that a lot of my customers will actually have to move their websites to another provider (the one that's still small enough to actually listen to what their customers say).

    Update: Perl 5.6.1 does handle UTF-8 data, as long as it's correct. The only problem that's left would be malformed characters, the rest is working fine now with both perl versions.

Node Type: perlquestion [id://520649]
Approved by phaylon
Front-paged by Aristotle