UTF8 between versions

edgreenberg has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I moved some code from one server to another. The versions of perl, dbi and dbd-mysql changed.

old:

perl.x86_64 4:5.8.8-43.el5_11
perl-DBD-MySQL.x86_64 3.0007-2.el5
perl-DBI.x86_64 1.52-2.el5

new:

perl.x86_64 4:5.10.1-141.el6
perl-DBD-MySQL.x86_64 4.013-3.el6
perl-DBI.x86_64 1.609-4.el6

All the perl-Unicode modules are installed.

Data passes over the wire from mysql with unicode in it. For instance, my e-acute comes over the wire as two bytes, 0xC3 0xA9.

On the old server, it is printed as two bytes. On the new server, it is printed as one byte: E9. On the new server, perl's length function counts the e-acute as a single character, so the character count it reports for the string is correct.

Printing the line with Encode::encode_utf8 causes it to come out correctly, but this requires combing through a large amount of code to add the encode_utf8 call wherever output is printed.
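
The symptoms above are consistent with the new driver returning decoded character strings where the old one returned raw bytes. The following is a minimal sketch of that difference, using only core modules; the variable names and the hex_bytes helper are made up for illustration, and the database is simulated with a literal string rather than queried.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# The two bytes seen on the wire for e-acute.
my $wire = "\xC3\xA9";

# Assumption: this is roughly what the new server's driver hands back,
# a decoded string holding the single character U+00E9.
my $chars = decode('UTF-8', $wire);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

printf "wire length:    %d\n", length $wire;    # 2 (bytes)
printf "decoded length: %d\n", length $chars;   # 1 (character)
printf "code point:     %02X\n", ord $chars;    # E9
printf "re-encoded:     %s\n", hex_bytes(encode_utf8($chars));  # C3 A9
```

A handle with no encoding layer writes a character below 0x100 as a single byte, which would explain the lone E9 in the output, while encode_utf8 turns it back into the two bytes C3 A9.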

I tried adding mysql_enable_utf8 => 1 to the collection of options in the DBI->connect line, with no luck.

How can I avoid combing the code for places to add the encode_utf8 function?

Thanks,

Ed Greenberg

Re: UTF8 between versions by graff (Chancellor) on Sep 29, 2015 at 03:05 UTC
You said: "On the new server, it is printed as one byte: E9." If that is really true (a perl string value representing e-acute, when printed, really is just one byte), I would expect that there is something else on the new server (not directly involved with DBI or DBD, and maybe only tangentially related to Perl) that is converting data from utf8 to some non-Unicode encoding (typically cp1252 or iso-8859-1). I don't know what it could be (maybe a locale setting or other detail affecting your shell environment?), but some other tests of unicode/utf8-related actions on old vs. new systems (assuming the old system is still accessible/usable) could help diagnose where the difference(s) may be.

As for "combing a multitude of code" to handle the encoding difference properly (assuming that the fix can only be the one you've discovered so far, using Encode::encode_utf8 where necessary), how many times (in how many places) do you have: `use DBI;`? One approach would be to change that to something like: `use MyDBI;`, and compose your module to inherit most of the functionality of DBI, but only replace the functions you use that actually return strings from the database. Your module would start with: `package MyDBI; @ISA = qw(DBI::db); use DBI;`

Ideally, there would be some other, config-level change, like the `mysql_enable_utf8` setting that you've already tried, that would make this approach unnecessary. But writing your own 'wrapper' module for DBI is a sensible and workable fallback.

BTW, if you can manage it, it may help to show us relevant bits of your code, so we see how you've done the DBI connect and a given query, plus snippets of actual data (preferably as Data::Dumper style output).

One other issue you didn't mention: are your old and new systems both talking to the same MySQL server, or was there an upgrade of that as well?
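
One possible change of the "set it once" kind, offered only as a sketch and not something confirmed in this thread, is to put an encoding layer on the output handle so that every print is encoded without touching the individual print statements. It assumes the strings coming back from the database are already decoded characters and that the output goes through ordinary filehandles. The in-memory handle and the variable names are only there so the written bytes can be inspected.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the driver returns decoded character strings; simulated
# here with a literal instead of a real query.
my $from_db = decode('UTF-8', "\xC3\xA9");

# In a real script the one-time change would be a line like this
# near the top, applied to whichever handle carries the output:
binmode(STDOUT, ':encoding(UTF-8)');

# For the demonstration, write through an in-memory handle that has
# the same layer, so the resulting bytes can be examined.
my $captured = '';
open(my $out, '>:encoding(UTF-8)', \$captured) or die "open failed: $!";
print {$out} $from_db;
close($out);

my $hex = join ' ', map { sprintf '%02X', ord } split //, $captured;
print "bytes written: $hex\n";   # C3 A9
```

If some strings in the program are still undecoded UTF-8 bytes, this layer would encode them a second time, so it is worth checking a few values on both servers before relying on it.
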
Re^2: UTF8 between versions (Peek) by tye (Sage) on Sep 29, 2015 at 04:11 UTC
"(preferably as Data::Dumper style output)"

When dealing with UTF-8 problems, much better to provide Devel::Peek output.

- tye
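
As a rough illustration of what Devel::Peek adds over Data::Dumper, the sketch below dumps a raw byte string and its decoded counterpart. The strings are literals standing in for values fetched from the database, which is an assumption for the sake of a self-contained example.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
use Encode qw(decode);

# Stand-ins for a value fetched on the old and the new server.
my $bytes = "\xC3\xA9";                # undecoded UTF-8 bytes
my $chars = decode('UTF-8', $bytes);   # decoded character string

# Dump writes to STDERR. Compare the FLAGS and PV lines of the two
# dumps: the decoded string should carry the UTF8 flag.
Dump($bytes);
Dump($chars);
```

Running the same dump on a fetched value on each server should show whether the driver is returning bytes or characters.
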
Re: UTF8 between versions by Anonymous Monk on Sep 28, 2015 at 23:27 UTC
Here is a generic idea :) Have you seen "A UTF8 round trip with MySQL"? When you're having trouble, a simple round trip test case will help you figure out what's going wrong (connect, create table, put some unicode in table, retrieve some unicode).