stepamil has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,

I have a problems with DBIx/DBI and UTF8 chars. I use 0.08195 version of DBIx and the problem is that on my dev machine I get correctly encoded chars like "\x{6211}\x{4eec}\x{4e0e}", but on my staging machine I get "???" instead. DB, table, code - everything the same.

Final Update: I've went through the thread once again and checked DBD::mysql. On dev was 4.018 version, but on stg it was 3.x. After the update everything worked perfectly. I just cannot belive that I somehow (while checking) missed to notice this. Thanks a lot for your help. It has been a tricky one for me, and, apart from learning the cause to this, I learned extra on utf8 difficulties.

Update 8.8.2011. 15:57 - This might not be Perl/DBI/DBIx problem. Still investigating.

Check below:

Table:

CREATE TABLE `translation` ( `id` int(10) NOT NULL auto_increment, `section_id` int(11) NOT NULL, `language_id` char(4) NOT NULL, `translator_id` int(11) default NULL COMMENT 'Translator to blame fo +r this translation', `version` int(10) unsigned default NULL, `value` text, `validation_count` tinyint(3) unsigned default '0', PRIMARY KEY (`id`), UNIQUE KEY `natural_key` (`section_id`,`language_id`,`version`), KEY `index_language` (`language_id`), KEY `index_section` (`section_id`), KEY `index_version` (`version`), KEY `index_translator` (`translator_id`), FULLTEXT KEY `fulltext_value` (`value`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC

DBIx connection with utf8 set:

package Models; use strict; use warnings; use Data::Dumper; use System::Config; use base 'DBIx::Class::Schema'; my @connection_args = ( $c->{db_dsn}, $c->{db_username}, $c->{db_password}, { loader_options => { debug => 1, use_namespaces => 1, mysql_auto_reconnect => 0, }, mysql_enable_utf8 => 1, } ); __PACKAGE__->connection(@connection_args); __PACKAGE__->load_namespaces;

Model

package Models::Result::Translation; use strict; use warnings; use base 'DBIx::Class::Core'; __PACKAGE__->table("translation"); __PACKAGE__->add_columns( "id", { data_type => "integer", extra => { unsigned => 1 }, is_nullable => + 0 }, "section_id", { data_type => "integer", extra => { unsigned => 1 }, is_nullable => + 0 }, "language_id", { data_type => "char", is_nullable => 0, size => 4, }, "translator_id", { data_type => "integer", extra => { unsigned => 1 }, is_nullable => + 0 }, "version", { data_type => "integer", extra => { unsigned => 1 }, is_nullable => + 0 }, "value", { data_type => "text", is_nullable => 0 }, "validation_count", { data_type => "integer", extra => { unsigned => 1 }, is_nullable => + 0 }, ); __PACKAGE__->set_primary_key("id"); 1;

DBIx in script:

#!/usr/bin/perl -w use Data::Dumper; use strict; use Models; print STDERR Dumper [ map{ { $_->get_columns } } Models->resultset('Tr +anslation')->search({ language_id => 'zhCN'})->all() ];

Correct output on my dev mechine:

[... { 'translator_id' => '99', 'section_id' => '523', 'version' => '1', 'value' => "\x{6211}\x{4eec}\x{4e0e}\x{7f51}\x{7edc}\x{4e0 +a}\x{6700}\x{70ed}\x{95e8}\x{7684}\x{54c1}\x{724c}\x{5408}\x{4f5c}\x{ +ff01}\x{8bf7}\x{6d4f}\x{89c8}\x{60a8}\x{6700}\x{5fc3}\x{4eea}\x{7684} +\x{5355}\x{54c1}\x{ff0c}\x{8bf7}\x{65f6}\x{5e38}\x{56de}\x{8bbf}\x{62 +11}\x{4eec}\x{7684}\x{7f51}\x{7ad9}\x{ff0c}\x{4e86}\x{89e3}\x{6700}\x +{65b0}\x{4e0a}\x{67b6}\x{548c}\x{66f4}\x{65b0}\x{7684}\x{5546}\x{54c1 +}\x{3002} ", 'validation_count' => '0', 'id' => '31373', 'language_id' => 'zhCN' }, { 'translator_id' => '99', 'section_id' => '244', 'version' => '1', 'value' => "\x{7f16}\x{53f7}", 'validation_count' => '0', 'id' => '31365', 'language_id' => 'zhCN' },...]

Incorrect output on my staging machine:

[... { 'translator_id' => '99', 'section_id' => '1283', 'version' => '1', 'value' => '?????????????????????????????????????????????? +???????????????????????????????????????????? ', 'validation_count' => '0', 'id' => '38392', 'language_id' => 'zhCN' }, { 'translator_id' => '99', 'section_id' => '1274', 'version' => '1', 'value' => '?????????????????????????????????????????????? +?????????????????????????????????????', 'validation_count' => '0', 'id' => '38394', 'language_id' => 'zhCN' }, ...]

Please help, Stepamil

Replies are listed 'Best First'.
Re: DBIx returns question marks
by moritz (Cardinal) on Aug 07, 2011 at 10:03 UTC

      I've just tried connecting staging machine to dev DB and in didn't work. "?????" all over the place.

        You never explicitly encode your output when you print it, so my guess is that you either have different locales on the machines, or use different terminal settings between the two machines. You can check whether the output of your program is good by hexdumping the output instead of printing raw bytes:

        perl -w myprogram | od -x

        If the output is the same between the two environments, then the problem is in your terminal settings, installed fonts or something like that. If the output is different, one problem happens somewhere earlier, maybe when writing to the database, reading from the database or outputting the data from your program.

      On both machines DBD::mysql is 4.018 and as for locals both have the pretty much the same, but the dev machine has DB locally and DB for staging is on another DB dedicated machine. Hm, I'll try to play with using different DBs just to see what will come of it.

Re: DBIx returns question marks
by tinita (Parson) on Aug 07, 2011 at 15:34 UTC
    In addition to the other suggestions you might want to try:
    use Devel::Peek; Dump $object->value;

    You'll see the exact content and if the variable has the utf8 flag set or not.
    Also, how do you check the database content? Is mysqldump showing the same for both tables?

      Hi All,

      thanks for excellent ideas I'll start working on them right now (in a slow sunday night way). Meanwhile, I've discovered that when I do this on dev machine:

      show variables like "%character%"; show variables like "%collation%";

      I get this:

      character_set_client utf8 character_set_connection utf8 character_set_database utf8 character_set_filesystem binary character_set_results utf8 character_set_server utf8 character_set_system utf8 character_sets_dir /usr/share/mysql/charsets/ ---------- collation_connection utf8_general_ci collation_database utf8_general_ci collation_server utf8_general_ci

      And on my staging machine is this:

      character_set_client utf8 character_set_connection utf8 character_set_database latin1 character_set_filesystem binary character_set_results utf8 character_set_server utf8 character_set_system utf8 character_sets_dir /usr/share/mysql/charsets/ ------------------ collation_connection utf8_general_ci collation_database latin1_swedish_ci collation_server utf8_general_ci

      That cannot be good. I'll play during the night and post my results in the morning.

      Thanks a lot, Stepamil

        Hi, me again,

        Thanks Corion for giving me this list. This one's a keeper :) I went through it and here's what I have:

        I've connected staging machine to use dev DB so both scripts are using the same DB. I've also removed DBIx from equation because I have the same behaviour on DBI, so I'll change the thread title shortly.

        Corion: Check the data format in your script as it writes to its output. Is it properly encoded from Unicode to the target encoding? Compare the hexdumps.

        Me: Dev -> var is UTF8, endoded:\x{7f16}\x{53f7} hex:e7bc96e58fb7, Stg -> var is not UTF8, var:??, hex:3f3f

        Corion: Check the data format in your script as it reads the data from the database. Is it utf8 everywhere? Compare the hexdumps.

        Me: No, the ?? variable is not UTF8 when I get it from staging machine

        Corion: Check the data format in your database in all tables. Are the tables/columns declared to be utf8/Unicode everywhere? Compare the hexdumps.

        Me: I'm using the same database, but different machine. Dev is using local DB and staging is using dev's DB. So that should be covered.

        Corion: Check the data format when it is written from your script to the database. Do they write utf8/Unicode everywhere? Compare the hexdumps.

        Me: Same as previous answer - same DB, different machine

        Corion: Check the data format from where your script gets its data. Is it properly decoded from the source to utf8 everywhere? Compare the hexdumps.

        Me: Also covered by previous answer

        Well, fighting with this continues. My more experienced colleague is coming from holiday this afternoon. He'll be fresh pair of eyes on this issue.

        Stepamil