- Make sure that mysql is configured to use UTF-8 for everything
- Make sure the table is stored as UTF-8
- Use a recent DBD::mysql.
- Execute set names 'utf8'
- pass decoded strings to execute()
- ...
- profit
| [reply] [d/l] [select] |
| [reply] |
$ echo Kärnten
Kärnten
$ echo Kärnten | hexdump -C
00000000 4b c3 a4 72 6e 74 65 6e 0a |K..rnten.
+|
00000009
If it looks any different, that's your first problem.
Second step, try mysqldump $your_table | grep 'CHARSET=' - does that say UTF8 as well?
Third step, try the insertion and then retrieval with mysql command line client - does that work?
Fourth step, tell me the version of your DBD::mysql module. | [reply] [d/l] [select] |
I can't test on your exact setup, but the versions that I have probably don't not differ from yours in any relevant detail:
mysql -- 5.0.51b
DBI -- 1.52
DBD::mysql -- 4.007
Now, I'll confess that I don't have what qualifies as a deep understanding of how this all works, but I can at least report some test results, which include at least one test condition that appears to work.
But first, I'd like to point out that using accented characters in literal (quoted) strings in your perl script could be one source of trouble -- if you're going to do that, it's good to check it by printing the string to a utf8-mode file handle, or just looking at it with Data::Dumper, to make sure your text editor is saving it (and perl is seeing it) as utf8 data.
That said, here is a little test script that works for me in the setup described above. This assumes that your mysql server still has a "test" database where you can create tables and do stuff as an "anonymous" user. The comments with "fn. #" refer to footnotes that follow. Try this as posted, and if the output ends with "that's bad", then something is probably wrong with your DBI and/or DBD::mysql installation.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Encode;
binmode STDOUT, ":utf8";
#my $DSN = "DBI:mysql:database=test;host=localhost;mysql_enable_utf8=1
+"; # fn.1
my $DSN = "DBI:mysql:database=test;host=localhost";
my $dbh = DBI->connect( $DSN, "", "" );
my @cols = qw/country city pronunciation/;
my $create_table = "create table test_utf8 (".
join( ',', map { "$_ varchar(50)" } @cols ) . ")";
$dbh->do( "drop table if exists test_utf8" );
$dbh->do( "$create_table default character set=utf8" );
$dbh->do( "set names utf8" ); # fn.2
my @src_data = ( "Germany", "K\xf6ln", "k\x{0153}ln" );
print "Sending to db: @src_data\n";
#my @tst_data = @src_data; # fn.3
my @tst_data = map { encode( 'utf8', $_ ) } @src_data;
my $sth = $dbh->prepare( "insert into test_utf8 (".
join( ',', @cols ). ") values (?,?,?)" );
$sth->execute( @tst_data );
my ( @db_data ) = $dbh->selectrow_array( "select ".join( ',', @cols ).
" from test_utf8" );
$_ = decode( 'utf8', $_ ) for ( @db_data ); # fn.4
print "Fetch from db: @db_data\n";
my $diff = 0;
for ( 0 .. $#db_data ) {
$diff++ if ( $db_data[$_] ne $src_data[$_] );
}
print "So that's bad\n" if $diff;
Footnotes:
- With recent versions of DBD::mysql, you can try configuring the connection with "enable_utf8" turned on, but this is reported to be "experimental", and in any case seems to have no bearing on the behavior for the present test.
- The "set names utf8" is a more traditional method for making sure that mysql understands that you want to use utf8 data for tables, but again, it seems to be somewhat peripheral (not mandatory) for the case at hand. (But it does have an impact on internal storage: without this, putting wide characters into utf8 fields will cause any two-byte utf8 character to take up four bytes in the mysql table file -- not good.)
- There's a difference in behavior between "Latin1" characters (\x80-\xFF) vs. higher-order unicode characters (\x100 and above); if you uncomment the "fn.3" line and comment out the line after it, you'll see the difference.
- This decoding of stuff returned from mysql is necessary, regardless of "set names utf8" or "enable_utf8" or table/column definitions or whatever.
You can play with variations as you like, but the particular arrangement shown should work as-is, and should be a good-enough template to work from. Good luck.
UPDATE: I just noticed that some of this same ground was covered in the first thread you ever posted here at the Monastery: utf8 characters in tr/// or s///. Not only did you forget what we told you back then, but I also forgot what ikegami taught me: you can use utf8::decode( $string ) instead of the slightly more cumbersome $string = Encode::decode( 'utf8', $string ) -- but the latter offers a handy third parameter (for trapping bad data), and is not considered "experimental". | [reply] [d/l] [select] |
Thanks for that Graff.
I gave that a run and it worked fine. But I think that's because the UTF data is written into the script. It was the end of along day when I wrote that last night so I wasn't really explaining it too well.
If I actually hard code the phrase into the script:
$blah->execute($country, 'Kärnten');
the execute works fine. I only get the problem with a variable containing UTF8.
Now, my understanding is that with mysql expecting UTF8 (just like it's happy with in every other script) I should be able to just make sure $province is utf8:
if (!utf8::is_utf8($province)) {
utf8::decode($province);
}
I've been trying a few more things this morning. I'm printing output to a browser (so I know exactly which character set everything is in). When the browser is in UTF-8 mode, $province displays correctly, but utf8::is_utf8($province) is false. WTF? I am 100% sure $province is UTF8. It was taken from the UTF8 database as UTF8. It displays in a UTF8 browser as UTF8. but utf8::is_utf8() says it's not.
If, however, I fudge the variable so that it contains a character which does not exist in 8859-1, everything starts working normally again.
'Kärntenš' with the funny 's' character on the end reports as UTF8 from is_utf8, displays as utf8 and goes into the database without being truncated, with no other changes to the code.
Can anybody enlighten me as to what this means?
This is driving me insane and I'm going round and round in circles. I thought I'd understood all this perl UTF8 stuff because everything seemed to be working fine, and it still does if I prepare() the variables in, but just not when they're sent to DBI::execute() as arguments. ARGH!
I'm going to take a break for the sake of my sanity :-) | [reply] [d/l] |
Woohoo! I think I've found it! I'm still testing a few things but so far it seems to be working.
I noticed that the variable containing the UTF chars was decoded using utf8::decode() without first checking it with utf8::is_utf8(). It turns out that it was already decode()ed, so I was in fact double decode()ing. This seems to just remove the utf8 attribute and somehow put it in a very wierd sometimes UTF8/sometimes not UTF8 state.
Anyway, I've corrected this now and it seems to be working.
What *should* happen to a double decode()ed string? I would have thought that nothing would happen, but clearly that's not the case. I guess this is just a usage error.
Any ideas why a double decode()ed string works fine where prepare()ed, but not when sent as an execute() argument?
Cheers guys. I really appreciate your masterful help!
MattLG
| [reply] |
There is convenience function Encode::decode_utf8
| [reply] |