in reply to DBI and encoding problem

Do you execute this sql statement when you first establish your connection to the database?
$dbh->do( "set names utf8" );
And where do the string values come from that you are using as parameters for the placeholders in your query? Are they flagged in your perl script as utf8 strings (because they come from a ":utf8"-mode file handle or from an Encode::decode("utf8",...) call)?

I'm actually grateful to you for posing this question, because it led me to realize that perl's "special" handling of unicode characters in the \x{80} - \x{ff} range seems to require that you do some "special" handling when passing strings to mysql. Here's a test script:

#!/usr/bin/perl use strict; use warnings; use Encode; use DBI; use utf8; # added by update, lest there be any doubt about this my $DSN = "DBI:mysql:database=test;host=localhost"; my $db = DBI->connect( $DSN, "", "", { AutoCommit => 1, RaiseError => +0 } ) or die $DBI::errstr; my @setup = ( 'SET NAMES utf8', 'DROP TABLE IF EXISTS widechar_test', 'CREATE TABLE widechar_test (string varchar(50)) DEFAULT CHARACTER + SET utf8', ); for my $s ( @setup ) { $db->do( $s ) or die $db->errstr; } my $sth = $db->prepare( "INSERT INTO widechar_test (string) VALUES (?) +" ) or die $db->errstr; my $latin = "\x{e7}\x{e1}"; my $greek = "\x{3a0}\x{397}\x{3a1}\x{39b}"; printf( " char lengths: %d latin, %d greek\n", length( $latin ), length( $greek )); $sth->execute( prep( $latin )) or warn "latin char insert failed: ".$s +th->errstr; $sth->execute( prep( $greek )) or warn "greek char insert failed: ".$s +th->errstr; { use bytes; printf( " byte lengths: %d latin, %d greek\n", length( $latin ), length( $greek )); $sth->execute( prep( $latin )) or warn "latin byte insert failed: +".$sth->errstr; $sth->execute( prep( $greek )) or warn "greek byte insert failed: +".$sth->errstr; } $sth->finish; binmode STDOUT, ":utf8"; my $rows = $db->selectall_arrayref( "select * from widechar_test" ); for my $row ( @$rows ) { print deprep( $$row[0] ); } sub prep { return encode( "utf8", $_[0] ); # return $_[0]; } sub deprep { return decode( "utf8", $_[0] ) . "\n"; # return "$_\n"; }

The point is that perl still uses a single byte internally for characters in the original the Latin-1 range (from iso-8859-1), and even when perl has flagged a latin-1 string as being utf8 data, you should "encode" it into an external (true multibyte) form before sending it to the database.

(Try changing the "prep" and "deprep" subs in the test script, to skip the "encode" and "decode" calls, and you'll see a difference in the results.)

update: added "use utf8" to the test script, which did not change its behavior, but might help clarify the issue.

Replies are listed 'Best First'.
Re^2: DBI and encoding problem
by way (Sexton) on Feb 27, 2009 at 17:34 UTC

    Thank for your response, I yes changed the client connection to UTF-8, i update this post with my solution, based in yours response.