comment on

Do you execute this sql statement when you first establish your connection to the database?

  $dbh->do( "set names utf8" );
[download]

And where do the string values come from that you are using as parameters for the placeholders in your query? Are they flagged in your perl script as utf8 strings (because they come from a ":utf8"-mode file handle or from an Encode::decode("utf8",...) call)?

I'm actually grateful to you for posing this question, because it led me to realize that perl's "special" handling of unicode characters in the \x{80} - \x{ff} range seems to require that you do some "special" handling when passing strings to mysql. Here's a test script:

#!/usr/bin/perl

use strict;
use warnings;
use Encode;
use DBI;
use utf8;  # added by update, lest there be any doubt about this

my $DSN = "DBI:mysql:database=test;host=localhost";
my $db = DBI->connect( $DSN, "", "", { AutoCommit => 1, RaiseError => 
+0 } )
    or die $DBI::errstr;

my @setup = (
    'SET NAMES utf8',
    'DROP TABLE IF EXISTS widechar_test',
    'CREATE TABLE widechar_test (string varchar(50)) DEFAULT CHARACTER
+ SET utf8',
    );
for my $s ( @setup ) {
    $db->do( $s ) or die $db->errstr;
}

my $sth = $db->prepare( "INSERT INTO widechar_test (string) VALUES (?)
+" )
    or die $db->errstr;

my $latin = "\x{e7}\x{e1}";
my $greek = "\x{3a0}\x{397}\x{3a1}\x{39b}";

printf( " char lengths: %d latin, %d greek\n",
    length( $latin ), length( $greek ));

$sth->execute( prep( $latin )) or warn "latin char insert failed: ".$s
+th->errstr;
$sth->execute( prep( $greek )) or warn "greek char insert failed: ".$s
+th->errstr;
{
    use bytes;
    printf( " byte lengths: %d latin, %d greek\n",
        length( $latin ), length( $greek ));
    $sth->execute( prep( $latin )) or warn "latin byte insert failed: 
+".$sth->errstr;
    $sth->execute( prep( $greek )) or warn "greek byte insert failed: 
+".$sth->errstr;
}
$sth->finish;

binmode STDOUT, ":utf8";
my $rows = $db->selectall_arrayref( "select * from widechar_test" );
for my $row ( @$rows ) {
    print deprep( $$row[0] );
}

sub prep {
    return encode( "utf8", $_[0] );
#    return $_[0];
}

sub deprep {
    return decode( "utf8", $_[0] ) . "\n";
#    return "$_\n";
}
[download]

The point is that perl still uses a single byte internally for characters in the original the Latin-1 range (from iso-8859-1), and even when perl has flagged a latin-1 string as being utf8 data, you should "encode" it into an external (true multibyte) form before sending it to the database.

(Try changing the "prep" and "deprep" subs in the test script, to skip the "encode" and "decode" calls, and you'll see a difference in the results.)

update: added "use utf8" to the test script, which did not change its behavior, but might help clarify the issue.

In reply to Re: DBI and encoding problem by graff
in thread DBI and encoding problem by way

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.