I've now had a chance to look at this more closely, as I was a bit surprised you had to do this.
The main issue is that you are using varchar columns rather than nvarchar columns, and you did not tell us how the data got into the database in the first place. Let's assume the data was inserted into the database with DBD::ODBC in the first place and see what happens:
use strict;
use warnings;
use DBI qw(:utils :sql_types);
use 5.016;
use Data::Dumper;
use Encode;
my $h = DBI->connect() or die $DBI::errstr;
$h->{RaiseError} = 1;
eval {
    $h->do(q/drop table chinese/);
};
$h->do(q/create table chinese (n varchar(30), d varchar(12))/);
my $str = "\x{20020}\x{24202}\x{31185}\x{23460}";
my $encoded = encode('UTF-8', $str);
foreach my $s (split(//, $encoded)) {
    printf("%x,", ord($s));
}
say "";
say data_string_desc($str);
my $s = $h->prepare(q/insert into chinese values(?,?)/);
$s->execute($str, "111");
$s = $h->prepare(q/select cast(n as varbinary) from chinese/);
$s->execute;
my $r = $s->fetchall_arrayref;
foreach my $s (split(//, $r->[0][0])) {
    printf("%x,", ord($s));
}
say "";
say data_string_desc($r->[0][0]);
my $d = decode('UTF-8', $r->[0][0]);
say data_string_desc($d);
print Dumper(\$d);
which outputs:
f0,a0,80,a0,f0,a4,88,82,f0,b1,86,85,f0,a3,91,a0,
UTF8 on, non-ASCII, 4 characters 16 bytes
f0,a0,80,a0,f0,a4,88,82,f0,b1,86,85,f0,a3,91,a0,
UTF8 off, non-ASCII, 16 characters 16 bytes
UTF8 on, non-ASCII, 4 characters 16 bytes
$VAR1 = \"\x{20020}\x{24202}\x{31185}\x{23460}";
When I look under the hood at which ODBC APIs DBD::ODBC is calling, I see that DBD::ODBC took the 4 Chinese characters that are UTF-8 encoded in Perl (see the first two lines of output), converted them to UTF-16, and bound them as SQL_WCHARs. Note that varchar columns are not really designed to store Unicode data. If we cast what SQL Server thinks it has in the column to varbinary and read it back, we miraculously get the Chinese string back, UTF-8 encoded (the 3rd and 4th lines of output). Unsurprisingly, we can decode that UTF-8 back into a Perl string (the last 2 lines of output). In other words, SQL Server does not know this is UTF-8 encoded data, so length functions and collation will not work.
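You can see the length mismatch without a database at all. Perl sees 4 characters, but the varchar column holds the 16 UTF-8 octets, so server-side functions such as LEN(n) would count 16, not 4 (a standalone sketch; the claim about what LEN would report is an inference from the bytes shown above):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $str = "\x{20020}\x{24202}\x{31185}\x{23460}";

# Perl counts characters:
print length($str), "\n";                    # 4

# SQL Server, treating the stored UTF-8 octets as single-byte
# characters, would count one per octet:
print length(encode('UTF-8', $str)), "\n";   # 16
```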
If we rewrite the select as a simple "select * from chinese" we seem to get back rubbish. What has happened here is that DBD::ODBC bound what it believed to be char data as SQL_WCHARs, SQL Server passed back a UTF-16 encoded representation of the 16 bytes, and DBD::ODBC decoded that, so the data is now double-encoded UTF-8.
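The double encoding can be simulated in pure Perl. Here I use latin-1 as a stand-in for the column's single-byte codepage (a simplification; a real column would typically use something like cp1252), to model SQL Server handing back each stored octet as a separate character:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $str   = "\x{20020}\x{24202}\x{31185}\x{23460}";
my $bytes = encode('UTF-8', $str);    # the 16 octets in the varchar column

# SQL Server treats the 16 octets as 16 single-byte characters and
# DBD::ODBC hands them back as a Perl string of those 16 code points:
my $mojibake = decode('latin-1', $bytes);

# Encoding that string again produces double-encoded UTF-8:
my $double = encode('UTF-8', $mojibake);

# Undoing the damage takes two decodes (with a re-encode in between
# to get back to octets):
my $recovered = decode('UTF-8', encode('latin-1', decode('UTF-8', $double)));
print $recovered eq $str ? "recovered\n" : "broken\n";
```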
If we had changed the code above to bind the input data as SQL_CHARs, SQL Server would have just put a load of question marks in the column, as it cannot do what you want.
The correct way to do this with SQL Server is to make the column nvarchar; then everything will just work (except collations when characters do not fit into UCS-2, which was not sorted out until SQL Server 2012; see Inserting unicode characters > 0xFFFF (surrogate pairs) into MS SQL Server with Perl DBD::ODBC for why).
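The UCS-2 caveat matters for this exact example: all four characters are outside the Basic Multilingual Plane, so each one needs a surrogate pair (two 16-bit code units) in the UTF-16 form that nvarchar stores. A quick standalone check:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $str = "\x{20020}\x{24202}\x{31185}\x{23460}";

# Each code point is above U+FFFF, so its UTF-16 encoding is two
# 16-bit code units (4 bytes) rather than one:
for my $c (split //, $str) {
    printf "U+%X needs %d UTF-16 code units\n",
        ord($c), length(encode('UTF-16LE', $c)) / 2;
}
```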