wes has asked for the wisdom of the Perl Monks concerning the following question:

I've had problems making DBI calls to a SQL 2005 DB with UTF-8 params. I'm on Windows. After some troubleshooting, I found that the strings seemed to be encoded correctly throughout the whole call chain, from the Ajax call to my script and back to the browser. The only place that had issues was the DB, and I couldn't figure out why.

A friend suggested that I take a hash of the search params before and after the utf8 calls I was making, since it's difficult to tell whether my strings are encoded right or whether it's the viewer fixing things for me. So here's where it gets weird: taking the hash fixes my issues. DB calls work and all is well. This doesn't work:
    my $search = $cgi->param("query") . '%';
    utf8::decode($search);
    # database calls here...
But this does:
    my $search = $cgi->param("query") . '%';
    use Digest::MD5 qw(md5_hex);
    utf8::decode($search);
    md5_hex($search);
    # database calls here...
Switching to Digest::Perl::MD5 doesn't have the same effect; the DB calls still fail. Could it have something to do with passing the string to XS? Is there a Better Way to do this?

Replies are listed 'Best First'.
Re: UTF spooky action
by ikegami (Patriarch) on Apr 17, 2009 at 23:32 UTC

    md5_hex expects each input character to be a byte (ord = 0..255), whether internally encoded as UTF-8 or not. That's fine.

    • One can only get the MD5 of bytes.
    • Internal encoding shouldn't matter.

    For simplicity and efficiency, if the internal encoding is UTF-8, md5_hex converts these characters into actual bytes using utf8::downgrade or something equivalent.

    • "\311" (PV="\303\211", UTF8=1)
      becomes the equivalent
      "\311" (PV="\311", UTF8=0).
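The downgrade described above can be reproduced in isolation. This is only a minimal sketch using core Perl functions (utf8::upgrade, utf8::downgrade, utf8::is_utf8); the variable names are made up for illustration, and it does not touch the database at all. It shows that the string compares equal before and after, while the internal UTF8 flag changes.

```perl
use strict;
use warnings;

# One character, U+00C9, stored as the single byte \311.
my $s = "\311";

# Force the internal UTF-8 representation (PV="\303\211", UTF8=1).
utf8::upgrade($s);
my $flag_before = utf8::is_utf8($s) ? 1 : 0;

# Keep a copy in the upgraded form for comparison.
my $copy = $s;

# Convert back to one byte per character (PV="\311", UTF8=0).
# This is the kind of side effect md5_hex appears to have on its argument.
utf8::downgrade($s);
my $flag_after = utf8::is_utf8($s) ? 1 : 0;

print "UTF8 flag before: $flag_before, after: $flag_after\n";
print "strings equal: ", ($s eq $copy ? "yes" : "no"), "\n";
print "length in characters: ", length($s), "\n";
```

At the Perl level the two forms are the same string, so code that behaves differently depending on the flag (as the database interface here seems to) is looking at the internal buffer rather than the characters.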

    Since downgrading the string is what makes the call succeed, your database or database interface (as used) expects bytes, so it expects your text to be *encoded*.

    • Which encoding?
    • Can the interface be configured to accept arbitrary Unicode text?
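If the interface turned out to want encoded bytes, one way to be explicit about it would be to encode the text yourself rather than rely on a side effect of md5_hex. The sketch below uses the core Encode module and assumes UTF-8 is the expected encoding, which the thread has not established; the sample input and variable names are hypothetical, and the database call itself is left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw parameter as it might arrive from CGI: UTF-8 bytes
# for "caf\x{E9}" followed by the SQL wildcard.
my $raw = "caf\xC3\xA9" . '%';

# Decode the bytes into characters for any text processing in Perl.
my $text = decode('UTF-8', $raw);

# Encode explicitly before handing the value to a byte-oriented interface.
# ASSUMPTION: the interface expects UTF-8; substitute the real encoding.
my $bytes = encode('UTF-8', $text);

printf "characters: %d, bytes: %d\n", length($text), length($bytes);
# $bytes would then be passed as the bind parameter.
```

Whether this or an interface-level setting is the better fix depends on the answers to the two questions above.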

      Thanks for the help, ikegami. utf8::downgrade had the same effect.
      ADO can be set up to accept UTF-8 through Win32::OLE. I had thought that my script was already set up like that, but when I looked at where it was declared, I saw that it was done in some old, probably deprecated way. Setting Win32::OLE->Option( CP => Win32::OLE::CP_UTF8 ) seems to solve all my problems. Thanks again!