utf-8 problems

Sly_G has asked for the wisdom of the Perl Monks concerning the following question:

Using Russian (non-ASCII) characters (a 7-letter word):

use utf8;
use open OUT => ':utf8';
use DBI;
my $dbh = DBI->connect("DBI:mysql:database=mybase;host=localhost;port=3306", "login", "pass");
#$dbh->do('SET CHARACTER SET utf8');

open TST, '>utftest1.txt';
binmode TST;
print TST "русский";

$test = $dbh->selectrow_array("SELECT 'русский'");
open TST, '>utftest2.txt';
binmode TST;
print TST $test;

Result: the file utftest1.txt contains 14 bytes, and I can see the word in it with any text editor (7 characters x 2 bytes, which makes sense). The file utftest2.txt contains 28 bytes of I don't know what:

0000000000: C3 91 C2 80 C3 91 C2 83 | C3 91 C2 81 C3 91 C2 81
0000000010: C3 90 C2 BA C3 90 C2 B8 | C3 90 C2 B9

Uncommenting the `SET CHARACTER SET utf8` line changes nothing.


Re: utf-8 problems by moritz (Cardinal) on Jan 09, 2012 at 16:05 UTC
You should pass `{mysql_enable_utf8 => 1}` to the constructor, and encode the output when writing to utftest2.txt: `open TST, '>:encoding(UTF-8)', 'utftest2.txt';` See also: A UTF8 round trip with MySQL, which explains all the details and contains a worked example.
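
Put together, the two changes suggested above might look roughly like the following. This is only a sketch: the connection string and credentials are the placeholders from the question, `RaiseError` is an addition for error reporting that the original code did not use, and it assumes a reachable MySQL server with DBD::mysql installed.

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are UTF-8
use DBI;

# Placeholder connection details, copied from the question.
my $dbh = DBI->connect(
    'DBI:mysql:database=mybase;host=localhost;port=3306',
    'login', 'pass',
    {
        RaiseError        => 1,    # assumption: die on DB errors instead of checking each call
        mysql_enable_utf8 => 1,    # ask DBD::mysql to treat strings from the server as UTF-8 text
    },
);

my ($word) = $dbh->selectrow_array("SELECT 'русский'");

# The :encoding(UTF-8) layer turns Perl characters into UTF-8 bytes on output.
open my $out, '>:encoding(UTF-8)', 'utftest2.txt'
    or die "Cannot open utftest2.txt: $!";
print {$out} $word;
close $out or die "Cannot close utftest2.txt: $!";
```

If this works as intended, utftest2.txt should come out at the same 14 bytes as utftest1.txt.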
Re: utf-8 problems by mbethke (Hermit) on Jan 09, 2012 at 21:48 UTC
What Moritz said. For an explanation of what happens: your UTF-8 goes into the DB fine, but when it comes out, Perl treats its bytes as Latin-1 and re-encodes them to UTF-8. As the bytes all fall into the upper half of the 8-bit character set, the UTF-8 representation is twice as long.
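
The doubling can be reproduced without a database. The sketch below uses only the core Encode module; it imitates the Latin-1 misreading explicitly with `decode('ISO-8859-1', ...)`, which is an illustration of the effect rather than the exact code path inside DBI or the output layer.

```perl
use strict;
use warnings;
use utf8;                                # the literal below is UTF-8 source text
use Encode qw(encode decode);

my $word  = "русский";                   # 7 characters
my $bytes = encode('UTF-8', $word);      # 14 bytes, as in utftest1.txt

# Misread the 14 bytes as 14 Latin-1 characters, then encode to UTF-8 again.
my $double = encode('UTF-8', decode('ISO-8859-1', $bytes));

printf "%d characters, %d bytes, %d bytes double-encoded\n",
    length $word, length $bytes, length $double;

# Hex dump of the first four double-encoded bytes.
print join(' ', map { sprintf '%02X', ord } split //, substr($double, 0, 4)), "\n";

# Undoing one layer of UTF-8 recovers the original byte string.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $double));
print $repaired eq $bytes ? "repaired\n" : "mismatch\n";
```

This prints `7 characters, 14 bytes, 28 bytes double-encoded`, then `C3 91 C2 80` (the first four bytes of the hex dump in the question), then `repaired`.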