Re^3: substitution regex and unicode

You should be *encoding* what you print. The system file handle and HTTP can only deal with bytes, which means the characters must be converted from Perl's internal string format (which is what you get back with mysql_enable_utf8 => 1) into bytes by encoding them.

use Encode qw( encode );
print encode("UTF-8", "$row[0]\t$row[1]\n");

or

binmode(STDOUT, ':encoding(UTF-8)');
print "$row[0]\t$row[1]\n";
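To make the character/byte distinction concrete, here is a minimal sketch. The contents of @row are made up for illustration (they stand in for whatever the database returns with mysql_enable_utf8 => 1); only encode() from the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical row: decoded character strings, as assumed above.
my @row = ( "t\x{FA}g", "sa\x{2086}" );    # "túg", "sa₆"

my $chars = "$row[0]\t$row[1]\n";          # string of characters
my $bytes = encode( "UTF-8", $chars );     # string of bytes

printf "characters: %d\n", length $chars;  # characters: 8
printf "bytes:      %d\n", length $bytes;  # bytes:      11

binmode(STDOUT);    # raw handle, no encoding layer
print $bytes;       # already bytes, so nothing more to do
```

The string grows from 8 to 11 because U+00FA takes two bytes in UTF-8 and U+2086 takes three.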


Re^4: substitution regex and unicode by frasco (Beadle) on May 07, 2008 at 18:40 UTC
I understand what you mean, and I tried it, but it doesn't work! On the contrary, if I leave my script untouched it works properly: the HTML page source (as shown by Firefox 2) displays the multibyte characters as they are, that is, it doesn't use the corresponding HTML entities. Sorry, I don't understand why I should encode the data again. Perhaps you mean that I should convert the data to binary if I send it back to the database.

This is the un-encoded output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>xxx</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
r.1,1 1 ʾŕ-da-um^túg-Ⅱ 1 aktum^túg 1 íb-iv^túg sa₆ dar
r.1,2 NI-ra-ar^ki
r.1,3 2 ʾŕ-da-um^túg-ii 1 ʾŕ-da-um^túg-i

and this is the encoded output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>Progetto Sinleqiunnini</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
r.1,1 1 ĘžĂ -da-um^tĂşg-â…Ą 1 aktum^tĂşg 1 Ăb-iv^tĂşg saâ‚† dar
r.1,2 NI-ra-ar^ki
r.1,3 2 ĘžĂ -da-um^tĂşg-ii 1 ĘžĂ -da-um^tĂşg-i

Even if these lines are not intelligible (it is a third-millennium B.C. language), only the first example is correct.
Re^5: substitution regex and unicode by ikegami (Patriarch) on May 07, 2008 at 22:41 UTC
I never said you should encode data again. I said characters need to be encoded. Once a character is encoded, it becomes a series of bytes. I could elaborate, but it would help to know what I'm commenting on (i.e. to see the changes you've made to your program).
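
For illustration only, here is a sketch of what happens when bytes that are already UTF-8 get encoded a second time. Whether this is what is happening in your program is an assumption, since the modified code hasn't been shown. The variable names and the hex_bytes helper are made up for the example.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: show a byte string as hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $chars = "t\x{FA}g";                  # 3 characters: "túg"
my $once  = encode( "UTF-8", $chars );   # encoded once: correct bytes
my $twice = encode( "UTF-8", $once );    # each byte treated as a character

print hex_bytes($once),  "\n";   # 74 C3 BA 67
print hex_bytes($twice), "\n";   # 74 C3 83 C2 BA 67

# Decoding the double-encoded bytes once gives U+00C3 U+00BA
# where U+00FA should be, which is the usual mojibake pattern.
my $garbled = decode( "UTF-8", $twice );
print length($garbled), "\n";    # 4
```

The same doubling would occur if a string that was already encoded were printed to a handle that also has an :encoding(UTF-8) layer, so only one of the two approaches should be applied to any given string.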