Re^2: substitution regex and unicode

Thank you, Joost. I now understand my mistake (which is already a big step forward)! When I connected to MySQL to retrieve the data, I didn't tell DBI to use {mysql_enable_utf8 => 1}:

$dbh = DBI->connect($datasource, $user, $passw, {mysql_enable_utf8 => 1});

If I understand correctly, Perl now already has everything it needs to work with Unicode strings and, consequently, with regexes. So I must delete the decode("utf8"...) line at the very end of my script and leave the statements that print the output as they are. Thank you again for pointing me to that link.

Re^3: substitution regex and unicode by ikegami (Patriarch) on May 03, 2008 at 10:45 UTC
You should be encoding what you print. The system file handle and HTTP can only deal with bytes, which means the characters must be converted from Perl's internal string format (as returned with `mysql_enable_utf8 => 1`) into bytes by encoding them: `print encode("UTF-8", "$row[0]\t$row[1]\n");` or `binmode(STDOUT, ':encoding(UTF-8)'); print "$row[0]\t$row[1]\n";`
Re^4: substitution regex and unicode by frasco (Beadle) on May 07, 2008 at 18:40 UTC
I understand what you mean and I tried it, but it doesn't work! On the contrary, if I leave my script untouched it works properly: the HTML page source (as shown by Firefox 2) shows the multibyte characters as they are, that is, it doesn't use the corresponding HTML entities. Sorry, I don't understand why I should encode the data again. Perhaps you mean that I should convert the data to bytes if I send them back to the database.

Here is the un-encoded output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>xxx</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
r.1,1 1 ʾŕ-da-um^túg-Ⅱ 1 aktum^túg 1 íb-iv^túg sa₆ dar
r.1,2 NI-ra-ar^ki
r.1,3 2 ʾŕ-da-um^túg-ii 1 ʾŕ-da-um^túg-i

And this is the encoded output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title>Progetto Sinleqiunnini</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
r.1,1 1 ĘžĂ -da-um^tĂşg-â…Ą 1 aktum^tĂşg 1 Ăb-iv^tĂşg saâ‚† dar
r.1,2 NI-ra-ar^ki
r.1,3 2 ĘžĂ -da-um^tĂşg-ii 1 ĘžĂ -da-um^tĂşg-i

Even if these lines are not intelligible (it is a third-millennium BC language), only the first example is correct.
Re^5: substitution regex and unicode by ikegami (Patriarch) on May 07, 2008 at 22:41 UTC
I never said you should encode data again. I said characters need to be encoded. Once a character is encoded, it becomes a series of bytes. I could elaborate, but it would help to know what I'm commenting on (i.e. to see the changes you've made to your program).
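
The difference between characters and bytes can be seen in a minimal sketch like the one below. It is not taken from the program discussed in this thread; the sample string and variable names are made up for illustration. It also shows what happens when bytes that are already UTF-8 get encoded a second time, which is one possible (unconfirmed) explanation for garbled output like the above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: a character string of three characters,
# the middle one being U+00FA (u with acute).
my $chars = "t\x{FA}g";

# Encoding turns characters into bytes: U+00FA becomes the two bytes C3 BA.
my $bytes = encode('UTF-8', $chars);

# Encoding the bytes again treats each byte as a character of its own
# and encodes it separately, so C3 and BA each become two bytes.
my $twice = encode('UTF-8', $bytes);

printf "characters:    %d\n", length $chars;
printf "encoded once:  %d bytes\n", length $bytes;
printf "encoded twice: %d bytes\n", length $twice;
```

Decoding `$bytes` with `decode('UTF-8', $bytes)` gives back the original character string, whereas `$twice` would have to be decoded twice to recover it.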