in reply to Re: substitution regex and unicode
in thread substitution regex and unicode

Thank you, Joost. I understand my mistake now (and that alone is a big step forward)! When I retrieved the data from MySQL, I didn't tell DBI to use {mysql_enable_utf8 => 1}:
$dbh = DBI->connect($datasource, $user, $passw, {mysql_enable_utf8 => 1});
If I understand correctly, Perl now has everything it needs to work with Unicode strings and, consequently, with regexes. So I must delete the decode("utf8"...) line at the very end of my script and leave the statements that print the output alone. Thank you again for pointing me to that link.

Re^3: substitution regex and unicode
by ikegami (Patriarch) on May 03, 2008 at 10:45 UTC

    You should be *encoding* what you print. The system file handle and HTTP can only deal with bytes, which means the characters must be converted from Perl's internal string format (as returned with mysql_enable_utf8 => 1) into bytes by encoding them.

    print encode("UTF-8", "$row[0]\t$row[1]\n");
    or
    binmode(STDOUT, ':encoding(UTF-8)'); print "$row[0]\t$row[1]\n";
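As a minimal sketch of the two options above: both should produce the same bytes, as long as the string is encoded exactly once on its way out. The @row contents here are made-up stand-ins for already-decoded database values, and in-memory file handles replace STDOUT so the result can be compared.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for a row fetched with mysql_enable_utf8 => 1:
# these are character strings, not UTF-8 bytes.
my @row = ("r.1,2", "t\x{FA}g");   # second field is "túg"

# Option 1: encode explicitly and print the bytes to a raw handle.
my $explicit = '';
open my $raw, '>', \$explicit or die "open: $!";
print {$raw} encode('UTF-8', "$row[0]\t$row[1]\n");
close $raw;

# Option 2: let an :encoding layer convert characters to bytes on output.
my $layered = '';
open my $out, '>:encoding(UTF-8)', \$layered or die "open: $!";
print {$out} "$row[0]\t$row[1]\n";
close $out;

print $explicit eq $layered ? "same bytes\n" : "different bytes\n";
```

Using both options together (an :encoding layer plus an explicit encode) would encode the data twice, so pick one.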
      I understand what you mean and I tried it, but it doesn't work! On the contrary, if I leave my script untouched it works properly: the HTML page source (as shown by Firefox 2) displays the multibyte characters as they are, that is, without the corresponding HTML entities. Sorry, I don't understand why I should encode the data again. Perhaps you mean that I should convert the data to bytes if I send them back to the database.
      This is the un-encoded output:
        <!DOCTYPE html
      	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      	 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
      <head>
      <title>xxx</title>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      </head>
      <body>
      

      r.1,1 1 ʾà-da-umtúg-Ⅱ 1 aktumtúg 1 íb-ivtúg sa₆ dar

      r.1,2 NI-ra-arki

      r.1,3 2 ʾà-da-umtúg-ii 1 ʾà-da-umtúg-i

      and this is the encoded output:
      <!DOCTYPE html
      	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      	 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
      <head>
      <title>Progetto Sinleqiunnini</title>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      </head>
      <body>
      

      r.1,1 1 ʾà -da-umtúg-Ⅱ 1 aktumtúg 1 íb-ivtúg sa₆ dar

      r.1,2 NI-ra-arki

      r.1,3 2 ʾà -da-umtúg-ii 1 ʾà -da-umtúg-i

      Even if these lines are not intelligible (it is a third-millennium BC language), only the first example is correct.

        I never said you should encode data *again*. I said *characters* need to be encoded. Once a character is encoded, it becomes a series of bytes.
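To illustrate the distinction, here is a minimal sketch. The sample string is made up, not taken from the script in this thread. It also shows one way output can come out garbled: calling encode on a string that already holds UTF-8 bytes. Whether that is what happened here is only an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: one Perl character per code point.
my $chars = "\x{E0}-da";               # "à-da"

# Encoding turns characters into bytes; U+00E0 becomes two bytes in UTF-8.
my $bytes = encode('UTF-8', $chars);

# Encoding the bytes again treats each byte as a Latin-1 character,
# so the two bytes of "à" grow to four (double encoding).
my $twice = encode('UTF-8', $bytes);

printf "characters:    %d\n", length $chars;   # 4
printf "encoded once:  %d\n", length $bytes;   # 5
printf "encoded twice: %d\n", length $twice;   # 7
```

Decoding $bytes with decode('UTF-8', $bytes) gives back the original four characters; decoding $twice once gives back only the bytes, not the characters.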

        I could elaborate, but it would help to know what I'm commenting on (i.e. to see the changes you've made to your program).