in reply to substitution regex and unicode

DBD::mysql does not return Unicode strings by default. You need version 4.004 or higher (earlier versions have serious Unicode bugs), and you must set the mysql_enable_utf8 option.

See also A UTF8 round trip with MySQL (and take note of the replies there).

Re^2: substitution regex and unicode
by frasco (Beadle) on May 03, 2008 at 10:09 UTC
    Thank you, Joost. I understand my mistake now (and that alone is a big step forward)! When retrieving data from MySQL, I did not tell DBI to use {mysql_enable_utf8 => 1}:
    $dbh = DBI->connect($datasource, $user, $passw, {mysql_enable_utf8 => 1});
    If I understand correctly, Perl now has everything it needs to work with Unicode strings and, consequently, with regexes. So I should delete the decode("utf8"...) line at the very end of my script and leave the statements that print the output alone. Thank you again for pointing me to that link.

      You should be *encoding* what you print. The system file handle and HTTP can only deal with bytes, which means the characters must be converted from Perl's internal string format (as returned by mysql_enable_utf8 => 1) into bytes by encoding them.

      print encode("UTF-8", "$row[0]\t$row[1]\n");
      or
      binmode(STDOUT, ':encoding(UTF-8)'); print "$row[0]\t$row[1]\n";
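
To make the bytes-versus-characters distinction above concrete, here is a minimal sketch that uses only the core Encode module. The sample string and the variable names are made up for illustration and are not taken from the original script. It also shows what happens when encode() is applied to data that is still a byte string, which is one way output can end up garbled.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the UTF-8 bytes for "túg", as they might arrive
# from a source that does not decode them for us.
my $bytes = "t\xC3\xBAg";

# decode() turns UTF-8 bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# encode() turns the character string back into UTF-8 bytes for output.
my $out = encode('UTF-8', $chars);

# Encoding a string that is still bytes treats every byte as a character,
# so each non-ASCII byte is expanded into two bytes.
my $twice = encode('UTF-8', $bytes);

printf "characters: %d\n", length $chars;   # 3
printf "bytes:      %d\n", length $out;     # 4
printf "re-encoded: %d\n", length $twice;   # 6
```

If the string handed to encode() has the same length as the encoded result even though it contains non-ASCII text, it was probably never decoded in the first place.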
        I understand what you mean and I tried it, but it doesn't work! On the contrary, if I leave my script untouched it works properly: the HTML page source (as shown by Firefox 2) displays the multibyte characters as they are, that is, without the corresponding HTML entities. Sorry, but I don't understand why I should encode the data again. Perhaps you mean that I should encode the data into bytes when I send them back to the database.
        This is the un-encoded output:
          <!DOCTYPE html
        	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        	 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
        <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
        <head>
        <title>xxx</title>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        </head>
        <body>
        

        r.1,1 1 ʾà-da-umtúg-Ⅱ 1 aktumtúg 1 íb-ivtúg sa₆ dar

        r.1,2 NI-ra-arki

        r.1,3 2 ʾà-da-umtúg-ii 1 ʾà-da-umtúg-i

        And this is the encoded output:
        <!DOCTYPE html
        	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        	 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
        <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
        <head>
        <title>Progetto Sinleqiunnini</title>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        </head>
        <body>
        

        r.1,1 1 ʾà -da-umtúg-Ⅱ 1 aktumtúg 1 íb-ivtúg sa₆ dar

        r.1,2 NI-ra-arki

        r.1,3 2 ʾà -da-umtúg-ii 1 ʾà -da-umtúg-i

        Even though these lines may not be intelligible (it is a language of the third millennium BC), only the first example is correct.