in reply to Re^2: substitution regex and unicode
in thread substitution regex and unicode

You should be *encoding* what you print. The system file handle and HTTP can only deal with bytes, which means the characters must be converted from Perl's internal string format (as returned by mysql_enable_utf8 => 1) into bytes by encoding them.

use Encode qw(encode);
print encode("UTF-8", "$row[0]\t$row[1]\n");
or
binmode(STDOUT, ':encoding(UTF-8)');
print "$row[0]\t$row[1]\n";
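
To make the character/byte distinction concrete, here is a minimal sketch. The @row contents are made up for illustration and only stand in for what the database driver would return with mysql_enable_utf8 => 1; the rest uses the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical row (an assumption, not real data from the thread):
# decoded character strings, i.e. "r.1,2" and "túg sa₆".
my @row = ("r.1,2", "t\x{FA}g sa\x{2086}");

my $line  = "$row[0]\t$row[1]\n";    # a string of characters
my $bytes = encode("UTF-8", $line);  # the same text as UTF-8 bytes

# Option 1: encode explicitly and print the resulting bytes.
print $bytes;

# Option 2: let the file handle do the encoding instead.
# Use one option or the other, never both on the same handle.
# binmode(STDOUT, ':encoding(UTF-8)');
# print $line;
```

The encoded string is longer than the character string because ú takes two bytes and ₆ takes three in UTF-8.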

Re^4: substitution regex and unicode
by frasco (Beadle) on May 07, 2008 at 18:40 UTC
    I understand what you mean and I tried it, but it doesn't work! On the contrary, if I leave my script untouched it works properly: the HTML page source (as shown by Firefox 2) shows the multibyte characters as they are, that is, it doesn't use the corresponding HTML entities. Sorry, I don't understand why I should encode the data again. Perhaps you mean that I should convert the data to binary if I send them back to the database.
    This is the un-encoded output, followed by the encoded one:
      <!DOCTYPE html
    	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    	 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
    <head>
    <title>xxx</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    </head>
    <body>
    

    r.1,1 1 ʾà-da-umtúg-Ⅱ 1 aktumtúg 1 íb-ivtúg sa₆ dar

    r.1,2 NI-ra-arki

    r.1,3 2 ʾà-da-umtúg-ii 1 ʾà-da-umtúg-i

    And this is the encoded output:
    <!DOCTYPE html
    	PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    	 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
    <head>
    <title>Progetto Sinleqiunnini</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    </head>
    <body>
    

    r.1,1 1 ʾà -da-umtúg-Ⅱ 1 aktumtúg 1 íb-ivtúg sa₆ dar

    r.1,2 NI-ra-arki

    r.1,3 2 ʾà -da-umtúg-ii 1 ʾà -da-umtúg-i

    Even if these lines are not intelligible (it is a third-millennium BC language), only the first example is correct.

      I never said you should encode data *again*. I said *characters* need to be encoded. Once a character is encoded, it becomes a series of bytes.
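
      As a rough illustration of the difference between encoding characters and encoding already-encoded bytes, consider the sketch below. It assumes, without having seen the program, that a string might reach print already as UTF-8 bytes; in that case a second encode treats each byte as a character and produces garbage.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{E0}";                  # one character: à (U+00E0)
my $once  = encode("UTF-8", $chars);   # two bytes: C3 A0
my $twice = encode("UTF-8", $once);    # each byte re-encoded: C3 83 C2 A0

printf "%d character, %d bytes after one encode, %d bytes after two\n",
    length($chars), length($once), length($twice);
```

      Only the first encode is the step being recommended; the second one is shown purely to illustrate what "encoding again" would do.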

      I could elaborate, but it would help to know what I'm commenting on (i.e. to see the changes you've made to your program).